
Yeah, I'm not the one who generated the URL list, but I've also been getting a lot of pages without a downloadable document. I'm going to start on one of the URL lists posted here soon.

Alrighty, I'm currently in the middle of the archive.org upload, but I can transfer the chunks I already have over to a different machine and do it there with a new IP.

age gate > page not found

I messaged you on the other site; I'm currently getting a "Could not determine Content-Length (got None)" error.
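
In case it's useful while you wait: that error usually means the server isn't sending a Content-Length header at all and the script is treating the missing header as fatal. A minimal workaround sketch, assuming a Python requests-based downloader (the function name, URL, and filename are placeholders, not the actual script):

```python
import requests

def fetch(url, dest):
    # Stream the body to disk; don't assume Content-Length is present,
    # since chunked responses legitimately omit it.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        if resp.headers.get("Content-Length") is None:
            print(f"no Content-Length for {url}; streaming anyway")
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

fetch("https://example.org/some/document", "document.bin")  # placeholder URL
```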

No worries, thank you!
edit: I'll start on that URL list (randomized) tomorrow; my run from the previously generated URL list is still going (currently 75.6k files).

this method is not working for me anymore

I'm waiting for /u/Kindly_District9380's version, but I've been slowly working backwards on this in the meantime: https://archive.org/details/dataset9_url_list

I've got that one too; maybe we should compare dataset 12 versions as well.

I'm using a partial download I already had rather than the 48 GB version, but I will be gathering as many chunks as I can as well. Thanks for making this!

I'll get the first set (42k files, 31 GB) uploading as soon as I get it zipped up. It's the one least likely to have any new files in it, since I started at the beginning like others did, but it's worth a shot.
edit 01FEB2026 12:08 AM EST - 6.4/30 GB uploaded to archive.org
edit 01FEB2026 04:30 AM EST - 13/30 GB uploaded to archive.org; the scrape using a different URL set, going backwards, is currently at 75.4k files
edit 01FEB2026 12:33 PM EST - had an internet outage overnight and lost all progress on the archive.org upload; currently back to 11/30 GB. The scrape using a previous URL set seems to be getting very few new files now, sitting at 77.9k at the moment.
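
Side note on the lost upload progress: the internetarchive Python package (the same tooling behind the ia CLI) can checksum-skip files that already made it to the item, so an outage only costs you the file that was in flight. A rough sketch; the item identifier, paths, and metadata here are made up, not the actual upload:

```python
from internetarchive import upload

# Assumes credentials are already set up with `ia configure`.
# With checksum=True, re-running after an outage skips any chunk whose
# md5 already matches what's on the item instead of re-uploading it.
upload(
    "dataset9-partial-chunks",  # placeholder item identifier
    files=["chunks/part000.zip", "chunks/part001.zip"],  # placeholder paths
    metadata={"title": "dataset 9 partial scrape", "mediatype": "data"},
    checksum=True,
    retries=5,
    verbose=True,
)
```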

Maybe archive.org? That way they can be torrented if others want to attempt their own merging techniques. Either way it will be a long upload; my speed is not especially good. I'm still churning through one set of URLs that is 1.2M lines; most are failing, but I have 65k files from that batch so far.

Looking forward to your torrent; will seed.
I have several incomplete sets of files from dataset 9 that I downloaded with a scraped set of URLs. Should I try to get them to you to compare as well?
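
If it helps with the comparison, something like this is all I had in mind for checking overlap between two partial sets: hash everything under each directory and diff the sets (directory names are placeholders):

```python
import hashlib
from pathlib import Path

def file_hash(path, chunk=1 << 20):
    # Hash in chunks so large files don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def hashes(root):
    # Map sha256 digest -> path for every file under root.
    return {file_hash(p): p for p in Path(root).rglob("*") if p.is_file()}

mine = hashes("dataset9_mine")      # placeholder directory names
theirs = hashes("dataset9_theirs")
print(f"{len(mine.keys() - theirs.keys())} files only in mine, "
      f"{len(theirs.keys() - mine.keys())} files only in theirs")
```
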
yep, impossible to know