
Awesome, I don’t really understand what’s happening, but I’m also running it (also doing it for the presumably identical 48GB torrent, but I’m supposed to do that, right?)

I’ve been checking your URLs, but it seems a lot of them don’t have a downloadable document attached?

Would still love to help from my PC on dataset 9 specifically. Any way we can exchange progress so I don’t start downloading files you’ve already downloaded?
E: just started scraping from page 18330 (since you mentioned you ended around 18333), hoping I can fill in the remaining 4000-ish pages.
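For transparency, this is roughly what my scrape loop looks like; the ?page=N pagination, the listing URL, and the .pdf link filter are my guesses at what the original script does, so adjust to match it:

```python
# Rough sketch of the page-range scrape; BASE, the ?page=N pattern,
# and the .pdf filter are assumptions about the actual script.
import time
import requests
from bs4 import BeautifulSoup

BASE = "https://www.justice.gov/EXAMPLE-LISTING"  # placeholder listing URL
START, END = 18330, 20500

found = set()
for page in range(START, END + 1):
    resp = requests.get(BASE, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.select("a[href]"):
        if a["href"].lower().endswith(".pdf"):  # assumed: docs are PDFs
            found.add(a["href"])
    time.sleep(1)  # throttle; the duplicate pages suggest the server is struggling

with open("urls_18330_20500.txt", "w") as f:
    f.write("\n".join(sorted(found)))
```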
Update 2 (17:15 UTC): just finished scraping up to the page-20500 limit you set in the code. There are 0 new files in the 18330-20500 range compared to the ones you already found. So unless I did something wrong, either your list is complete or the DOJ has been scrambling their shit (given the large number of duplicate pages, I’m going with the second explanation).
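(In case anyone wants to double-check me, the comparison was just a set difference between my list and the one already posted; file names below are placeholders for whatever lists we end up exchanging.)

```python
# "0 new files" check: set difference between the two scraped URL lists.
with open("urls_18330_20500.txt") as f:       # placeholder: my list
    mine = {line.strip() for line in f if line.strip()}
with open("urls_existing.txt") as f:          # placeholder: the posted list
    existing = {line.strip() for line in f if line.strip()}

new = mine - existing
print(f"{len(new)} URLs not already in the existing list")  # 0 for me
```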
Either way, I’m gonna extract the 48GB and 100GB torrent directories now and try to record which of the files already exist within those torrents, so we can make an (intermediate) list of which files are still missing from them.
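Sketch of how I plan to do that matching; it assumes the torrents preserve the original file names from the URLs, and the directory/file names here are placeholders:

```python
# Diff the scraped URL list against the extracted torrent contents,
# matching on bare file names (assumes the torrents preserve them).
import os
from urllib.parse import urlparse

torrent_names = set()
for root in ("extracted_48gb", "extracted_100gb"):  # placeholder dirs
    for _dirpath, _dirs, files in os.walk(root):
        torrent_names.update(files)

with open("all_urls.txt") as f:  # placeholder: the combined URL list
    urls = [line.strip() for line in f if line.strip()]

missing = [u for u in urls
           if os.path.basename(urlparse(u).path) not in torrent_names]

with open("missing_from_torrents.txt", "w") as f:
    f.write("\n".join(missing))
print(f"{len(missing)} of {len(urls)} URLs not found in either torrent")
```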
Nice. Kinda feeling like we can’t be sure whether our URL lists will ever be exhaustive, or whether the DOJ might just let a large part of the dataset go dark.