
ADD MY DISCORD FOR MORE DISCUSSION: redbarinternet

[Screenshots: scraped URLs info for Datasets 9, 10, and 11]

I also fully believe that dataset 9 is being throttled. It's fragmented, it's broken, the indexing is wrong, etc.

Dataset 9 is cooked:


Anyone got Discord? I have been scraping the website and collecting all the links. There isn't even 1 million links available. I have a full dashboard for it:

There's nowhere close to 3.5 million files even in the full link collection, and I have scraped all possible links.
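For anyone curious, the collection loop is nothing fancy. Here's a rough sketch of the idea in Python; the base URL, the page parameter, and the link selector are placeholders, since the real listing markup is obviously different:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.gov/dataset-9"   # placeholder, not the real site

seen_links = set()   # dedupe links across listing pages
page = 1
while True:
    resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # placeholder selector; grab every document link on this listing page
    links = {a["href"] for a in soup.select("a.document-link[href]")}
    if not links:
        break                      # ran out of listing pages

    seen_links.update(links)
    page += 1

print(f"Collected {len(seen_links)} unique links")
```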

The current streak is how many pages in a row on dataset 9 it scraped before finding a new page. My threshold for stopping is set at 4000 duplicate pages in a row.
Why?

Yes, this means what you think. Each streak represents at least 5 separate instances where there were more than 400 duplicate pages before new data was found. These are unique instances, so at one point you would have gone through 436 pages before a new one, another time 816 pages before a new one, and so on.
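The streak logic itself is simple. Here's a minimal sketch of the duplicate counter and the 4000-page stop threshold; the function and variable names are made up for illustration, not my actual scraper:

```python
STOP_THRESHOLD = 4000   # stop after this many duplicate pages in a row

def crawl_pages(pages, seen):
    """Track the duplicate-page streak while walking listing pages.

    `pages` yields the set of links found on each page; `seen` is the
    running set of links already collected.
    """
    streak = 0
    for page_links in pages:
        new_links = page_links - seen
        if new_links:
            seen.update(new_links)
            streak = 0              # new data resets the streak
        else:
            streak += 1             # another duplicate page in a row
            if streak >= STOP_THRESHOLD:
                print(f"Stopping: {streak} duplicate pages in a row")
                break
    return seen
```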
Total counts, based on the links available at the time my database tracked them:
This indicates we have the potential to download ~900k files out of the current document range, which should run from 1 to 2,731,783.
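The ~900k figure just comes from comparing the distinct document IDs my tracker has seen against the stated 1 to 2,731,783 range. Something like this sketch, where the database file, table, and column names are placeholders:

```python
import sqlite3

EXPECTED_MAX_ID = 2_731_783   # stated document range: 1 to 2,731,783

# placeholder schema: a table of scraped links with the document ID pulled out
conn = sqlite3.connect("scrape_tracker.db")
(distinct_docs,) = conn.execute(
    "SELECT COUNT(DISTINCT doc_id) FROM scraped_links"
).fetchone()

coverage = distinct_docs / EXPECTED_MAX_ID
print(f"{distinct_docs:,} of {EXPECTED_MAX_ID:,} expected documents ({coverage:.1%})")
```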
I'll check it out. I was mostly working to scrape all the links since there was no direct download, and then downloading those. So I mostly did data collection on what was there first.