
Coming up with that right now; check my comment below.
I made this script and am slowly about to finish the crawl; it's close to 20k+ pages.
It should be done within 1-2 hours, and I will upload it to Archive.org.
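For context, here is a minimal sketch of what a listing-page crawl like this could look like, assuming Python with requests and BeautifulSoup. The listing URL pattern, the link filter, and the output filename are placeholders, not the actual script:

```python
# Hypothetical sketch of the page crawl, not the original script.
import json
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://www.justice.gov/some-listing-path?page={}"  # placeholder URL pattern
LAST_PAGE = 18333  # the index reportedly spans pages 0-18333

def crawl(out_path: str = "dataset9_index.json") -> None:
    urls = set()
    for page in range(LAST_PAGE + 1):
        resp = requests.get(BASE.format(page), timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep every link that looks like a document or media file; the filter is a guess.
        for a in soup.select("a[href]"):
            href = a["href"]
            if href.lower().endswith((".pdf", ".mp4", ".m4a", ".xlsx")):
                urls.add(urljoin(resp.url, href))
        time.sleep(1)  # stay polite: roughly one request per second
    with open(out_path, "w") as fh:
        json.dump(sorted(urls), fh, indent=2)

if __name__ == "__main__":
    crawl()
```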

Superb, I have datasets 1-8 and 11-12.
Only 10 remains to complete (downloading from Archive.org now).
Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
My current estimate of the files list is in the Dataset 9 update below.
Your merged 45 GB + 86 GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference them with my scraped URL list to find any gaps.
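A rough sketch of that cross-reference step, assuming the index is a flat JSON list of URLs and the torrents are already extracted on disk; the paths and function names are made up for illustration:

```python
# Hypothetical gap check: which indexed URLs have no matching file in the torrent dump?
import json
import os
from urllib.parse import urlparse

def torrent_filenames(root: str) -> set:
    # Collect every file name found anywhere under the extracted torrent directory.
    names = set()
    for _dirpath, _dirs, files in os.walk(root):
        names.update(files)
    return names

def missing_files(index_json: str, torrent_root: str) -> list:
    with open(index_json) as fh:
        urls = json.load(fh)
    have = torrent_filenames(torrent_root)
    # Treat a URL as missing if its basename does not appear in the torrent dump.
    return [u for u in urls if os.path.basename(urlparse(u).path) not in have]

if __name__ == "__main__":
    gaps = missing_files("dataset9_index.json", "/mnt/epstein-doj-2026-01-30/DataSet_9_files")
    print(f"{len(gaps)} indexed URLs not found in the torrent dump")
```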
UPDATE DATASET 9 Files List:
Progress:
Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of the index).
Uploaded my URL index to Archive.org: 529K file URLs in JSON format, in case anyone wants to help download the remaining files.
Link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7 MB JSON file available for download.
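For anyone pitching in on the downloads, a hedged sketch of how the JSON index could be consumed, assuming it is a flat list of direct file URLs (adjust if the real structure differs); it skips files already on disk so runs are resumable:

```python
# Hypothetical downloader for the published URL index.
import json
import os
import time
from urllib.parse import urlparse

import requests

def download_from_index(index_json: str, dest_dir: str) -> None:
    os.makedirs(dest_dir, exist_ok=True)
    with open(index_json) as fh:
        urls = json.load(fh)
    for url in urls:
        name = os.path.basename(urlparse(url).path) or "index.html"
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already downloaded, skip
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(target, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        time.sleep(1)  # keep the request rate gentle

if __name__ == "__main__":
    download_from_index("dataset9_index.json", "DataSet_9_files")
```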
UPDATE Dataset Size Sanity Check:
Dataset Report
Generated: 2026-01-31T23:28:29.198691
Base Path: /mnt/epstein-doj-2026-01-30
| Dataset | Files | Extracted | ZIP | Types |
|---|---|---|---|---|
| DataSet_1 | 6,326 | 2.48 GB | 1.23 GB | .pdf, .opt, .dat |
| DataSet_1_incomplete | 3,158 | 1.24 GB | N/A | .pdf, .opt, .dat |
| DataSet_2 | 577 | 631.66 MB | 630.79 MB | .pdf, .dat, .opt |
| DataSet_3 | 69 | 598.51 MB | 595.00 MB | .pdf, .dat, .opt |
| DataSet_4 | 154 | 358.43 MB | 351.52 MB | .pdf, .opt, .dat |
| DataSet_5 | 122 | 61.60 MB | 61.48 MB | .pdf, .dat, .opt |
| DataSet_6 | 15 | 53.02 MB | 51.28 MB | .pdf, .opt, .dat |
| DataSet_7 | 19 | 98.29 MB | 96.98 MB | .pdf, .dat, .opt |
| DataSet_8 | 11,042 | 10.68 GB | 9.95 GB | .pdf, .mp4, .xlsx |
| DataSet_9_files | 35,480 | 40.44 GB | 45.63 GB | .pdf, .mp4, .m4a |
| DataSet_9_45GB_unique | 28 | 84.18 MB | N/A | .pdf, .dat, .opt |
| DataSet_9_extracted | 531,256 | 94.51 GB | N/A | |
| DataSet_9_45GB_extracted | 201,357 | 47.45 GB | N/A | .pdf, .dat, .opt |
| DataSet_10_extracted | 504,030 | 81.15 GB | 78.64 GB | .pdf, .mp4, .mov |
| DataSet_11 | 14,045 | 1.17 GB | 25.56 GB | |
| DataSet_12 | 154 | 119.89 MB | 114.09 MB | .pdf, .dat, .opt |
| TOTAL | 1,307,832 | 281.07 GB | 162.87 GB | |
Here is a little script that can generate the above report, if your directory is laid out something like this:
# Minimum working example:
my_directory/
├── DataSet_1/
│   └── (any files)
├── DataSet_2/
│   └── (any files)
└── DataSet 2.zip (optional - will be matched)
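A minimal sketch of such a report script, assuming Python; the size formatting, the three-extension summary, and the zip-matching rule are my best guesses at reproducing the table above, not the original code:

```python
#!/usr/bin/env python3
"""Hypothetical report generator: scans DataSet_* directories and prints a markdown table."""
import os
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path

def human(nbytes: int) -> str:
    # Format a byte count as MB or GB, roughly matching the table above.
    if nbytes >= 1024 ** 3:
        return f"{nbytes / 1024 ** 3:.2f} GB"
    return f"{nbytes / 1024 ** 2:.2f} MB"

def scan_dir(path: Path):
    # Walk one dataset directory: count files, sum sizes, tally the top extensions.
    count, size, exts = 0, 0, Counter()
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            try:
                size += fp.stat().st_size
            except OSError:
                continue  # unreadable file, skip it
            count += 1
            if fp.suffix:
                exts[fp.suffix.lower()] += 1
    top = ", ".join(ext for ext, _ in exts.most_common(3))
    return count, size, top

def main(base: Path) -> None:
    print("Dataset Report")
    print(f"Generated: {datetime.now().isoformat()}")
    print(f"Base Path: {base}\n")
    print("| Dataset | Files | Extracted | ZIP | Types |")
    print("|---|---|---|---|---|")
    total_files = total_size = total_zip = 0
    for d in sorted(p for p in base.iterdir() if p.is_dir()):
        count, size, top = scan_dir(d)
        # Look for a sibling zip named like the directory ("DataSet_2.zip" or "DataSet 2.zip").
        zip_size = None
        for cand in (base / f"{d.name}.zip", base / f"{d.name.replace('_', ' ')}.zip"):
            if cand.exists():
                zip_size = cand.stat().st_size
                break
        zip_col = human(zip_size) if zip_size is not None else "N/A"
        print(f"| {d.name} | {count:,} | {human(size)} | {zip_col} | {top} |")
        total_files += count
        total_size += size
        total_zip += zip_size or 0
    print(f"| TOTAL | {total_files:,} | {human(total_size)} | {human(total_zip)} | |")

if __name__ == "__main__":
    main(Path(sys.argv[1]) if len(sys.argv) > 1 else Path("."))
```

Point it at the base directory (e.g. `python report.py /mnt/epstein-doj-2026-01-30`) and it prints the markdown table to stdout.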
Hey, sorry, I got super distracted building a data mapper, but I have the version here. The justice.gov site just stopped responding to my requests, even though I was requesting the pages quite gracefully.
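In case anyone else hits the same wall, a small retry-with-backoff sketch; the retry count and delays are arbitrary, not tuned values from the actual crawler:

```python
# Hypothetical retry wrapper for when the server stops responding or starts throttling.
import time

import requests

def fetch_with_backoff(url: str, retries: int = 5) -> requests.Response:
    delay = 2.0
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in (429, 503):
                # Treat throttling responses like transient failures.
                raise requests.HTTPError(f"server throttling: {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            if attempt == retries - 1:
                raise
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # exponential backoff
    raise RuntimeError("unreachable")
```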