
Coming up with that right now; check my comment below.
I made this script and am slowly about to finish the crawl; it's close to 20k+ pages.
It should be done within 1-2 hours, and I will upload it to Archive.org.
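For context, here is a minimal sketch of what a listing-page crawl like this could look like, assuming Python with requests and BeautifulSoup. The listing URL pattern, the link filter, and the output filename are placeholders, not the actual script:

```python
# Hypothetical sketch of the page crawl, not the original script.
import json
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://www.justice.gov/some-listing-path?page={}"  # placeholder URL pattern
LAST_PAGE = 18333  # the index reportedly spans pages 0-18333

def crawl(out_path: str = "dataset9_index.json") -> None:
    urls = set()
    for page in range(LAST_PAGE + 1):
        resp = requests.get(BASE.format(page), timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep every link that looks like a document or media file; the filter is a guess.
        for a in soup.select("a[href]"):
            href = a["href"]
            if href.lower().endswith((".pdf", ".mp4", ".m4a", ".xlsx")):
                urls.add(urljoin(resp.url, href))
        time.sleep(1)  # stay polite: roughly one request per second
    with open(out_path, "w") as fh:
        json.dump(sorted(urls), fh, indent=2)

if __name__ == "__main__":
    crawl()
```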

Superb, I have datasets 1-8 and 11-12.
Only 10 remains to complete (downloading from Archive.org now).
Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
My current estimate of the files list is in the Dataset 9 update below.
Your merged 45 GB + 86 GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference them with my scraped URL list to find any gaps.
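A rough sketch of that cross-reference step, assuming the index is a flat JSON list of URLs and the torrents are already extracted on disk; the paths and function names are made up for illustration:

```python
# Hypothetical gap check: which indexed URLs have no matching file in the torrent dump?
import json
import os
from urllib.parse import urlparse

def torrent_filenames(root: str) -> set:
    # Collect every file name found anywhere under the extracted torrent directory.
    names = set()
    for _dirpath, _dirs, files in os.walk(root):
        names.update(files)
    return names

def missing_files(index_json: str, torrent_root: str) -> list:
    with open(index_json) as fh:
        urls = json.load(fh)
    have = torrent_filenames(torrent_root)
    # Treat a URL as missing if its basename does not appear in the torrent dump.
    return [u for u in urls if os.path.basename(urlparse(u).path) not in have]

if __name__ == "__main__":
    gaps = missing_files("dataset9_index.json", "/mnt/epstein-doj-2026-01-30/DataSet_9_files")
    print(f"{len(gaps)} indexed URLs not found in the torrent dump")
```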
UPDATE DATASET 9 Files List:
Progress:
Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of the index).
Uploaded my URL index to Archive.org: 529K file URLs in JSON format, in case anyone wants to help download the remaining files.
Link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7 MB JSON file available for download.
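For anyone pitching in on the downloads, a hedged sketch of how the JSON index could be consumed, assuming it is a flat list of direct file URLs (adjust if the real structure differs); it skips files already on disk so runs are resumable:

```python
# Hypothetical downloader for the published URL index.
import json
import os
import time
from urllib.parse import urlparse

import requests

def download_from_index(index_json: str, dest_dir: str) -> None:
    os.makedirs(dest_dir, exist_ok=True)
    with open(index_json) as fh:
        urls = json.load(fh)
    for url in urls:
        name = os.path.basename(urlparse(url).path) or "index.html"
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already downloaded, skip
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(target, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        time.sleep(1)  # keep the request rate gentle

if __name__ == "__main__":
    download_from_index("dataset9_index.json", "DataSet_9_files")
```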
UPDATE Dataset Size Sanity Check:
Dataset Report
Generated: 2026-01-31T23:28:29.198691
Base Path: /mnt/epstein-doj-2026-01-30
| Dataset | Files | Extracted | ZIP | Types |
|---|---|---|---|---|
| DataSet_1 | 6,326 | 2.48 GB | 1.23 GB | .pdf, .opt, .dat |
| DataSet_1_incomplete | 3,158 | 1.24 GB | N/A | .pdf, .opt, .dat |
| DataSet_2 | 577 | 631.66 MB | 630.79 MB | .pdf, .dat, .opt |
| DataSet_3 | 69 | 598.51 MB | 595.00 MB | .pdf, .dat, .opt |
| DataSet_4 | 154 | 358.43 MB | 351.52 MB | .pdf, .opt, .dat |
| DataSet_5 | 122 | 61.60 MB | 61.48 MB | .pdf, .dat, .opt |
| DataSet_6 | 15 | 53.02 MB | 51.28 MB | .pdf, .opt, .dat |
| DataSet_7 | 19 | 98.29 MB | 96.98 MB | .pdf, .dat, .opt |
| DataSet_8 | 11,042 | 10.68 GB | 9.95 GB | .pdf, .mp4, .xlsx |
| DataSet_9_files | 35,480 | 40.44 GB | 45.63 GB | .pdf, .mp4, .m4a |
| DataSet_9_45GB_unique | 28 | 84.18 MB | N/A | .pdf, .dat, .opt |
| DataSet_9_extracted | 531,256 | 94.51 GB | N/A | |
| DataSet_9_45GB_extracted | 201,357 | 47.45 GB | N/A | .pdf, .dat, .opt |
| DataSet_10_extracted | 504,030 | 81.15 GB | 78.64 GB | .pdf, .mp4, .mov |
| DataSet_11 | 14,045 | 1.17 GB | 25.56 GB | |
| DataSet_12 | 154 | 119.89 MB | 114.09 MB | .pdf, .dat, .opt |
| TOTAL | 1,307,832 | 281.07 GB | 162.87 GB | |
Here is a little script that can generate the above report, if your directory is laid out something like this:
# Minimum working example:
my_directory/
├── DataSet_1/
│   └── (any files)
├── DataSet_2/
│   └── (any files)
└── DataSet 2.zip (optional - will be matched)
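A minimal sketch of such a report script, assuming Python; the size formatting, the three-extension summary, and the zip-matching rule are my best guesses at reproducing the table above, not the original code:

```python
#!/usr/bin/env python3
"""Hypothetical report generator: scans DataSet_* directories and prints a markdown table."""
import os
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path

def human(nbytes: int) -> str:
    # Format a byte count as MB or GB, roughly matching the table above.
    if nbytes >= 1024 ** 3:
        return f"{nbytes / 1024 ** 3:.2f} GB"
    return f"{nbytes / 1024 ** 2:.2f} MB"

def scan_dir(path: Path):
    # Walk one dataset directory: count files, sum sizes, tally the top extensions.
    count, size, exts = 0, 0, Counter()
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            try:
                size += fp.stat().st_size
            except OSError:
                continue  # unreadable file, skip it
            count += 1
            if fp.suffix:
                exts[fp.suffix.lower()] += 1
    top = ", ".join(ext for ext, _ in exts.most_common(3))
    return count, size, top

def main(base: Path) -> None:
    print("Dataset Report")
    print(f"Generated: {datetime.now().isoformat()}")
    print(f"Base Path: {base}\n")
    print("| Dataset | Files | Extracted | ZIP | Types |")
    print("|---|---|---|---|---|")
    total_files = total_size = total_zip = 0
    for d in sorted(p for p in base.iterdir() if p.is_dir()):
        count, size, top = scan_dir(d)
        # Look for a sibling zip named like the directory ("DataSet_2.zip" or "DataSet 2.zip").
        zip_size = None
        for cand in (base / f"{d.name}.zip", base / f"{d.name.replace('_', ' ')}.zip"):
            if cand.exists():
                zip_size = cand.stat().st_size
                break
        zip_col = human(zip_size) if zip_size is not None else "N/A"
        print(f"| {d.name} | {count:,} | {human(size)} | {zip_col} | {top} |")
        total_files += count
        total_size += size
        total_zip += zip_size or 0
    print(f"| TOTAL | {total_files:,} | {human(total_size)} | {human(total_zip)} | |")

if __name__ == "__main__":
    main(Path(sys.argv[1]) if len(sys.argv) > 1 else Path("."))
```

Point it at the base directory (e.g. `python report.py /mnt/epstein-doj-2026-01-30`) and it prints the markdown table to stdout.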
Hey, sorry, I got super distracted building a data mapper, but I have the version here. The justice.gov site just stopped responding to my requests, even though I was requesting the pages quite gracefully.
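In case anyone else hits the same wall, a small retry-with-backoff sketch; the retry count and delays are arbitrary, not tuned values from the actual crawler:

```python
# Hypothetical retry wrapper for when the server stops responding or starts throttling.
import time

import requests

def fetch_with_backoff(url: str, retries: int = 5) -> requests.Response:
    delay = 2.0
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in (429, 503):
                # Treat throttling responses like transient failures.
                raise requests.HTTPError(f"server throttling: {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            if attempt == retries - 1:
                raise
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # exponential backoff
    raise RuntimeError("unreachable")
```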