Epstein Files Jan 30, 2026
Data hoarders on reddit have been hard at work archiving the latest Epstein Files release from the U.S. Department of Justice. Below is a compilation of their work with download links.
Please seed all torrent files to distribute and preserve this data.
Epstein Files Data Sets 1-8: INTERNET ARCHIVE LINK
Epstein Files Data Set 1 (2.47 GB): TORRENT MAGNET LINK
Epstein Files Data Set 2 (631.6 MB): TORRENT MAGNET LINK
Epstein Files Data Set 3 (599.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 4 (358.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 5 (61.5 MB): TORRENT MAGNET LINK
Epstein Files Data Set 6 (53.0 MB): TORRENT MAGNET LINK
Epstein Files Data Set 7 (98.2 MB): TORRENT MAGNET LINK
Epstein Files Data Set 8 (10.67 GB): TORRENT MAGNET LINK
Epstein Files Data Set 9 (incomplete): only contains 49 GB of 180 GB. Multiple reports of the DOJ server cutting off downloads at offset 48995762176.
ORIGINAL JUSTICE DEPARTMENT LINK
SHA1: 6ae129b76fddbba0776d4a5430e71494245b04c4
/u/susadmin’s More Complete Data Set 9 (96.25 GB)
De-duplicated merger of (45.63 GB + 86.74 GB) versions
Unverified version incomplete at ~101 GB.
Epstein Files Data Set 10 (78.64GB)
ORIGINAL JUSTICE DEPARTMENT LINK
SHA256: 7D6935B1C63FF2F6BCABDD024EBC2A770F90C43B0D57B646FA7CBD4C0ABCF846 MD5: B8A72424AE812FD21D225195812B2502
Epstein Files Data Set 11 (25.55GB)
ORIGINAL JUSTICE DEPARTMENT LINK
SHA1: 574950c0f86765e897268834ac6ef38b370cad2a
Epstein Files Data Set 12 (114.1 MB)
ORIGINAL JUSTICE DEPARTMENT LINK
SHA1: 20f804ab55687c957fd249cd0d417d5fe7438281
MD5: b1206186332bb1af021e86d68468f9fe
SHA256: b5314b7efca98e25d8b35e4b7fac3ebb3ca2e6cfd0937aa2300ca8b71543bbe2
This list will be edited as more data becomes available, particularly with regard to Data Set 9.
ADD MY DISCORD FOR MORE DISCUSSION: redbarinternet
Dataset 9 is cooked:


Anyone got discord? I have been scraping the website collecting all the links. There aren’t even 1 million links available. I have a full dashboard for it:

There’s nowhere close to 3.5 million files, even in the full link collection. I have scraped all possible links, too.

The current streak is how many pages in a row on dataset 9 it scraped before finding a new page. My threshold for stopping is set at 4000 duplicate pages in a row.
Why?

Yes, this means what you think. Each streak represents at least 5 separate times where there were more than 400 duplicate pages before new data was found. These are unique instances, too, so at one point you would have gone through 436 pages before a new one, then another 816 pages before the next one, and so on.
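For anyone curious, here is a minimal sketch of that streak logic. The listing URL pattern, the link regex, and the sleep interval are assumptions on my part, not the exact scraper used:

```python
# Minimal sketch of the duplicate-page "streak" logic described above.
# Assumptions: the listing URL pattern and the href regex are guesses.
import re
import time

import requests

BASE = "https://www.justice.gov/epstein/doj-disclosures/data-set-9-files"
STOP_AFTER = 4000  # stop once this many consecutive pages yield nothing new


def crawl_index():
    seen, streak, page = set(), 0, 0
    while streak < STOP_AFTER:
        html = requests.get(BASE, params={"page": page}, timeout=30).text
        links = set(re.findall(r'href="(/epstein/[^"]+)"', html))  # assumed link shape
        new = links - seen
        if new:
            if streak:
                print(f"page {page}: new data after a streak of {streak} duplicate pages")
            seen |= new
            streak = 0
        else:
            streak += 1
        page += 1
        time.sleep(1)  # be polite; the server throttles aggressive clients
    return seen


if __name__ == "__main__":
    print(f"collected {len(crawl_index())} unique links")
```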
Total counts based on links available that were scraped at the time my database tracked them:
This indicates that we have the potential to download ~900k files out of the current document range, which should run from 1 to 2,731,783.

You might try merging with the set below to see if you’ve scraped files that aren’t in it?
/u/susadmin’s More Complete Data Set 9 (96.25 GB)
De-duplicated merger of (45.63 GB + 86.74 GB) versions

I bet you’ve grabbed a bunch of missing pieces from the puzzle.
I’ll check it out. I was mostly working to scrape all links since there was no direct download. Then, downloading those. So, I mostly did data collection on what was there first.
I also fully believe that dataset 9 is being throttled. It’s fragmented. It’s broken. The indexing is wrong, etc.
I am seeding sets 1-8, 10-12, and the larger set 9. Seedbox is outside the US and has a very fast connection.
I will keep an eye on this post for other sets. 👍
Dataset 11 Scraped URLs info

Dataset 10 Scraped URLs info

Dataset 9 Scraped URLs info

Thx for posting, seed if you can ppl.
Funny how a rag-tag ad-hoc group can seed data so much better than the DOJ. Beautiful to see in action.
The DOJ could do better; they are ordered not to.
Heads up that the DOJ site is a tar pit: it’s going to return 50 files on the page regardless of the page number you’re on. Somewhere between 2k-5k pages it seems to just wrap around right now.
Testing page 2000... ✓ 50 new files (out of 50)
Testing page 5000... ○ 0 new files - all duplicates
Testing page 10000... ○ 0 new files - all duplicates
Testing page 20000... ○ 0 new files - all duplicates
Testing page 50000... ○ 0 new files - all duplicates
Testing page 100000... ○ 0 new files - all duplicates

The last page I got a non-duplicate URL from was 10853, which curiously had only 36 URLs on the page. When I browsed directly to page 10853, 36 URLs were displayed, but after moving back and forth in the page count the tar pit logic must have re-looped there, and it went back to 50 displayed. I ended with 224,751 URLs.
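For reference, a rough sketch of a probe like the one above. The URL pattern and link regex are assumptions, and in practice you would seed `seen` with your full scraped index before probing the high page numbers:

```python
# Quick probe for the "tar pit" wrap-around: request a handful of far-apart
# page numbers and check whether they return links we've already seen.
import re

import requests

BASE = "https://www.justice.gov/epstein/doj-disclosures/data-set-9-files"


def page_links(page):
    html = requests.get(BASE, params={"page": page}, timeout=30).text
    return set(re.findall(r'href="(/epstein/[^"]+)"', html))  # assumed link shape


seen = set()  # ideally: the URLs you have already scraped
for page in (2000, 5000, 10000, 20000, 50000, 100000):
    links = page_links(page)
    new = links - seen
    mark = "✓" if new else "○"
    print(f"Testing page {page}... {mark} {len(new)} new files (out of {len(links)})")
    seen |= links
```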
I saw this too; yesterday I tried manually accessing the page to explore just how many there are. Seems like some of the pages are duplicates (I was simply comparing the last listed file name and content between some of the first 10 pages, and even had 1-2 duplications.)
As far as the maximum page number goes, if you use the query parameter ?page=200000000 it will still resolve a list of files, which is actually crazy.
https://www.justice.gov/epstein/doj-disclosures/data-set-9-files?page=200000000
I’m working on a different method of obtaining a complete dataset zip for dataset 9. For those who are unaware, for a time yesterday there was an official zip available from the DOJ. To my knowledge no one was able to fully grab it, but I believe the 49 GB zip is a partial copy of it from before downloads got cut. It’s my thought that this original zip likely contained incriminating information and that’s why it got halted.
What I’ve observed is that Akamai still serves that zip sporadically in small chunks. It’s really strange and I’m not sure why it does, but I have verified with
strings that there are PDF file names in the zip data. I’ve been able to use a script to pull small chunks from the CDN across the entire span of the file’s byte range.

Using the 49 GB file as a starting point, I’m working on piecing the file together; however, progress is extremely slow. If there is anyone willing to team up on this and combine the chunks, please let me know.
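The linked script below does the heavy lifting, but the core idea is just HTTP Range requests against the CDN. A stripped-down sketch, with placeholder cookie and User-Agent values and an arbitrary example range:

```python
# Minimal illustration of pulling one byte range of the DataSet 9 zip with an
# HTTP Range request. Cookie and User-Agent values are placeholders; the real
# pastebin script handles retries, chunk bookkeeping, and backoff.
import requests

URL = "https://www.justice.gov/epstein/files/DataSet%209.zip"
HEADERS = {
    "Range": "bytes=48995762176-49012539391",  # ~16 MiB just past the 49 GB cutoff
    "Referer": "https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip",
    "User-Agent": "<same UA as the browser you exported cookies from>",
}
cookies = {}  # load these from your exported cookies.txt

resp = requests.get(URL, headers=HEADERS, cookies=cookies, timeout=90, stream=True)
if resp.status_code == 206:  # 206 Partial Content = the CDN honored the range
    start, end = 48995762176, 49012539391
    with open(f"{start}-{end}.bin", "wb") as fh:
        for block in resp.iter_content(1 << 20):
            fh.write(block)
else:
    print("No partial content served (status", resp.status_code, ") - refresh cookies/IP")
```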
How to grab the chunked data:
Script link: https://pastebin.com/9Dj2Nhyb
For the script you will probably have to:
pip install rich

Grab DATASET 9, INCOMPLETE AT ~48GB:
magnet:?xt=urn:btih:0a3d4b84a77bd982c9c2761f40944402b94f9c64&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

Then name the downloaded file 0-(the last byte the file spans).bin
So, for example, for the 48 GB file it would be:
0-48995762175.bin

Next to the Python script, make a directory called:
DataSet 9.zip.chunks

Move the renamed first-byte-range 48 GB file into that directory.
Make a new file next to the script called cookies.txt.

Install the Cookie-Editor browser extension (https://cookie-editor.com/).
With the browser extension open, go to: https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip
The download should start in your browser; cancel it.
Export the cookies in Netscape format. They will copy to your clipboard.
Paste those into your cookies.txt, then save and close it.

You can run the script like so:
python3 script.py \
  'https://www.justice.gov/epstein/files/DataSet%209.zip' \
  -o 'DataSet 9.zip' \
  --cookies cookies.txt --retries 3 \
  --backoff 5.0 \
  --referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' \
  -t auto -c auto

Script options:
-t - The number of concurrent threads to use, which results in trying that many byte ranges at the same time. Setting this to auto will auto-calculate based on your CPU, but it caps at 8 to be safe and avoid getting banned by Akamai.
-c - The chunk size to request from the server in MB. This is not always respected by the server and you may get a smaller or larger chunk, but the script should handle that. Setting this to auto scales with the file size, though feel free to try different sizes.
--backoff - The backoff factor between failures; helps prevent Akamai from throttling your requests.
--retries - The number of times to retry a byte range in that iteration before moving on to the next byte range. If it moves on, it will come back to it again on the next loop.
--cookies - The path to the file containing your Netscape-formatted cookies.
-o - The final file name. The chunks directory is derived from this, so make sure it matches the name of the chunk directory that you primed with the torrent chunk.
--referer - Sets the Referer HTTP header; just leave this as-is for Akamai.
There are more options if you run the script with the --help option.

If you start to receive HTML and/or HTTP 200 responses, you need to refresh your cookie.
If you start to receive HTTP 400 responses, you need to refresh your cookie in a different browser; Akamai is very fussy.
A VPN and multiple browsers might be useful to change your cookie and location combo.
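If you end up with a folder of start-end.bin chunks and want to combine efforts with someone else, here is a rough sketch of stitching them into the output file by offset. It follows the naming convention above but is not the pastebin script itself:

```python
# Stitch "start-end.bin" chunks into a single output file by seeking to each
# chunk's start offset. Gaps are simply left unwritten, so the result is only
# complete once every byte range is present.
import re
from pathlib import Path

CHUNK_DIR = Path("DataSet 9.zip.chunks")
OUTPUT = Path("DataSet 9.zip")

chunks = []
for p in CHUNK_DIR.glob("*.bin"):
    m = re.fullmatch(r"(\d+)-(\d+)\.bin", p.name)
    if m:
        chunks.append((int(m.group(1)), int(m.group(2)), p))
chunks.sort()

with open(OUTPUT, "r+b" if OUTPUT.exists() else "wb") as out:
    covered = 0
    for start, end, path in chunks:
        out.seek(start)
        with open(path, "rb") as fh:
            while block := fh.read(1 << 20):
                out.write(block)
        covered += end - start + 1

print(f"wrote {len(chunks)} chunks, {covered} bytes covered")
```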
Edit
I tested the script on Dataset 8 and it was able to stitch a valid zip together so assuming we’re getting valid data with Dataset 9 it should work.
Awesome, I don’t really understand what’s happening but I’m also running it (also doing it for the presumably exact same 48GB torrent, but I’m supposed to do that right?)
this method is not working for me anymore
Yeah :/ I haven’t been able to pull anything in a while now.

I was just able to pull 6 chunks, the data is still out there!

I messaged you on the other site; I’m currently getting a "Could not determine Content-Length (got None)" error.

What happens when you go to https://www.justice.gov/epstein/files/DataSet%209.zip in your browser?

Age gate > page not found
Yeah when I run into this I’ve switched browsers and it’s helped. I’ve also switched IP addresses and it’s helped.
alrighty, I’m currently in the middle of the archive.org upload but I can transfer the chunks I already have over to a different machine and do it there with a new IP
I also was getting the same error. Going to the link successfully downloads.
Updating the cookies fixed the issue.
Can also confirm, receiving more chunks again.
EDIT: Someone should play around with the retry and backoff settings to see if a certain configuration can avoid being blocked for a longer period of time. IP rotating is too much trouble.
Updated the script to display information better: https://pastebin.com/S4gvw9q1
It has one library dependency, so you’ll have to do:
pip install rich

I haven’t been getting blocked with this:
python script.py 'https://www.justice.gov/epstein/files/DataSet%209.zip' -o 'DataSet 9.zip' --cookies cookie.txt --retries 2 --referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' --ua '<set-this>' --timeout 90 -t 16 -c auto

The new script can auto-set threads and chunks; I updated the main comment with more info about those.
I’m setting the --ua option, which lets you override the User-Agent header. I’m making sure it matches the browser that I use to request the cookie.
I would be interested in obtaining the chunks that you gathered and stitch them to what I gathered.
Nor I. I got a single chunk back before never getting anything again.
I’m using a partial download I already had and not the 48gb version but I will be gathering as many chunks as I can as well. Thanks for making this
how big is the partial that you managed to get?
about 25gb
reposting a full magnet list (besides 9) of all the datasets that were on reddit, with healthy seeds:
Dataset 1 (2.47GB)
magnet:?xt=urn:btih:4e2fd3707919bebc3177e85498d67cb7474bfd96&dn=DataSet+1&xl=2658494752&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 2 (631.6MB)
magnet:?xt=urn:btih:d3ec6b3ea50ddbcf8b6f404f419adc584964418a&dn=DataSet+2&xl=662334369&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 3 (599.4MB)
magnet:?xt=urn:btih:27704fe736090510aa9f314f5854691d905d1ff3&dn=DataSet+3&xl=628519331&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 4 (358.4MB)
magnet:?xt=urn:btih:4be48044be0e10f719d0de341b7a47ea3e8c3c1a&dn=DataSet+4&xl=375905556&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 5 (61.5MB)
magnet:?xt=urn:btih:1deb0669aca054c313493d5f3bf48eed89907470&dn=DataSet+5&xl=64579973&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 6 (53.0MB)
magnet:?xt=urn:btih:05e7b8aefd91cefcbe28a8788d3ad4a0db47d5e2&dn=DataSet+6&xl=55600717&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 7 (98.2MB)
magnet:?xt=urn:btih:bcd8ec2e697b446661921a729b8c92b689df0360&dn=DataSet+7&xl=103060624&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 8 (10.67GB)
magnet:?xt=urn:btih:c3a522d6810ee717a2c7e2ef705163e297d34b72&dn=DataSet%208&xl=11465535175&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 10 (78.64GB)
magnet:?xt=urn:btih:d509cc4ca1a415a9ba3b6cb920f67c44aed7fe1f&dn=DataSet%2010.zip&xl=84439381640

Dataset 11 (25.55GB)
magnet:?xt=urn:btih:59975667f8bdd5baf9945b0e2db8a57d52d32957&xt=urn:btmh:12200ab9e7614c13695fe17c71baedec717b6294a34dfa243a614602b87ec06453ad&dn=DataSet%2011.zip&xl=27441913130&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=http%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.srv00.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.filemail.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker-udp.gbitt.info%3A80%2Fannounce&tr=udp%3A%2F%2Frun.publictracker.xyz%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.dstud.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fleet-tracker.moe%3A1337%2Fannounce&tr=https%3A%2F%2Ftracker.zhuqiy.com%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.pmman.tech%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.moeblog.cn%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.alaskantf.com%3A443%2Fannounce&tr=https%3A%2F%2Fshahidrazi.online%3A443%2Fannounce&tr=http%3A%2F%2Fwww.torrentsnipe.info%3A2701%2Fannounce&tr=http%3A%2F%2Fwww.genesis-sp.org%3A2710%2Fannounce

Dataset 12 (114.0MB)
magnet:?xt=urn:btih:EE6D2CE5B222B028173E4DEDC6F74F08AFBBB7A3&dn=DataSet%2012.zip&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce

Thank you for this!
I’ve added all magnet links for sets 1-8 to the original post. Magnet links for 9-11 match OP. Magnet link for 12 is different, but we’ve identified that there are at least two versions. DOJ removed files before the second version was downloaded. OP contains the early version of data set 12.
I’m in the process of downloading both dataset 9 torrents (45.63 GB + 86.74 GB). I will then compare the filenames in both versions (the 45.63GB version has 201,358 files alone), note any duplicates, and merge all unique files into one folder. I’ll upload that as a torrent once it’s done so we can get closer to a complete dataset 9 as one file.
- Edit 31Jan2026 816pm EST - Making progress. I finished downloading both dataset 9s (45.6 GB and the 86.74 GB). The 45.6GB set is 200,000 files and the 86GB set is 500,000 files. I have a .csv of the filenames and sizes of all files in the 45.6GB version. I’m creating the same .csv for the 86GB version now.
-
Edit 31Jan2026 845pm EST -
- dataset 9 (45.63 GB) = 201357 files
- dataset 9 (86.74 GB) = 531257 files
I did an exact filename combined with an exact file size comparison between the two dataset9 versions. I also did an exact filename combined with a fuzzy file size comparison (tolerance of +/- 1KB) between the two dataset9 versions. There were:
- 201330 exact matches
- 201330 fuzzy matches (+/- 1KB)
Meaning there are 201330 duplicate files between the two dataset9 versions.
These matches were written to a duplicates file. Then, from each dataset9 version, all files/sizes matching the file and size listed in the duplicates file will be moved to a subfolder. Then I’ll merge both parent folders into one enormous folder containing all unique files and a folder of duplicates. Finally, compress it, make a torrent, and upload it.
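For anyone who wants to reproduce the comparison, a minimal sketch of the exact filename + exact size matching described above (the directory names are placeholders):

```python
# Sketch of the duplicate detection described above: files are treated as
# duplicates if they share a filename and an exact byte size.
# "dataset9_45gb" and "dataset9_86gb" are placeholder directory names.
import csv
from pathlib import Path


def index_by_name_size(root):
    return {(p.name, p.stat().st_size): p for p in Path(root).rglob("*") if p.is_file()}


a = index_by_name_size("dataset9_45gb")
b = index_by_name_size("dataset9_86gb")
dupes = sorted(set(a) & set(b))

with open("duplicates.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["filename", "size_bytes"])
    writer.writerows(dupes)

print(f"{len(dupes)} exact name+size matches written to duplicates.csv")
```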
-
Edit 31Jan2026 945pm EST -
Still moving duplicates into subfolders.
-
Edit 31Jan2026 1027pm EST -
Going off of xodoh74984’s comment (https://lemmy.world/post/42440468/21884588), I’m increasing the rigor of my determination of whether the files that share a filename and size between both versions of dataset9 are in fact duplicates. This will be identical to rsync --checksum: verifying bit-for-bit that the files are the same by calculating their MD5 hash. This will take a while, but it is the best way.
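A small sketch of that checksum pass, assuming the duplicates.csv from the earlier step and the same placeholder directory names:

```python
# Verify that name+size "duplicates" really are identical by hashing both
# copies (same idea as rsync --checksum). Slow but simple; paths are placeholders.
import csv
import hashlib
from pathlib import Path


def md5(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while block := fh.read(bufsize):
            h.update(block)
    return h.hexdigest()


def find(root, name):
    return next(Path(root).rglob(name), None)


mismatches = 0
with open("duplicates.csv") as fh:
    for row in csv.DictReader(fh):
        a = find("dataset9_45gb", row["filename"])
        b = find("dataset9_86gb", row["filename"])
        if a and b and md5(a) != md5(b):
            mismatches += 1
            print("same name/size, different content:", row["filename"])

print(f"{mismatches} mismatches found")
```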
-
Edit 01Feb2026 1227am EST -
Checksum comparison complete. 73 files found that have the same file name and size but different content. Total number of duplicate files = 201257. Merging both dataset versions now, while keeping one subfolder of the duplicates, so nothing is deleted.
-
Edit 01Feb2026 1258am EST -
Creating the .tar.zst file now. 531285 total files, which includes all unique files between dataset9 (45.6GB) and dataset9 (86.7GB), as well as a subfolder containing the files that were found in both dataset9 versions.
-
Edit 01Feb2026 215am EST -
I was using wayyyy too high a compression value for no reason (zstd --ultra -22). Restarted the .tar.zst file creation (with zstd -12) and it’s going 100x faster now. Should be finished within the hour.
-
Edit 01Feb2026 311am EST -
.tar.zst file creation is taking very long. I’m going to let it run overnight - will check back in a few hours. I’m tired, boss.
- EDIT 01Feb2026 831am EST -
COMPLETE!
And then I doxxed myself in the torrent. One moment please while I fix that…
Final magnet link is HERE. GO GO GOOOOOO
I’m seeding @ 55 MB/s. I’m also trying to get into the new r/EpsteinPublicDatasets subreddit to share the torrent there.
Have a good night. I’ll be waiting to download it, seed it, make hardcopies and redistribute it.
Please check back in with us
Thank you so much for keeping us updated!!
Superb, I have 1-8, 11-12.
Only 10 remaining (to complete - downloading from Archive.org now)
Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
Current estimate of files list is:
- ~1,022,500 files (50 files/page × 20,450 pages)
- My scraped index so far: 528,586 files / 634,573 URLs
- Currently downloading individual files: 24,371 files (29GB)
- Download rate ~1 file/sec to avoid getting blocked = ~12 days continuous for full set
Your merged 45GB + 86GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference with my scraped URL list to find any gaps.
UPDATE DATASET 9 Files List:
Progress:
- Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of index)
- Downloading individual files: 30K files / 41GB so far
- Also grabbed the 86GB DataSet_9.tar.xz torrent (~500K files) - extracting now
Uploaded my URL index to Archive.org - 529K file URLs in JSON format if anyone wants to help download the remaining files.
link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7MB JSON file available for download.
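If you want to help pull the remaining files, here is a throttled downloader sketch. The index layout is an assumption (a flat JSON list of URLs), so adjust the parsing to whatever the archive.org file actually contains:

```python
# Throttled downloader for a scraped URL index (~1 request/sec, as suggested
# above, to avoid getting blocked). Filenames below are placeholders.
import json
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

INDEX = "epstein-dataset9-index.json"  # assumed local copy of the archive.org JSON
OUT = Path("dataset9_files")
OUT.mkdir(exist_ok=True)

urls = json.loads(Path(INDEX).read_text())  # assumed: a flat list of document URLs
for url in urls:
    dest = OUT / Path(urlparse(url).path).name
    if dest.exists():
        continue  # resume-friendly: skip what we already have
    resp = requests.get(url, timeout=60)
    if resp.ok:
        dest.write_bytes(resp.content)
    else:
        print("failed", resp.status_code, url)
    time.sleep(1)  # ~1 file/sec
```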
UPDATE Dataset Size Sanity Check:
Dataset Report
Generated: 2026-01-31T23:28:29.198691
Base Path: /mnt/epstein-doj-2026-01-30

Summary

| Dataset | Files | Extracted Size | ZIP Size | Types |
| --- | --- | --- | --- | --- |
| DataSet_1 | 6,326 | 2.48 GB | 1.23 GB | .pdf, .opt, .dat |
| DataSet_1_incomplete | 3,158 | 1.24 GB | N/A | .pdf, .opt, .dat |
| DataSet_2 | 577 | 631.66 MB | 630.79 MB | .pdf, .dat, .opt |
| DataSet_3 | 69 | 598.51 MB | 595.00 MB | .pdf, .dat, .opt |
| DataSet_4 | 154 | 358.43 MB | 351.52 MB | .pdf, .opt, .dat |
| DataSet_5 | 122 | 61.60 MB | 61.48 MB | .pdf, .dat, .opt |
| DataSet_6 | 15 | 53.02 MB | 51.28 MB | .pdf, .opt, .dat |
| DataSet_7 | 19 | 98.29 MB | 96.98 MB | .pdf, .dat, .opt |
| DataSet_8 | 11,042 | 10.68 GB | 9.95 GB | .pdf, .mp4, .xlsx |
| DataSet_9_files | 35,480 | 40.44 GB | 45.63 GB | .pdf, .mp4, .m4a |
| DataSet_9_45GB_unique | 28 | 84.18 MB | N/A | .pdf, .dat, .opt |
| DataSet_9_extracted | 531,256 | 94.51 GB | N/A | .pdf |
| DataSet_9_45GB_extracted | 201,357 | 47.45 GB | N/A | .pdf, .dat, .opt |
| DataSet_10_extracted | 504,030 | 81.15 GB | 78.64 GB | .pdf, .mp4, .mov |
| DataSet_11 | 14,045 | 1.17 GB | 25.56 GB | .pdf |
| DataSet_12 | 154 | 119.89 MB | 114.09 MB | .pdf, .dat, .opt |
| TOTAL | 1,307,832 | 281.07 GB | 162.87 GB | |
here is a little script that can generate the above report (a rough sketch of one is included below) if you have your dir laid out something like this:

# Minimum working example:
my_directory/
├── DataSet_1/
│   └── (any files)
├── DataSet_2/
│   └── (any files)
└── DataSet 2.zip (optional - will be matched)

Would love to help still from my PC on dataset 9 specifically. Any way we can exchange progress so I won’t start downloading files you already have downloaded?
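The script itself isn’t reproduced here, so here is a rough sketch of a report generator over that layout; the exact columns and the zip-matching rule are assumptions:

```python
# Rough sketch of a report like the table above: for each DataSet_* directory,
# count files, sum extracted size, look for a sibling "DataSet N.zip", and list
# the most common extensions. This is a reconstruction, not the poster's script.
from collections import Counter
from pathlib import Path


def human(n):
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.2f} {unit}"
        n /= 1024


root = Path("my_directory")  # placeholder base path
for d in sorted(p for p in root.iterdir() if p.is_dir() and p.name.startswith("DataSet")):
    files = [f for f in d.rglob("*") if f.is_file()]
    size = sum(f.stat().st_size for f in files)
    exts = ", ".join(e for e, _ in Counter(f.suffix for f in files).most_common(3))
    zips = list(root.glob(d.name.replace("_", " ") + ".zip"))  # optional sibling zip
    zsize = human(zips[0].stat().st_size) if zips else "N/A"
    print(f"{d.name:28} {len(files):>9,} {human(size):>11} {zsize:>11}  {exts}")
```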
E: just started scraping starting from page 18330 (as you mentioned you ended around 18333), hoping I can fill in the remaining 4000-ish pages
Update 2 (1715UTC): just finished scraping up until the page 20500 limit you set in the code. There are 0 new files in the range between 18330-20500 compared to the ones you already found. So unless I did something wrong, either your list is complete or the DOJ has been scrambling their shit (considering the large number of duplicate pages, I’m going with the second explanation).
Either way, I’m gonna extract the 48GB and 100GB torrent directories now and try to mark down which of the files already exist within those torrents, so we can make an (intermediate) list of which files are still missing from them
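A small sketch of that cross-check, assuming the URL list is one URL per line and that a document’s filename is the last path segment of its URL (directory and file names are placeholders):

```python
# Mark which scraped URLs already exist inside the extracted torrent
# directories and emit the rest as a "still missing" list.
from pathlib import Path
from urllib.parse import urlparse

url_list = Path("dataset9_urls.txt").read_text().split()  # placeholder URL list
have = {
    p.name
    for d in ("DataSet_9_45GB_extracted", "DataSet_9_extracted")  # placeholder dirs
    for p in Path(d).rglob("*")
    if p.is_file()
}

missing = [u for u in url_list if Path(urlparse(u).path).name not in have]
Path("dataset9_missing_urls.txt").write_text("\n".join(missing))
print(f"{len(missing)} of {len(url_list)} URLs not found in the extracted torrents")
```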
I’m downloading 8-11 now and seeding 1-7 + 12. I’ve tried checking up on reddit, but every other time I check in, the post is nuked or something. My home server never goes down and I’m outside the USA. I’m working on the 100GB+ #9 right now and I’ll seed whatever you can get up here too.
looking forward to your torrent, will seed.
I have several incomplete sets of files from dataset 9 that I downloaded with a scraped set of urls - should I try to get them to you to compare as well?
Yes! I’m not sure the best way to do that - upload them to MEGA and message me a download link?
maybe archive.org? that way they can be torrented if others want to attempt their own merging techniques? either way it will be a long upload, my speed is not especially good. I’m still churning through one set of urls that is 1.2M lines, most are failing but I have 65k from that batch so far.
archive.org is a great idea. Post the link here when you can!
I’ll get the first set (42k files in 31G) uploading as soon as I get it zipped up. it’s the one least likely to have any new files in it since I started at the beginning like others but it’s worth a shot
edit 01FEB2026 1208AM EST - 6.4/30gb uploaded to archive.org
edit 01FEB2026 0430AM EST - 13/30gb uploaded to archive.org; scrape using a different url set going backwards is currently at 75.4k files
edit 01FEB2026 1233PM EST - had an internet outage overnight and lost all progress on the archive.org upload, currently back to 11/30gb. the scrape using a previous url set seems to be getting very few new files now, sitting at 77.9k at the moment
When merging versions of Data Set 9, is there any risk of loss with simply using rsync --checksum to dump all files into one directory?

rsync --checksum is better than my file name + file size comparison, since you are calculating the hash of each file and comparing it to the hash of every other file. For example, if there is a file called data1.pdf with size 1024 bytes in dataset9-v1, and another file called data1.pdf with size 1024 bytes in dataset9-v2, but their content is different, my method will still detect them as identical files.

I’m going to modify my script to calculate and compare the hashes of all files that I previously determined to be duplicates. If the hashes of the duplicates in dataset9 (45GB torrent) match the hashes of the duplicates in dataset9 (86GB torrent), then they are in fact duplicates between the two datasets.
Amazing, thank you. That was my thought, check hashes while merging the files to keep any copies that might have been modified by DOJ and discard duplicates even if the duplicates have different metadata, e.g. timestamps.
Be prepared to wait a while… idk why this person chose xz, it is so slow. I’ve been just trying to get the tarball out for an hour.
I was quick to download dataset 12 after it was discovered to exist, and apparently my dataset 12 contains some files that were later removed. Uploaded to IA in case it contains anything that later archivists missed. https://archive.org/details/data-set-12_202602
Specifically doc number 2731361 and others around it were at some point later removed from DoJ, but are still within this early-download DS12. Maybe more, unsure
I’ve got that one too, maybe we should compare dataset 12 versions too
I’ve hopped on the 10 mag, will be seeding all night and then some. This might be one of the healthiest swarms I’ve ever seen
Link to the Data Set 9 incomplete ~100GB uncompressed 86GB compressed torrent: https://archive.org/details/data-set-9.tar.xz
Posted here because Reddit is suppressing this information by deleting any posts containing it.
Magnet:
magnet:?xt=urn:btih:acb9cb1741502c7dc09460e4fb7b44eac8022906&dn=DataSet_9.tar.xz&xl=93143408940&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=http%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.theoks.net%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.srv00.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.qu.ax%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.filemail.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.alaskantf.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker-udp.gbitt.info%3A80%2Fannounce&tr=udp%3A%2F%2Ft.overflow.biz%3A6969%2Fannounce&tr=udp%3A%2F%2Fopentracker.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.dstud.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fmartin-gebhardt.eu%3A25%2Fannounce&tr=udp%3A%2F%2Fevan.im%3A6969%2Fannounce&tr=udp%3A%2F%2Fd40969.acod.regrucolo.ru%3A6969%2Fannounce&tr=udp%3A%2F%2F6ahddutb1ucc3cp.ru%3A6969%2Fannounce&tr=https%3A%2F%2Ftracker.zhuqiy.com%3A443%2Fannounce

What version is this?
The same version I replied to, the one in that Internet Archive link (the 86GB, extracts to 100GB, has half the files missing)
Ah I see now! Sorry, I’m new to this platform and I need to get used to the structure of it.
Thanks
Yeah it’s a little different but similar enough to reddit. Better than reddit too. I just got permabanned off reddit (15 year account) for saying people > property
They’re probably too dumb to understand “>” means “greater than”, or in your sentence: people are worth more than property / people over property.
They probably read it like “People are property” which would obviously be “=” or “->” instead of “>”.
oh it was abundantly clear, the exact comment that got me banned was:
“property damage oh the horror boo fucking hoo literally cry harder”
(the context was the Kenosha “riots” where Rittenhouse happened, and in a later reply I said people > property)
I was banned for “inciting violence”; my appeal was denied because included in the reddit rules for “inciting violence” is “places” lmfao, even though the Wikipedia definition of “violence” defines it as explicitly towards “living beings”
the ownership class sure loves their property
Does anyone have an index of filenames/links from the DOJ website scraped?
Edit, specifically for DataSet 9.
I’m waiting for /u/Kindly_District9380 's version but I’ve been slowly working backwards on this in the meantime https://archive.org/details/dataset9_url_list
I’ve been checking your URLs but it seems you’ve got a lot without a downloadable document attached?
yeah I’m not the one who generated the url list but I’ve also been getting a lot without a downloadable document. I’m going to start on one of the url lists posted here soon
nice. Kinda feeling like we can’t be sure whether our URL lists are ever exhaustive enough or that the DOJ might just let a large part of the dataset go dark
yep, impossible to know
hey, sorry, I got super distracted with building a data mapper, but I have the version here; justice.gov stopped responding to my requests, even though I was requesting the pages quite gracefully:
UPDATE DATASET 9 Files List:
Progress:
- Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of index)
- Downloading individual files: 30K files / 41GB so far
- Also grabbed the 86GB DataSet_9.tar.xz torrent (~500K files) - extracting now
- Uploaded my URL index to Archive.org - 529K file URLs in JSON format if anyone wants to help download the remaining files.
link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7MB JSON file available for download.
No worries, thank you!
edit: I’ll start on that url list (randomized) tomorrow, my run from the previously generated url list is still going (currently 75.6k files)
coming up with that right now, check my comment below.
I made this script and am slowly about to finish the crawl; it’s close to 20k+ pages.
It should be done in less than 1-2 hours, and I will upload it to Archive.org.


