Not that much is missing from Epstein DataSet 9 already collected.

submitted 2 months ago* (last edited 2 months ago) by ermstein@lemmy.world to c/datahoarder@lemmy.ml


If you merge the three versions of DataSet 9 that have been found so far:

DataSet%209.zip : https://github.com/yung-megafone/Epstein-Files

Data Set 9.tar.xz : https://archive.org/details/data-set-9.tar.xz

dataset9-more-complete.tar.zst : https://github.com/yung-megafone/Epstein-Files

You will end up with 531,282 IMAGES files (PDFs). You would think that a lot is missing. However, the partially corrupted DataSet%209.zip gives us a DAT and an OPT file that let us see which files are still missing.

The DAT file reveals that only 531,307 IMAGES files (PDFs) are supposed to be in the archive, which means only 25 PDF files are actually missing (531,307 - 531,282 = 25).
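The count above is a set difference between the IDs listed in the DAT file and the PDFs actually on disk. Here is a minimal sketch, assuming the expected document IDs have already been extracted from the DAT into a plain list (the DAT parsing is not shown) and that each PDF is named after its ID. The function name `missing_ids` is made up for illustration.

```python
from pathlib import Path


def missing_ids(expected_ids, images_root):
    """Return the expected IDs that have no matching PDF under images_root.

    Assumption: each PDF's filename (without extension) is its document ID.
    """
    found = {
        p.stem
        for p in Path(images_root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    }
    return sorted(set(expected_ids) - found)
```

Running it against the merged IMAGES folder should return a list whose length matches the gap between the DAT count and the on-disk count, if the naming assumption holds.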

Those 25 PDF files couldn't possibly account for the 80-ish GB still missing from the original DataSet 9, but the DAT file doesn't reveal how many NATIVES there were.

NATIVES are media files such as video and audio. You can see an example if you have a full DataSet 10. DataSet 10 also shows that every NATIVE has a PDF placeholder, which is always 4670 bytes.

So searching for all files of that exact size reveals that about 135 NATIVES (media files) are missing, which would account for the rest of the missing 80 GB.
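The size search described above can be sketched in a few lines. This assumes the merged IMAGES files sit under one local folder. The 4670-byte figure comes from the post, while the function name `placeholder_pdfs` is hypothetical.

```python
from pathlib import Path

PLACEHOLDER_SIZE = 4670  # bytes, the placeholder size observed in DataSet 10


def placeholder_pdfs(root, size=PLACEHOLDER_SIZE):
    """Return all PDFs under root whose size is exactly `size` bytes."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() == ".pdf"
        and p.stat().st_size == size
    )
```

A size match is only a heuristic: an ordinary PDF could happen to be the same size, so each hit is a candidate placeholder, not a confirmed one.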

I have listed below which IMAGES (PDF) and NATIVES (media) files are missing, so that it is easy to coordinate tracking down the remaining files we need for a complete DataSet 9.

(Though the remaining PDFs could be placeholders for up to 25 more natives, which would have to be checked once they are found.)

Update 1 (February 6):

In my original post (https://lemmy.world/post/42700643), I found that NATIVEs have a placeholder that is 4670 bytes.

However, from comparing every NATIVE in DataSet 10 to its placeholder, I have discovered a second placeholder size that is 2433 bytes.
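A comparison like this can be approximated by pairing each native with the PDF that shares its base filename and tallying the PDF sizes. This is a sketch under assumptions: that natives and placeholders share a filename stem, and that they live under separate folders. The folder arguments and the name `placeholder_size_counts` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def placeholder_size_counts(natives_root, images_root):
    """Tally the byte sizes of PDFs whose stem matches a native file's stem.

    Assumption: a native and its placeholder PDF share the same base name.
    """
    native_stems = {p.stem for p in Path(natives_root).rglob("*") if p.is_file()}
    sizes = Counter()
    for pdf in Path(images_root).rglob("*"):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf" and pdf.stem in native_stems:
            sizes[pdf.stat().st_size] += 1
    return sizes
```

If only a small number of distinct sizes dominate the tally, those sizes are good candidates for placeholder detection in other data sets.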

The NATIVEs estimate is now 2542 (from previous 135).

I have attached the updated NATIVEs list, along with the same list of 25 missing IMAGES (since they could also be native placeholders).

NEW_MISSING_EFTA_NATIVES.txt

MISSING_EFTA_IMAGES.txt

Update 2 (February 6):

I have found that 1983 of the 2542 NATIVEs are directly downloadable from the DOJ.

1983_NATIVES_URLS.txt

If anyone wants to attempt the remaining natives, I have tried the following extensions: ".avi",".mp4",".mov",".mp3",".wav",".m4a",".m4v",".wmv",".ts",".vob",".3gp",".amr",".opus",".csv",".xlsx",".xls",".docx",".doc",".pluginpayloadattachment"
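For anyone scripting the guesses, here is a small sketch that expands one document ID into a candidate URL per extension. The extension list is the one from the post. The base URL is deliberately left as a parameter because the real DOJ path layout is not given here, and `candidate_urls` is a made-up name.

```python
# Extensions already tried, copied from the list above.
EXTENSIONS = [
    ".avi", ".mp4", ".mov", ".mp3", ".wav", ".m4a", ".m4v", ".wmv", ".ts",
    ".vob", ".3gp", ".amr", ".opus", ".csv", ".xlsx", ".xls", ".docx", ".doc",
    ".pluginpayloadattachment",
]


def candidate_urls(base_url, doc_id, extensions=EXTENSIONS):
    """Build one candidate URL per extension for a single document ID.

    Assumption: the file lives at <base_url>/<doc_id><ext>; the real
    layout may differ, so pass whatever base path actually works.
    """
    base = base_url.rstrip("/")
    return [f"{base}/{doc_id}{ext}" for ext in extensions]
```

Passing a different `extensions` list makes it easy to try new guesses without repeating the ones above.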


[-] CapableStaircase@lemmy.zip 2 points 2 months ago

I could only grab ~44 of the NATIVEs you’ve listed and they total up to a tiny portion of the expected 80GB remaining. The hard part is guessing what file extension these files will have without getting rate limited by DOJ. I was hoping to get a copy of the zip file’s EOCD but it’s still down.

If anyone ever sees that zip come back please try and download the last 150-200MB. That’s where the zip archive’s table of contents is gonna live.
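For reference, here is a rough sketch of what could be done with that tail once someone has it: scan backwards for the End of Central Directory (EOCD) signature and read the entry count and central directory offset. The function name `find_eocd` is made up, and how the tail bytes are obtained (for example with an HTTP Range request) is left out.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
EOCD_FMT = "<4sHHHHIIH"    # fixed 22-byte part of the EOCD record


def find_eocd(tail):
    """Locate the last EOCD record in `tail` and return its key fields.

    Returns None if no complete record is found.
    """
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or pos + struct.calcsize(EOCD_FMT) > len(tail):
        return None
    (_sig, _disk, _cd_disk, _entries_on_disk,
     total_entries, cd_size, cd_offset, _comment_len) = struct.unpack_from(
        EOCD_FMT, tail, pos
    )
    return {"entries": total_entries, "cd_size": cd_size, "cd_offset": cd_offset}
```

An archive this large will almost certainly use Zip64, in which case these fields hold the maximum value (0xFFFF or 0xFFFFFFFF) and the real numbers live in the Zip64 EOCD record just before it, which this sketch does not parse.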

[-] ermstein@lemmy.world 4 points 2 months ago

One thing you could try is looking at the file extensions from DataSet 10’s Natives so you have fewer to guess from.

The rest of the natives still could be that large but I’ll double check if there are other placeholders.
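To narrow the guesses along those lines, the extensions in a local DataSet 10 NATIVES folder can be tallied and tried most-common first. This is a minimal sketch, assuming the natives are unpacked under a single folder. The name `extension_counts` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def extension_counts(natives_root):
    """Return (extension, count) pairs for all files, most common first."""
    counts = Counter(
        p.suffix.lower() for p in Path(natives_root).rglob("*") if p.is_file()
    )
    return counts.most_common()
```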

[-] CapableStaircase@lemmy.zip 1 points 2 months ago

I found this in a random doc today. I’ll add it to your list and give it a shot tonight. It’ll be slow going so I don’t get rate limited again. I think if you hit too many 404s in a row the CDN locks you out for a bit.
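One way to stay under a limit like that is to slow down as consecutive 404s pile up. The sketch below only computes the wait time. The base delay and cap are guesses, not known CDN thresholds, and `next_delay` is a made-up name.

```python
def next_delay(consecutive_404s, base=2.0, cap=300.0):
    """Seconds to wait before the next request.

    Doubles with each consecutive 404 and never exceeds `cap`.
    Both numbers are guesses, not measured CDN limits.
    """
    return min(cap, base * (2 ** consecutive_404s))
```

A download loop would reset the counter to zero after any successful response and call `time.sleep(next_delay(count))` between requests.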

[-] ermstein@lemmy.world 2 points 2 months ago

I updated the post with the URLs that I have found, and what extensions I have tried. Also you can track updates at https://github.com/yung-megafone/Epstein-Files/issues/4

[-] CapableStaircase@lemmy.zip 1 points 2 months ago

For anyone watching this post, I just dropped an update on that issue. Will be posting a new magnet link for the 84GB I was able to download soon.

[-] CapableStaircase@lemmy.zip 1 points 2 months ago

Can you also check and see if dataset 8/10/11 have all the native files they should based on the presence of these placeholders?

[-] ermstein@lemmy.world 2 points 2 months ago

I have updated the post with a list of 2542 NATIVEs instead of 135 after finding a second placeholder size of 2433 bytes.

this post was submitted on 05 Feb 2026

63 points (100.0% liked)
