The archive is split across four R2 layers, each individually mirrorable if you only want part of it. A per-layer manifest of every URL is hosted on R2 itself so you don't have to crawl anything.
| Layer | Count | Size | Manifest |
|---|---|---|---|
| PDFs (every DOJ release) | 2,522 | 5 GB | pdfs.txt |
| Court Records pages | 30,520 | ~3 GB | pages-court-records.txt |
| First Production pages | 33,295 | 19 GB | pages-first-production.txt |
| EFTA bulk pages | 2,731,550 | 396 GB | pages-efta.txt (240 MB) |
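
If you want all four manifests, a short loop saves typing. This is a minimal sketch, assuming the First Production manifest sits under the same `manifests/` path as the three manifests whose URLs appear elsewhere in this guide; `manifest_urls` is a hypothetical helper name, not something shipped with the archive.

```shell
# Base URL of the public R2 bucket, as used throughout this guide.
BASE="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev"

# Print the URL of each per-layer manifest, one per line.
# Assumption: all four manifests live under $BASE/manifests/.
manifest_urls() {
  local name
  for name in pdfs pages-court-records pages-first-production pages-efta; do
    printf '%s/manifests/%s.txt\n' "$BASE" "$name"
  done
}

# Usage (downloads all four manifests into the current directory;
# pages-efta.txt alone is about 240 MB):
#   manifest_urls | xargs -n 1 curl -fO
```
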
Every other public Epstein archive treats the PDF as the atomic unit. We made every page its own URL: 2,795,365 individually addressable page JPGs, all on a public CDN. That is the moat, and it is what makes this the most granular Epstein archive on the public internet.
rclone downloads every file in parallel, with retries, resumption, and progress reporting. This is the path most archivists will want.

    # macOS
    brew install rclone
    # Linux
    curl https://rclone.org/install.sh | sudo bash
    # Windows
    choco install rclone

    curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pdfs.txt

    # Mirror just the PDFs (5 GB)
    rclone copy --http-url https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev \
      :http:/dojscrape-pdfs/files ./mirror/pdfs \
      --transfers 8 --progress

    # Or use aria2c with a URL manifest (faster for many small files)
    aria2c -i pdfs.txt -d ./mirror/pdfs -j 16 -x 4

For the per-page JPG layers (millions of small files), aria2c is typically faster than rclone because it can keep more connections open at once.

    # All Court Records page JPGs (~3 GB)
    curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-court-records.txt
    aria2c -i pages-court-records.txt -d ./mirror/court-records -j 32 -x 4 --auto-file-renaming=false

    # The big one: every EFTA page (396 GB, 2.7M files)
    curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-efta.txt
    aria2c -i pages-efta.txt -d ./mirror/efta -j 32 -x 4 --auto-file-renaming=false

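
One thing to watch: given a plain URL list, aria2c saves each file under its basename inside the `-d` directory. If the manifest URLs follow the pattern of the single-page example in this guide (`pages-efta/EFTA00000001/page-001.jpg`), pages from different documents would compete for the same local filename. The sketch below is one way around that, assuming that layout: it turns a manifest into an aria2 input file with a per-URL `out=` option so each page keeps its path below the layer prefix. `manifest_to_aria2_input` is a hypothetical helper name, and the prefix you pass is your own assumption about the manifest's contents, so check a few lines of the manifest first.

```shell
# Convert a manifest (one URL per line) into an aria2 input file that keeps
# each file's path below a URL prefix instead of flattening to the basename.
#   $1 = manifest file
#   $2 = URL prefix to strip (assumed layout; verify against the manifest)
# In an aria2 input file, indented lines after a URL set options for that
# URL; "out" is interpreted relative to the directory given with -d.
manifest_to_aria2_input() {
  awk -v prefix="$2" '
    index($0, prefix) == 1 {
      print $0                                        # the URL itself
      print "  out=" substr($0, length(prefix) + 1)   # path relative to -d
    }
  ' "$1"
}

# Usage:
#   P="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/pages-efta/"
#   manifest_to_aria2_input pages-efta.txt "$P" > pages-efta.aria2
#   aria2c -i pages-efta.aria2 -d ./mirror/efta -j 32 -x 4 --auto-file-renaming=false
```

URLs that do not start with the prefix are skipped rather than guessed at, and percent-encoded names are kept exactly as they appear in the URL.
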
Expect 2–6 hours for the full EFTA layer on a fast residential connection (396 GB in that window works out to a sustained 150–440 Mbit/s), depending on how aggressively Cloudflare throttles your IP range. The smaller layers complete in minutes.
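
To see what that means for your own line, the arithmetic is easy to run. This is a rough estimate only, assuming decimal units and a perfectly sustained rate with no per-request overhead; `transfer_hours` is a hypothetical helper name.

```shell
# Hours needed to move SIZE_GB gigabytes at a sustained RATE_MBIT megabits/s.
#   $1 = size in GB (decimal: 1 GB = 8000 megabits)
#   $2 = sustained rate in Mbit/s
transfer_hours() {
  awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbit / 3600 }'
}

transfer_hours 396 100    # prints 8.8
transfer_hours 396 440    # prints 2.0
```

With millions of small files, real throughput will usually sit below the line rate, so treat these numbers as a floor.
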
For grabbing one specific document:

    # A single PDF
    curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/dojscrape-pdfs/files/Court%20Records/United%20States%20v.%20Maxwell%2C%20No.%2020-3061%20%282d%20Cir.%202020%29/EFTA02843065.pdf"

    # A single page JPG
    curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/pages-efta/EFTA00000001/page-001.jpg"

Each manifest is a plain-text file with one URL per line. You can sanity-check that you got everything with a simple line count:

    # Verify your local mirror against the manifest
    find ./mirror/efta -name 'page-*.jpg' | wc -l   # should equal 2,731,550
    wc -l pages-efta.txt                            # confirms manifest size

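
A matching count tells you nothing about which files are missing when the numbers differ. The sketch below lists manifest URLs that have no non-empty file in the mirror. It assumes your mirror keeps each file at its path below a URL prefix that you supply; `missing_from_mirror` is a hypothetical helper name, and the loop is plain shell, so expect it to take a while on the 2.7M-line EFTA manifest.

```shell
# Print manifest URLs whose file is missing (or empty) in the local mirror.
#   $1 = manifest file (one URL per line)
#   $2 = URL prefix to strip (assumed layout; verify against the manifest)
#   $3 = local mirror directory for that layer
missing_from_mirror() {
  local url rel
  while IFS= read -r url; do
    rel="${url#"$2"}"                        # path relative to the layer root
    [ -s "$3/$rel" ] || printf '%s\n' "$url" # -s: exists and is non-empty
  done < "$1"
}

# Usage:
#   P="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/pages-efta/"
#   missing_from_mirror pages-efta.txt "$P" ./mirror/efta > missing.txt
```
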
To host a mirror, any S3-compatible bucket with public read access works (Cloudflare R2, AWS S3, Backblaze B2, Wasabi, MinIO). Upload the layers with their directory structure intact, then point your audience at your own public URL. Because the paths are unchanged, every link in this archive will still resolve relative to your new domain.
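
If you also republish the manifests, their URLs need to point at your bucket rather than ours. A minimal sketch of that rewrite, assuming your mirror keeps the same paths; `rebase_manifest` is a hypothetical helper name and `https://mirror.example.org` is a placeholder for your own public URL.

```shell
# Rewrite every URL in a manifest from the original base to a mirror's base.
#   $1 = manifest file (one URL per line)
#   $2 = original base URL
#   $3 = mirror base URL (placeholder; use your own)
rebase_manifest() {
  awk -v old="$2" -v new="$3" '
    index($0, old) == 1 { print new substr($0, length(old) + 1); next }
    { print }   # leave any line that does not start with the old base untouched
  ' "$1"
}

# Usage:
#   rebase_manifest pdfs.txt \
#     "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev" \
#     "https://mirror.example.org" > pdfs.mirror.txt
```
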
If you mirror, open an issue on GitHub so we can add you to a public list of mirrors. Redundancy is the point.