Preservation · Public · Free

Mirror the Archive

The most granular Epstein archive on the public internet — 2,522 source PDFs and 2,795,365 page JPGs, every page its own URL. This is the self-hosting guide. Preservation should be redundant by design.
We mirrored it once so it can't be lost. Here's how to mirror it yourself, because the preservation of a primary-source archive should never depend on a single host. If we get taken down, you have it. If you take us down, the next archivist still has it.

What's in the archive

2,522 Source PDFs
2,795,365 Page JPGs
421 GB Total Size
100% Court Records Extracted

The archive is split across four R2 layers, each individually mirrorable if you only want part of it. A per-layer manifest of every URL is hosted on R2 itself so you don't have to crawl anything.

Layer                      Count      Size     Manifest
PDFs (every DOJ release)   2,522      5 GB     pdfs.txt
Court Records pages        30,520     ~3 GB    pages-court-records.txt
First Production pages     33,295     19 GB    pages-first-production.txt
EFTA bulk pages            2,731,550  396 GB   pages-efta.txt (240 MB)
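
To grab all four manifests in one pass, something like the following sketch may help. The base URL and file names come from the table and the commands in this guide; the /manifests/ path for pages-first-production.txt is assumed to follow the same pattern as the others, and the function names are illustrative only.

```shell
# Base URL and manifest names as listed in this guide.
BASE="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev"
MANIFESTS="pdfs.txt pages-court-records.txt pages-first-production.txt pages-efta.txt"

# Print the URL of every per-layer manifest, one per line.
manifest_urls() {
  for name in $MANIFESTS; do
    printf '%s/manifests/%s\n' "$BASE" "$name"
  done
}

# Download every manifest into DEST (default: current directory).
# -f makes curl fail on HTTP errors instead of saving an error page.
fetch_manifests() {
  dest="${1:-.}"
  mkdir -p "$dest" || return 1
  for name in $MANIFESTS; do
    curl -fL -o "$dest/$name" "$BASE/manifests/$name" || return 1
  done
}
```

Running fetch_manifests ./manifests would then leave the four text files in ./manifests; pages-efta.txt is the large one at about 240 MB.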

The architectural difference

Every other public Epstein archive treats the PDF as the atomic unit. We made every page its own URL: 2,795,365 individually addressable page JPGs, all on a public CDN. That's the moat, and it's what makes this the most granular Epstein archive on the public internet.

Method 1: rclone (recommended)

rclone downloads files in parallel with retries and progress reporting, and a re-run skips files that are already complete. This is the path most archivists will want.

1 — Install rclone

# macOS
brew install rclone

# Linux
curl https://rclone.org/install.sh | sudo bash

# Windows
choco install rclone

2 — Download a manifest

curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pdfs.txt

3 — Mirror with rclone HTTP backend

# Mirror just the PDFs (5 GB)
rclone copy --http-url https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev \
  :http:/dojscrape-pdfs/files ./mirror/pdfs \
  --transfers 8 --progress

# Or use aria2c with a URL manifest (faster for many small files)
aria2c -i pdfs.txt -d ./mirror/pdfs -j 16 -x 4

Method 2: aria2c (parallel from manifest)

For the per-page JPG layers (millions of small files), aria2c is usually the faster choice: it reads the manifest directly and keeps many downloads in flight at once (-j 32 in the commands below).

# Court Records page JPGs (~3 GB, 30,520 files)
curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-court-records.txt
aria2c -i pages-court-records.txt -d ./mirror/court-records -j 32 -x 4 --auto-file-renaming=false

# The big one: every EFTA page (396 GB, 2.7M files)
curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-efta.txt
aria2c -i pages-efta.txt -d ./mirror/efta -j 32 -x 4 --auto-file-renaming=false
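
One caveat worth checking before the big run: given a plain URL list, aria2c saves each file under the last segment of its URL inside the -d directory. If page file names repeat across documents (as the .../EFTA00000001/page-001.jpg example suggests), those downloads would collide, and with --auto-file-renaming=false the later ones are not saved under a new name. aria2c's input-file format accepts an indented out= line after each URI, so a possible workaround is to convert the manifest first. This is a sketch, not the archive's own tooling; it assumes the URLs contain no percent-escapes (true of the page URL shown here, not of the PDF paths), and to_aria2_input is a hypothetical helper name.

```shell
# to_aria2_input MANIFEST: print the manifest in aria2c input-file format,
# adding an out= line that keeps each URL's path below the host.
to_aria2_input() {
  awk '
    $0 ~ "^https?://" {
      path = $0
      sub("^https?://[^/]+/", "", path)   # drop scheme and host
      sub("[?#].*$", "", path)            # drop any query string or fragment
      print $0
      print "  out=" path                 # option lines must be indented
    }
  ' "$1"
}

# Usage (not run here):
#   to_aria2_input pages-efta.txt > pages-efta.aria2
#   aria2c -i pages-efta.aria2 -d ./mirror -j 32 -x 4 --auto-file-renaming=false
```

With -d ./mirror the files land at ./mirror/pages-efta/EFTA00000001/page-001.jpg and so on, so any later count check should point at ./mirror/pages-efta rather than ./mirror/efta.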

Expect 2–6 hours on residential internet for the full EFTA layer, depending on your connection and how aggressively Cloudflare throttles your IP range. The smaller layers complete in minutes.
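
For a rough estimate on your own line, the arithmetic is size times 8 divided by bandwidth. The sketch below treats 1 GB as 10^9 bytes and ignores per-file request overhead, which matters with millions of small files, so read the result as a lower bound. eta_hours is an illustrative name.

```shell
# eta_hours SIZE_GB MBPS: hours needed to move SIZE_GB (decimal gigabytes)
# at a sustained MBPS megabits per second, ignoring per-file overhead.
eta_hours() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

eta_hours 396 100    # EFTA layer on a 100 Mbit/s line: prints 8.8
eta_hours 396 440    # EFTA layer at 440 Mbit/s: prints 2.0
```

By the same formula, finishing 396 GB in 6 hours needs roughly 150 Mbit/s sustained, so slower connections should plan for proportionally longer runs.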

Method 3: Single file with curl

For grabbing one specific document:

# A single PDF
curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/dojscrape-pdfs/files/Court%20Records/United%20States%20v.%20Maxwell%2C%20No.%2020-3061%20%282d%20Cir.%202020%29/EFTA02843065.pdf"

# A single page JPG
curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/pages-efta/EFTA00000001/page-001.jpg"
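
To pull every page of one document rather than a single page, the manifest can be filtered by the document's directory name. This is a minimal sketch that assumes each document's pages sit under a path segment equal to its ID, as in the page URL above; doc_pages is a hypothetical helper name.

```shell
# doc_pages MANIFEST DOC_ID: print the manifest URLs that sit under
# DOC_ID's directory (fixed-string match on "/DOC_ID/").
doc_pages() {
  grep -F "/$2/" "$1"
}

# Usage (not run here):
#   doc_pages pages-efta.txt EFTA00000001 > one-doc.txt
#   aria2c -i one-doc.txt -d ./EFTA00000001 -j 8
```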

Verification

Each manifest is a plain-text file with one URL per line. You can sanity-check that you got everything with a simple line count:

# Verify your local mirror against the manifest
find ./mirror/efta -name 'page-*.jpg' | wc -l    # should equal 2,731,550
wc -l pages-efta.txt                              # confirms manifest size
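
A matching count does not say which files are missing when the numbers differ. The sketch below walks the manifest and prints every URL whose local file is absent or empty. It assumes the mirror keeps each URL's path below the host (for example MIRROR_ROOT/pages-efta/EFTA00000001/page-001.jpg) and that paths contain no percent-escapes; missing_urls is an illustrative name, and a pure shell loop over 2.7M lines will take a while.

```shell
# missing_urls MANIFEST MIRROR_ROOT: print every manifest URL whose file is
# absent or empty under MIRROR_ROOT.
missing_urls() {
  # The "|| [ -n ... ]" clause also handles a last line with no trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue    # skip blank lines
    path="${url#*://}"           # strip the scheme
    path="${path#*/}"            # strip the host
    [ -s "$2/$path" ] || printf '%s\n' "$url"
  done < "$1"
}

# Usage (not run here):
#   missing_urls pages-efta.txt ./mirror > retry.txt
```

The resulting retry.txt is itself a one-URL-per-line list, so it can be fed back to the downloader for a second pass.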

Hosting your own copy

Any S3-compatible bucket with public read works (Cloudflare R2, AWS S3, Backblaze B2, Wasabi, MinIO). Upload the layers preserving the directory structure. Then point your audience at your own pub URL — every link in this archive will still resolve relative to your new domain.
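
If you also want to publish manifests for your mirror, one option is to rewrite the host part of each URL and leave the path alone. This is a sketch under the assumption that your bucket keeps the same directory structure; https://mirror.example.org is a placeholder for your own public URL, and rebase_manifest is a hypothetical helper name.

```shell
# rebase_manifest MANIFEST NEW_BASE: print the manifest with each URL's
# scheme and host replaced by NEW_BASE, leaving the path untouched.
rebase_manifest() {
  # "${2%/}" drops a trailing slash so the result has no double slash.
  awk -v base="${2%/}" '{
    if (match($0, "^https?://[^/]+")) print base substr($0, RLENGTH + 1)
    else print
  }' "$1"
}

# Usage (not run here):
#   rebase_manifest pages-efta.txt https://mirror.example.org > pages-efta.mirror.txt
```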

If you mirror, open an issue on GitHub so we can add you to a public list of mirrors. Redundancy is the point.