Preservation · Public · Free

Mirror the Archive

The most granular Epstein archive on the public internet — 2,522 source PDFs and 2,795,365 page JPGs, every page its own URL. This is the self-hosting guide. Preservation should be redundant by design.

We mirrored it once so it can't be lost. Here's how to mirror it yourself, because the preservation of a primary-source archive should never depend on a single host. If we get taken down, you have it. If you take us down, the next archivist still has it.

What's in the archive

2,522 Source PDFs

2,795,365 Page JPGs

421 GB Total Size

100% Court Records Extracted

The archive is split across four R2 layers, each individually mirrorable if you only want part of it. A per-layer manifest of every URL is hosted on R2 itself so you don't have to crawl anything.

Layer	Count	Size	Manifest
PDFs (every DOJ release)	2,522	5 GB	pdfs.txt
Court Records pages	30,520	~3 GB	pages-court-records.txt
First Production pages	33,295	19 GB	pages-first-production.txt
EFTA bulk pages	2,731,550	396 GB	pages-efta.txt (240 MB)

The architectural difference

Every other public Epstein archive treats the PDF as the atomic unit. We made every page its own URL. 2,795,365 individually-addressable page JPGs, all on a public CDN. That's the moat — and it's what makes this the most granular Epstein archive on the public internet.

Citations become clickable. "See attached PDF p. 47" stops being a static reference and becomes a hot-linked image anyone with a browser can open.
No PDF renderer required. Phones, kiosks, embeds, social media previews, anything that loads images works.
Per-page downloads are ~180 KB instead of multi-megabyte documents. Bandwidth-cheap for both you and the reader.
Research tools can embed pages directly. The reference search on this site does exactly that — every hit is a static JPG load from R2, zero compute on our side, zero cost per query.
Atomicity enables collaboration. Linking to a specific page lets two researchers point at the same evidence without ambiguity.
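
As a rough sketch of what page-level addressing looks like in practice, the helper below builds the URL of one page from a document ID and a page number. It assumes every EFTA page follows the pattern of the single example URL shown in this guide (pages-efta/<document ID>/page-NNN.jpg, zero-padded to three digits). The function name page_url is hypothetical, and other layers may use a different path prefix.

```shell
#!/bin/sh
# Assumption: base URL and path pattern are copied from the example page URL in this guide.
BASE_URL="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev"

# page_url DOC_ID PAGE_NUMBER -> prints the URL of one page JPG.
# Pass the page number without leading zeros (printf would read "08" as invalid octal).
page_url() {
  printf '%s/pages-efta/%s/page-%03d.jpg\n' "$BASE_URL" "$1" "$2"
}

# "See EFTA00000001 p. 47" as a link:
page_url EFTA00000001 47
```

A citation tool could call this for every page reference it finds and emit a plain image link, with no PDF handling involved.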

Method 1: rclone (recommended)

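rclone downloads files in parallel with retries and progress reporting, and a re-run skips files that are already present. This is the path most archivists will want. Note that the rclone command in step 3 reads the bucket through rclone's HTTP backend rather than from the manifest; the manifest from step 2 feeds the aria2c alternative and the verification step.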

1 — Install rclone

# macOS
brew install rclone

# Linux
curl https://rclone.org/install.sh | sudo bash

# Windows
choco install rclone

2 — Download a manifest

curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pdfs.txt

3 — Mirror with rclone HTTP backend

# Mirror just the PDFs (5 GB)
rclone copy --http-url https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev \
  :http:/dojscrape-pdfs/files ./mirror/pdfs \
  --transfers 8 --progress

# Or use aria2c with a URL manifest (faster for many small files)
aria2c -i pdfs.txt -d ./mirror/pdfs -j 16 -x 4

Method 2: aria2c (parallel from manifest)

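For the per-page JPG layers (millions of small files), aria2c is usually faster than rclone because it reads the manifest directly and keeps many downloads in flight at once (-j sets the number of concurrent downloads, -x the connections per server).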

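# Court Records pages (~3 GB)
curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-court-records.txt
# aria2c names each download by its URL basename, so every document's page-001.jpg would
# collide in one directory. Add an out= path per URL (host and first path component stripped,
# the rest kept exactly as written in the manifest) so each document keeps its own directory.
awk 'NF { p = $0; sub("^https?://[^/]+/[^/]+/", "", p); print; print "  out=" p }' pages-court-records.txt > pages-court-records.aria2
aria2c -i pages-court-records.aria2 -d ./mirror/court-records -j 32 -x 4 --auto-file-renaming=false

# The big one: every EFTA page (396 GB, 2.7M files)
curl -O https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/manifests/pages-efta.txt
awk 'NF { p = $0; sub("^https?://[^/]+/[^/]+/", "", p); print; print "  out=" p }' pages-efta.txt > pages-efta.aria2
aria2c -i pages-efta.aria2 -d ./mirror/efta -j 32 -x 4 --auto-file-renaming=false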

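Expect 2–6 hours on residential internet for the full EFTA layer, depending on your connection and how aggressively Cloudflare throttles your IP range. That estimate assumes a sustained 150 to 450 Mbit/s; at 100 Mbit/s the 396 GB layer takes closer to 9 hours. The smaller layers complete in minutes.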
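
To adapt the time estimate to your own connection, a back-of-the-envelope calculation is enough. The sketch below is an assumption-laden lower bound: it treats GB as decimal gigabytes, assumes the link stays saturated, and ignores the per-request overhead of 2.7M small files. The function name eta_hours is hypothetical.

```shell
#!/bin/sh
# eta_hours SIZE_GB SPEED_MBIT -> hours needed to move SIZE_GB at a sustained SPEED_MBIT.
# 1 GB (decimal) = 8000 Mbit, and 3600 seconds = 1 hour.
eta_hours() {
  awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbit / 3600 }'
}

eta_hours 396 100   # EFTA layer at 100 Mbit/s -> 8.8
eta_hours 396 440   # EFTA layer at 440 Mbit/s -> 2.0
```

Real transfers will run slower than this bound, so treat the result as the best case when planning disk space and an overnight window.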

Method 3: Single file with curl

For grabbing one specific document:

# A single PDF
curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/dojscrape-pdfs/files/Court%20Records/United%20States%20v.%20Maxwell%2C%20No.%2020-3061%20%282d%20Cir.%202020%29/EFTA02843065.pdf"

# A single page JPG
curl -O "https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev/pages-efta/EFTA00000001/page-001.jpg"

Verification

Each manifest is a plain-text file with one URL per line. You can sanity-check that you got everything with a simple line count:

# Verify your local mirror against the manifest
find ./mirror/efta -name 'page-*.jpg' | wc -l    # should equal 2,731,550
wc -l pages-efta.txt                              # confirms manifest size
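
A matching line count does not say which files are missing, and it does not catch empty files left by an interrupted transfer. The sketch below lists every manifest URL that has no non-empty local counterpart. It assumes your mirror stores each file under the URL path with the scheme, host, and first path component removed (for example ./mirror/efta/EFTA00000001/page-001.jpg), which matches the find pattern above; the function name missing_from_mirror is hypothetical.

```shell
#!/bin/sh
# missing_from_mirror MANIFEST MIRROR_DIR
# Prints each manifest URL whose local file is absent or zero bytes.
missing_from_mirror() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue      # skip blank lines
    rel=${url#*://}                # drop the scheme
    rel=${rel#*/}                  # drop the host
    rel=${rel#*/}                  # drop the first path component (the layer prefix)
    [ -s "$2/$rel" ] || printf '%s\n' "$url"
  done < "$1"
}
```

Running missing_from_mirror pages-efta.txt ./mirror/efta > missing.txt produces a list in manifest format, so missing.txt can be fed back into the downloader to retry only the gaps. This checks presence and non-zero size only; it does not verify file contents.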

Hosting your own copy

Any S3-compatible bucket with public read works (Cloudflare R2, AWS S3, Backblaze B2, Wasabi, MinIO). Upload the layers preserving the directory structure, with each layer under its original path prefix (for example pages-efta/). Then point your audience at your own public bucket URL; as long as the object paths are unchanged, every link in this archive resolves against your new domain.
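
If you want to publish manifests for your own copy, one option is to rewrite the originals so each URL points at your host. The sketch below swaps the original base URL for a new one and leaves every path untouched. It assumes your bucket uses the same object paths as the original; the function name rewrite_manifest and the mirror address in the usage note are hypothetical.

```shell
#!/bin/sh
# Base URL of the original archive, as used throughout this guide.
ORIGIN="https://pub-eb8fd5ca806f444981ce78f13b06d52c.r2.dev"

# rewrite_manifest MANIFEST NEW_BASE
# Prints the manifest with ORIGIN replaced by NEW_BASE at the start of each line.
# Lines that do not start with ORIGIN are printed unchanged.
rewrite_manifest() {
  awk -v old="$ORIGIN" -v new="$2" '
    index($0, old) == 1 { $0 = new substr($0, length(old) + 1) }
    { print }
  ' "$1"
}
```

For example, rewrite_manifest pdfs.txt https://mirror.example.org > pdfs.mirror.txt yields a manifest that downstream archivists can use against your copy in exactly the same way as the original.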

If you mirror, open an issue on GitHub so we can add you to a public list of mirrors. Redundancy is the point.