EpsteinScan Hits 1.2 Million Documents
EpsteinScan just crossed 1.2 million documents in the database. That's every publicly released document from the Epstein case — court filings, flight logs, deposition transcripts, all searchable in under 200ms.
How we got here
The original scraper pulled documents from multiple sources: PACER court records, FOIA releases, and digitized archives. Each document goes through OCR if it's a scanned PDF, then gets chunked into searchable segments. The pipeline is fully automated: new releases get ingested within hours of publication.
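The chunking step can be sketched roughly like this. This is illustrative, not the actual pipeline code: the function name, chunk size, and overlap are all assumptions, but overlapping segments are a common way to keep a phrase near a boundary searchable in at least one chunk.

```python
def chunk_text(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split OCR'd text into overlapping segments.

    Overlap means a phrase straddling a chunk boundary still lands whole
    inside at least one segment. Sizes here are illustrative guesses.
    """
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks overlap
    return chunks
```

Each chunk would then be indexed as its own searchable row, pointing back at the parent document.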
The search challenge
Full-text search on 1.2 million documents is non-trivial. We're using PostgreSQL's built-in full-text search (hosted on Supabase) with GIN indexes. The key insight was building a composite tsvector that weights different fields:
- Document title: weight A (highest priority)
- First paragraph: weight B
- Body text: weight C
- Metadata (names, dates, locations): weight A
This means searching for "flight log" surfaces the actual flight logs before random mentions in other documents.
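Conceptually, the weighting behaves like the toy scorer below. The real work happens in SQL with `setweight()` and `to_tsvector()`; this Python sketch only illustrates the ranking effect, using PostgreSQL's default `ts_rank` weights (D=0.1, C=0.2, B=0.4, A=1.0). The documents and field names are made up.

```python
# PostgreSQL's default ts_rank weight array: D=0.1, C=0.2, B=0.4, A=1.0.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# Field -> weight label, mirroring the composite tsvector described above.
FIELD_LABELS = {"title": "A", "metadata": "A", "first_paragraph": "B", "body": "C"}

def score(doc: dict, term: str) -> float:
    """Toy rank: sum the weight of every field the term appears in."""
    total = 0.0
    for field, label in FIELD_LABELS.items():
        if term.lower() in doc.get(field, "").lower():
            total += WEIGHTS[label]
    return total

docs = [
    {"title": "Flight log, 2002", "body": "..."},
    {"title": "Deposition transcript", "body": "mentions a flight log in passing"},
]
ranked = sorted(docs, key=lambda d: score(d, "flight log"), reverse=True)
```

A title hit (weight A, 1.0) outranks a body mention (weight C, 0.2), so the actual flight log sorts first, which is exactly the behavior described above.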
Performance optimizations
At this scale, naive queries were taking 2-3 seconds. Here's what brought it under 200ms:
- Materialized views for the search index — rebuilt nightly, queried instantly
- Partial indexes on document type — most users search within a specific category
- Connection pooling via Supavisor — handles concurrent search traffic without exhausting connections
- Result caching — popular queries get cached for 5 minutes
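The result cache can be as simple as an in-process dict with a TTL. A minimal sketch, assuming nothing about the real setup beyond the 5-minute window; `run_query` is a stand-in for the actual Supabase call:

```python
import time

CACHE_TTL = 300  # seconds -- "popular queries get cached for 5 minutes"
_cache: dict[str, tuple[float, list]] = {}

def cached_search(query: str, run_query) -> list:
    """Return cached results if still fresh, otherwise run and cache."""
    entry = _cache.get(query)
    if entry is not None and time.monotonic() - entry[0] < CACHE_TTL:
        return entry[1]  # cache hit: skip the database entirely
    results = run_query(query)
    _cache[query] = (time.monotonic(), results)
    return results
```

In production you'd likely want an eviction bound or an external cache so memory can't grow without limit, but the shape of the logic is the same.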
What people are finding
The most searched terms tell a story: names of associates, specific dates, location names. Researchers and journalists are using EpsteinScan as their primary research tool. We've had traffic from major news organizations — can't name them, but the access logs don't lie.
What's next
Working on semantic search using pgvector embeddings. The idea is that you could search for concepts, not just keywords. "meetings in New York about financial arrangements" would surface relevant documents even if they don't contain those exact words.
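In pgvector terms that's a nearest-neighbor query with the `<=>` cosine-distance operator; the idea reduces to comparing embedding vectors. A back-of-the-envelope version with hand-written 3-dimensional vectors standing in for real model output (everything here is illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: real ones come from an embedding model, not by hand.
doc_vectors = {
    "memo about a New York financial meeting": [0.9, 0.1, 0.2],
    "flight log": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # embedding of the concept query
best = max(doc_vectors, key=lambda d: cosine_similarity(query_vec, doc_vectors[d]))
```

The memo scores highest even though it shares no keywords with the query string, which is the whole point of concept-level search.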
The database keeps growing as more documents get released. The architecture scales — Supabase handles it fine, and the indexing pipeline is fully automated. This is one of those projects where the infrastructure just works, and I can focus on features instead of firefighting.