
Building a DOJ File Monitor: Tracking 500K+ Epstein Documents


Another productive day working on EpsteinScan! Got some solid visual fixes done and built out a pretty cool monitoring system.

Fixed Those Annoying Face Crops

First thing I tackled was the Wall page face positioning. You know how circular profile crops can be tricky? Some faces were getting cut off oddly because the default center positioning doesn't work for every photo.

Instead of fighting with CSS classes and !important declarations, I went with inline object-position overrides for each of the 12 profile photos. Much cleaner approach with higher specificity. Had to dial in custom positions - some needed 15% offset, others up to 30% depending on where the face sits in the original image.
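To illustrate the idea (the markup and exact offsets here are made up, not the real wall.html), an inline `object-position` override on each image looks like this:

```html
<!-- default is object-position: 50% 50%, which centers the crop;
     nudging the vertical value up pulls the face into the circle -->
<img class="profile-pic" src="photo1.jpg" alt="Profile"
     style="object-fit: cover; object-position: 50% 30%;">
```

Because inline styles outrank class rules, no `!important` is needed.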

Always made a backup first (wall.html.bak) before editing directly on the production server. Old habits die hard!

Built a DOJ File Monitor

The real fun was building out an automated monitor for the DOJ Epstein document releases. This thing checks all 12 data sets every 6 hours and sends email alerts when new files drop.

Here's what made it interesting:

The Akamai Challenge: DOJ's site has some protection that blocks simple curl requests on paginated pages. Had to use Python's requests.Session() with proper headers (Chrome UA, Sec-Fetch headers, that justiceGovAgeVerified cookie) to make it think we're a real browser.
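A sketch of the session setup, based on the headers and cookie mentioned above. The exact header values are illustrative guesses, not the precise ones the DOJ site checks:

```python
import requests

def make_doj_session():
    """Build a requests.Session that presents itself as a real Chrome
    browser, so paginated pages aren't blocked like a bare curl request."""
    session = requests.Session()
    session.headers.update({
        # Chrome user agent string (illustrative version number)
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml",
        # Sec-Fetch headers a real browser navigation would send
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
    })
    # Age-verification cookie the site sets for real visitors
    session.cookies.set("justiceGovAgeVerified", "true",
                        domain="www.justice.gov")
    return session
```

From there, every page fetch goes through `session.get(url)` so the headers and cookie ride along automatically.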

Smart Counting Strategy: Instead of scraping every single page, I implemented a 2-request approach - hit page 0 and the last page to get accurate counts without hammering their servers. Much more efficient when you're dealing with data sets that have 50,000+ files.
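The arithmetic behind the two-request count is simple. Here's a sketch of the helper, assuming zero-indexed pages where every page before the last is full (the page size itself comes from the page-0 request):

```python
def total_files(last_page_index, page_size, files_on_last_page):
    """Count files from just two requests: page 0 reveals the page size
    and the last page's index; the last page reveals its own item count.
    Pages 0 .. last_page_index - 1 are assumed to be full."""
    return last_page_index * page_size + files_on_last_page

# e.g. 4 full pages of 50 items plus 17 on the final page = 217 files
```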

Auto-Discovery: The system doesn't just check DS1-DS12, it actually scrapes the disclosures index to find new data sets automatically. So when DS13 drops, we'll catch it.
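The discovery step boils down to scanning the index page for data-set links. A sketch, with a hypothetical link pattern (the real index markup may differ):

```python
import re

# Assumed URL shape for data-set links -- adjust to the real index markup
DS_LINK = re.compile(r'href="[^"]*data-set[^"]*?(\d+)"', re.IGNORECASE)

def discover_data_sets(index_html):
    """Scrape the disclosures index for data-set links and return the
    sorted set of numbers found, so a brand-new DS13 shows up on its own."""
    return sorted({int(n) for n in DS_LINK.findall(index_html)})
```

Comparing the returned list against what's already in the database is what flags a new data set.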

Got the baseline established at 518,960 total files across all current data sets. Some are massive (DS10 has over 500K files!), others are tiny (DS7 only has 17 files).

The Technical Bits

Storing state in SQLite rather than JSON files - cleaner for this kind of structured data. Logging goes to a dedicated log file, and SendGrid handles the email alerts.
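The state table is tiny. Here's a sketch of the schema and the check-and-update step (table and column names are my assumptions, not necessarily the real ones):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS data_sets (
    name       TEXT PRIMARY KEY,            -- e.g. 'DS7'
    file_count INTEGER NOT NULL,
    checked_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def record_count(conn, name, count):
    """Upsert the latest count and return the previous one (None if the
    data set is new), so the caller can decide whether to send an alert."""
    row = conn.execute(
        "SELECT file_count FROM data_sets WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO data_sets (name, file_count) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET file_count = excluded.file_count, "
        "checked_at = CURRENT_TIMESTAMP",
        (name, count),
    )
    conn.commit()
    return row[0] if row else None
```

Whenever `record_count` returns a number lower than the new count, that's the trigger for a SendGrid alert.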

Set up a cron job to run every 6 hours. Not too aggressive, but frequent enough that we'll catch new releases quickly.
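The crontab entry for a 6-hour cadence looks like this (the script and log paths are placeholders, not the real ones):

```
# crontab -e: run the monitor at minute 0 of every 6th hour
0 */6 * * * /usr/bin/python3 /path/to/doj_monitor.py >> /path/to/monitor.log 2>&1
```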

What's Next

The monitoring system is live and running. Next step is committing all these changes to git (yeah, I know, should've done that already).

Longer term, I'm thinking about building an auto-download and OCR pipeline for when new files get detected. Would be pretty slick to automatically process new document dumps and make them searchable.

Feel good about the progress today. Sometimes the best coding sessions are the ones where you fix the little annoying things AND build something genuinely useful.
