THE BAD DAY / WALKTHROUGH

Anatomy of a bad migration

Nobody plans for this. It arrives on a Friday, in a deploy that looked fine in review. Here's the same incident playing out two ways — once the way it usually goes, and once with a tested, off-site checkpoint sealed 30 seconds before the migration ran.

4:55 PM Friday — the deploy

4:55 PM

A migration ships

An ORM auto-migration renames a table. Under the hood it's a DROP and re-create — and a foreign key turns it into a DROP ... CASCADE that quietly takes three dependent tables with it.

4:56 PM

The data is gone

The deploy goes green. Orders, payments, and audit history for the last 18 months are no longer in the database. Nobody notices for forty minutes — until support tickets start.

5:38 PM

The scramble

You confirm the tables are empty. Now the only question that matters: what's the last good copy, and can we actually restore it?

5:40 PM — the fork in the road

Without a tested off-site backup

The platform's point-in-time recovery (PITR) window is 7 days, but it lives in the same account, and restoring from it rolls back everything, including the good writes since 4:56.
On a free tier, there's no PITR at all.
The nightly pg_dump cron exists, but nobody has ever restored from it. The first attempt errors on a missing role.
Hours later you get most of it back. The gap between the last good dump and 4:56 is gone for good.
Monday: you write the incident report explaining what was lost.
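The missing-role error in the list above is a common first-restore surprise: the dump records object owners and grants, and those roles may not exist in the cluster you restore into. As a rough sketch only, assuming the nightly dump is a pg_dump custom-format archive (the story doesn't say which format the cron uses), the helper below builds a pg_restore command that skips ownership and grants and loads data for named tables only. The function name, file name, and database name are hypothetical.

```python
from typing import List, Sequence


def build_table_restore_cmd(archive_path: str, dbname: str,
                            tables: Sequence[str]) -> List[str]:
    """Build a pg_restore argv that loads data for the named tables only.

    Assumes a custom-format archive (pg_dump -Fc) and that the target
    tables already exist but are empty, as in the walkthrough.
    """
    if not tables:
        raise ValueError("name at least one table to restore")
    cmd = [
        "pg_restore",
        f"--dbname={dbname}",
        "--no-owner",   # skip ownership commands, so missing owner roles don't fail the restore
        "--no-acl",     # skip GRANT/REVOKE, which also reference roles
        "--data-only",  # load rows only; the table definitions are left alone
    ]
    for table in tables:
        cmd.append(f"--table={table}")
    cmd.append(archive_path)
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` would run it. Pointing `dbname` at a scratch database first is the safer habit, since a data-only load can still fail on constraints or schema drift, and that is better discovered away from production.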

With an OffsiteDB checkpoint

The CI pipeline sealed a checkpoint at 4:54:30 — a restore-drilled snapshot, tagged pre-deploy-4f3a9c1, taken 30 seconds before the migration.
You already know it restores: it was drilled into a real Postgres cluster when it was sealed, and the same drill is on last month's report.
You restore the three dropped tables from the checkpoint — just those tables, leaving the good writes intact.
restored 184 tables, 9.2M rows — 94 seconds, a command you've watched succeed hundreds of times.
5:42 PM: back online. The incident report is two sentences.

What made the second path possible

Nothing heroic — just three things in place before the bad day, which is the only time they can be:

A pre-migration checkpoint. One step in CI seals a tagged snapshot and blocks until it exists, so there's always a fresh, known-good copy from moments before any migration. See the GitHub Action.
A backup that was already proven. Every snapshot is restored into a throwaway Postgres cluster and row-counted before it's marked sealed — so “can we restore it?” was answered weeks ago, not at 5:40 on a Friday.
An off-site copy you own. The checkpoint lives in your own S3/R2 bucket, encrypted — outside the account, region, and blast radius of the database that just broke.
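The first two items above boil down to two small checks. The sketch below is not OffsiteDB's implementation; it is a minimal illustration under stated assumptions. `drill_mismatches` compares per-table row counts from the source against the throwaway restore, and `wait_for_sealed` blocks a pipeline step until a caller-supplied check reports the checkpoint as sealed. Both function names, the `is_sealed` callback, and the timeout values are hypothetical.

```python
import time
from typing import Callable, Dict, List


def drill_mismatches(source: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Return one message per table whose restored row count differs from the source.

    An empty list means the drill passed. Tables that exist only in the
    restored copy are ignored in this sketch.
    """
    problems = []
    for table, expected in sorted(source.items()):
        actual = restored.get(table)
        if actual is None:
            problems.append(f"{table}: missing after restore")
        elif actual != expected:
            problems.append(f"{table}: expected {expected} rows, got {actual}")
    return problems


def wait_for_sealed(is_sealed: Callable[[], bool],
                    timeout_s: float = 300.0,
                    poll_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until is_sealed() returns True; raise TimeoutError otherwise.

    A CI step would call this before running migrations, so a failed or
    slow checkpoint stops the deploy instead of being skipped.
    """
    deadline = time.monotonic() + timeout_s
    while not is_sealed():
        if time.monotonic() >= deadline:
            raise TimeoutError("checkpoint was not sealed in time; refusing to migrate")
        sleep(poll_s)
```

In a pipeline, the migration step would run only after `wait_for_sealed` returns, and a checkpoint would be marked sealed only when `drill_mismatches` comes back empty. Row counts are a cheap signal rather than a proof of equality, so a stricter drill might also compare checksums.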

The cheapest insurance you'll ever expense

You will need a backup once. That day, you'll want one that's already proven it restores and sits one command away. Start a free trial, see a sample drill report, or read how it handles your credentials.