The Dead-Man's-Switch: Monitoring Automations That Watch Themselves

The scariest failure mode in automation isn't a crash. A crash sends you a red X, a stack trace, an email. The scariest failure mode is silence: the scheduled job that stops running and tells no one.

We run a portfolio of automations — trading bots that paper-trade every market day, data pipelines that ingest healthcare code sets on a monthly cadence, content systems that publish on schedules. Every one of them is a cron job somewhere. And every cron job shares the same weakness: when the scheduler itself breaks — a disabled workflow, an expired credential, a renamed branch, a repo setting someone toggled — nothing runs, so nothing errors, so nothing alerts.

The job doesn't fail. It just stops existing. We've had a workflow sit disabled for weeks because a platform paused it after sixty days of repo inactivity. Everything looked fine. It wasn't running at all.

Inverting the alert

The fix is an old railway idea: the dead man's switch. The driver has to actively hold a lever; if they let go — asleep, incapacitated, gone — the train stops. The system doesn't ask "did something go wrong?" It demands continuous proof that something is going right.

Applied to automations, that means you don't alert on failure signals. You alert on the absence of success signals. Every job in the portfolio is expected to check in (a green workflow run, a fresh commit, an updated artifact) on its own cadence: daily for the trading bots, monthly for the ingest pipelines. A watchdog process holds the list of expectations and runs on its own schedule, asking one question per entry: "Have I seen good news from this job recently enough?"

If there is no good news, whatever the reason (crash, disablement, deleted workflow, auth expiry), the watchdog opens an alert. Silence and failure become the same event, which is exactly what they are operationally.
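
As a minimal sketch of that inverted check: the function below never looks at failure signals, only at the age of the last success. The name is_stale, the 36-hour allowance, and the timestamps are illustrative assumptions, not details of our watchdog.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_success: Optional[datetime], max_age: timedelta, now: datetime) -> bool:
    """Return True when there is no recent-enough success signal.

    A job that has never reported (last_success is None) is treated exactly
    like one that reported too long ago: silence and failure are one event.
    """
    if last_success is None:
        return True
    return now - last_success > max_age


# Illustrative values only: a fixed "now" keeps the example deterministic.
now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
daily = timedelta(hours=36)  # assumed slack for a once-a-day job

fresh = is_stale(datetime(2024, 3, 15, 1, 0, tzinfo=timezone.utc), daily, now)
silent = is_stale(datetime(2024, 3, 12, 1, 0, tzinfo=timezone.utc), daily, now)
never = is_stale(None, daily, now)
```

Note that the function takes no "reason" argument. A crash, a disabled workflow, and an expired credential all look identical from here, which is the point.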

What the watchlist looks like

Ours is deliberately boring: a checked-in list of jobs, each with a name, where to look for its heartbeat, and a maximum acceptable staleness. The watchdog reads it, checks each source, and opens or closes issues accordingly. When a job is retired on purpose, you delete its entry in the same commit that retires it — the watchlist is part of the change, not an afterthought.
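
One way such a watchlist could look in code is sketched below. The entry fields mirror the three things described above (name, heartbeat location, maximum staleness); the job names, source strings, thresholds, and the last_success_for lookup are all hypothetical stand-ins, and a real watchdog would query its platform's API instead of a dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional


@dataclass(frozen=True)
class WatchEntry:
    name: str                 # human-readable job name
    source: str               # where to look for the heartbeat (hypothetical identifier)
    max_staleness: timedelta  # longest acceptable gap between success signals


# Hypothetical entries. Retiring a job means deleting its line in the same commit.
WATCHLIST = [
    # Runs on market days only, so allow for weekends and holidays.
    WatchEntry("trading-bot-paper", "workflow:paper-trade.yml", timedelta(days=4)),
    # Monthly cadence plus a few days of slack.
    WatchEntry("code-set-ingest", "workflow:ingest.yml", timedelta(days=35)),
]


def find_stale(
    watchlist: List[WatchEntry],
    last_success_for: Callable[[str], Optional[datetime]],
    now: datetime,
) -> List[str]:
    """Return the names of jobs with no recent-enough success signal."""
    stale = []
    for entry in watchlist:
        last = last_success_for(entry.source)
        if last is None or now - last > entry.max_staleness:
            stale.append(entry.name)
    return stale


# Stand-in for a real heartbeat lookup (assumption: an in-memory dict).
NOW = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
FAKE_HEARTBEATS = {
    "workflow:paper-trade.yml": datetime(2024, 3, 14, 21, 0, tzinfo=timezone.utc),
    "workflow:ingest.yml": datetime(2024, 1, 31, 6, 0, tzinfo=timezone.utc),
}

stale_jobs = find_stale(WATCHLIST, FAKE_HEARTBEATS.get, NOW)
```

Passing the lookup in as a function keeps the staleness logic testable without network access, which matters for a tool whose whole job is to be trustworthy.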

Two design details earned their keep:

Alerts auto-close. When the heartbeat comes back, the watchdog closes its own issue. Noise you have to manually clean up trains you to ignore the channel. An alert channel you can trust is the entire point.
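
Auto-closing falls out naturally if each run compares two sets: the jobs that are stale right now, and the jobs that already have an open alert. The sketch below shows that reconciliation under assumed names (reconcile, and the example job names); how alerts are actually opened and closed depends on the issue tracker and is left out.

```python
from typing import List, Set, Tuple


def reconcile(stale_jobs: Set[str], open_alerts: Set[str]) -> Tuple[List[str], List[str]]:
    """Decide which alerts to open and which to close on this run.

    stale_jobs:  jobs with no recent-enough heartbeat right now
    open_alerts: jobs that already have an open watchdog alert
    """
    to_open = sorted(stale_jobs - open_alerts)   # newly silent: raise an alert
    to_close = sorted(open_alerts - stale_jobs)  # heartbeat is back: clean up
    return to_open, to_close


# Hypothetical state for one watchdog run.
to_open, to_close = reconcile(
    stale_jobs={"code-set-ingest", "content-publisher"},
    open_alerts={"content-publisher", "trading-bot-paper"},
)
```

Here "content-publisher" is stale and already has an alert, so it appears in neither list and no duplicate is filed. In practice open_alerts should contain only alerts the watchdog itself created (for example, identified by a dedicated label), so it never closes something a person opened.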

The watchdog watches itself. It's a scheduled job too, which means it has the same silent-death failure mode as everything it monitors. Its own run history is the heartbeat of last resort: we glance at it whenever we touch the repo, so the interval between "watchdog died" and "human noticed" is bounded by how often we do routine maintenance. Not perfect, but honest: at some point the recursion has to end at a person.

Retiring things is where it pays off

The unexpected benefit wasn't catching crashes — tests catch crashes. It was catching drift between intention and reality. When we decommissioned a prediction service and its scheduled workflows kept "running" against dead endpoints, the watchdog is what surfaced the mismatch. When a content format was retired but its generator kept a slot on the watchlist, the stale alert forced the conversation: is this dead or alive? Decide, commit, move on.

Automations accumulate. Every one you add is a small promise that something will keep happening without you. A dead man's switch is how you keep count of your promises and find out which ones you've silently broken.

If you run more than two scheduled jobs, you need one. Ours is a few hundred lines and it has paid for itself many times over.
