
On-Call Without Burnout

Vivek Pillai·14 Sep 2024


On-call is one of the most psychologically demanding parts of working in infrastructure. You carry the weight of production in your pocket — every buzz could be a crisis. Done wrong, it erodes engineers, degrades team culture, and paradoxically makes systems less reliable because burned-out engineers make worse decisions.

Done right, on-call is manageable, fair, and even a source of deep systems knowledge. Here's what that looks like.

The Numbers Don't Lie

Before changing anything, measure what's actually happening:

  • How many pages per on-call shift?
  • What percentage of pages fire between 10 PM and 6 AM?
  • What percentage are actionable vs. noise?
  • How long do incidents take to resolve on average?

In most teams that report burnout, the data reveals a small number of noisy, poorly tuned alerts causing the majority of the pain. Fix those first — everything else is secondary.
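The four questions above can be computed directly from a log of page events. This is a minimal sketch — the `Page` record and its field names are illustrative assumptions, not the schema of any particular paging tool:

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class Page:
    """One page event. Fields are hypothetical, not a real tool's schema."""
    fired_at: datetime
    actionable: bool            # did the responder actually have to do something?
    minutes_to_resolve: float

def oncall_metrics(pages: list[Page], shifts: int) -> dict:
    """Compute the four burnout indicators from a list of page events."""
    # "Night" here means 10 PM–6 AM local time, matching the question above.
    night = [p for p in pages
             if p.fired_at.time() >= time(22, 0) or p.fired_at.time() < time(6, 0)]
    actionable = [p for p in pages if p.actionable]
    return {
        "pages_per_shift": len(pages) / shifts,
        "pct_night": 100 * len(night) / len(pages),
        "pct_actionable": 100 * len(actionable) / len(pages),
        "avg_minutes_to_resolve": sum(p.minutes_to_resolve for p in pages) / len(pages),
    }
```

Run this over a quarter of data before touching any alert thresholds — the numbers usually point at a handful of offenders.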

Alert Quality Over Alert Quantity

The most common mistake is conflating "monitoring coverage" with "alerting everything that could possibly be wrong." These are different things.

A good alert has three properties:

  1. Actionable — the on-call engineer can do something about it right now
  2. Urgent — if they don't act now, something meaningful breaks
  3. Accurate — it fires when the condition is true, not spuriously

Alerts that don't meet all three criteria should be demoted to a dashboard, a ticket, or deleted entirely.

Rule of thumb: if you see the alert and your first thought is
"I need to check if this is actually a problem", it's not ready
to page humans.
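The three criteria translate into a simple routing decision. A sketch, assuming a hypothetical `Alert` record — the destinations ("page", "ticket", "dashboard") mirror the demotion options above:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """Illustrative alert metadata; not any monitoring system's real schema."""
    name: str
    actionable: bool
    urgent: bool
    accurate: bool

def route(alert: Alert) -> str:
    """Only alerts meeting all three criteria may page a human."""
    if alert.actionable and alert.urgent and alert.accurate:
        return "page"
    if alert.actionable and not alert.urgent:
        return "ticket"      # needs action eventually, not at 3 AM
    if not alert.actionable:
        return "dashboard"   # informational: visible, but never pages
    return "tune-or-delete"  # urgent and actionable but fires spuriously: fix first
```

The useful part is not the code but the forcing function: every alert must name its destination, and "page" has to be earned.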

Rotation Design

Size Matters

A rotation with only 2 or 3 people means someone carries the pager every second or third week. That's not a rotation — that's a slow accumulation of resentment. Aim for 5–7 engineers minimum.

If your team is smaller, talk honestly about whether you have enough people to operate the system reliably. Sometimes the honest answer is "we need to hire" or "we need to reduce scope."

Primary and Secondary

A two-tier system reduces pressure on the primary:

  • Primary: responds to pages, triages, may resolve or escalate
  • Secondary: backup if primary doesn't acknowledge within 5 minutes; available for major incidents

This catches the "primary is in a tunnel / asleep through their phone" case without waking the entire team.

Handoff Rituals

End every on-call shift with a written handoff:

## On-Call Handoff — Week of Sep 14

### Incidents this week
- 2024-09-12 02:14 — db-replica lag spike, resolved by restarting replication thread
  - Root cause: temp table bloat, ticket #4421 opened
- 2024-09-13 18:47 — false positive from disk-usage alert (Jira cleanup backlog)
  - Alert muted, tuning ticket #4428 opened

### Things to watch
- payment-service deploy going out Monday — owner: @alice
- db-primary maintenance window Wed 01:00–03:00 UTC

### Alert noise this week: 3 noisy alerts → muted pending tuning

This creates institutional memory and ensures the incoming engineer isn't flying blind.
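Handoffs happen more reliably when the skeleton writes itself. A sketch that renders the template above from structured records — the record fields (`when`, `summary`, `followup`) are assumptions for illustration:

```python
def render_handoff(week: str, incidents: list[dict], watch: list[str]) -> str:
    """Render the weekly handoff markdown from incident and watch-list records."""
    lines = [f"## On-Call Handoff — Week of {week}", "", "### Incidents this week"]
    for inc in incidents:
        lines.append(f"- {inc['when']} — {inc['summary']}")
        lines.append(f"  - {inc['followup']}")
    lines += ["", "### Things to watch"]
    lines += [f"- {item}" for item in watch]
    return "\n".join(lines)
```

Wire it to whatever already holds your incident records; the point is that the outgoing engineer edits a draft instead of staring at a blank page at 5 PM on Friday.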

Post-Incident Process

Every significant incident deserves a blameless post-mortem. The word "blameless" is often used but rarely practiced. Here's what it actually means:

It means: the goal is to understand system and process failures, not to identify who made the mistake.

It doesn't mean: people aren't accountable for their actions.

A good post-mortem template:

## Incident Post-Mortem: [Title]
Date: | Duration: | Severity: | Author:

### Impact
What broke, how many users affected, business impact.

### Timeline
- 02:14 — Alert fires: high replication lag
- 02:17 — On-call acknowledges
- 02:23 — Root cause identified: temp table bloat
- 02:31 — Replication thread restarted, lag resolving
- 02:45 — Incident resolved

### Root Cause
What actually caused this?

### Contributing Factors
What conditions allowed the root cause to have this impact?

### What Went Well
Honest acknowledgment of what worked.

### Action Items
| Item | Owner | Due Date |
|------|-------|----------|
| Add temp table monitoring | @bob | Sep 21 |
| Update runbook for replication lag | @alice | Sep 18 |

The action items are the most important part. Post-mortems without action items are just documentation of failures.
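Follow-through is easy to check mechanically. A minimal sketch, assuming action items are exported as simple records (`item`, `due`, `done` are illustrative field names — a real version would query your ticketing system):

```python
from datetime import date

def overdue(items: list[dict], today: date) -> list[str]:
    """Return action items past their due date and not done — the ones
    that turn a post-mortem back into mere documentation of failure."""
    return [i["item"] for i in items if i["due"] < today and not i["done"]]
```

Reviewing this list at the start of each retro keeps post-mortem commitments from quietly evaporating.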

Sustainable Compensation

On-call compensation is a loaded topic, but the principle is simple: if you're asking someone to carry a pager and potentially be woken at 3 AM, that has a cost. That cost should be explicitly acknowledged and compensated — whether through pay, time-off-in-lieu, or reduced sprint commitments after heavy weeks.

Teams that don't acknowledge this cost accumulate hidden debt that manifests as attrition.

The Reliability Feedback Loop

Here's the sustainable model:

  1. Engineers on-call encounter pain points
  2. Pain points are logged and tracked (not just suffered)
  3. Team allocates time each sprint to reduce on-call burden
  4. Fixes are deployed, metrics tracked
  5. On-call becomes less painful over time

The key word is allocate. If "reducing alert noise" competes with feature work every sprint and always loses, it will never happen. Protect that time.

Many teams use a rule: dedicate 20% of engineering capacity to reliability and operational improvements. The exact number matters less than the commitment being explicit and defended.
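Making the commitment explicit can be as blunt as reserving the points before feature planning starts. A trivial sketch of that arithmetic (the 20% default is the rule of thumb above, not a law):

```python
def reliability_budget(points_per_sprint: int, pct: float = 0.20) -> int:
    """Points reserved for reliability/on-call work before any feature
    planning. Defaults to the 20% rule of thumb."""
    return round(points_per_sprint * pct)
```

The value isn't the math — it's that the reserved number appears in the sprint plan before anything can compete with it.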

Signs You're Doing It Right

  • Engineers volunteer to cover extra on-call shifts (they don't dread it)
  • Alert volume trends downward quarter over quarter
  • Post-mortems produce action items that actually get done
  • New engineers are onboarded to on-call gradually, with support
  • The team discusses on-call health openly in retrospectives

On-call doesn't have to be a rite of suffering. It can be a well-supported function that teaches you more about your systems than anything else — if you build it that way intentionally.
