Automating Runbooks with Python

There's a moment every SRE knows too well: 3 AM, pager screaming, half-asleep, staring at a runbook that says "Step 4: SSH into db-primary and run SELECT * FROM pg_stat_activity...". You do it. You document it. Two weeks later, the same alert fires and you do it again.

That's waste. And waste at 3 AM is expensive.

What Makes a Good Automation Candidate

Not every runbook step should be automated. The right candidates share a few properties:

Deterministic: same inputs always produce same outputs
Low blast radius if wrong: restarting a service is safer than deleting data
High frequency: if it happens once a year, manual is fine
Well-understood: automate what you understand, not what you're still learning

Start with the steps you've personally executed more than five times.

The Anatomy of an Automated Runbook

A runbook script should read like the original runbook — clear steps, explicit checks, human-readable output:

#!/usr/bin/env python3
"""
Runbook: High memory usage on application pods
Triggered by: alert/high-mem-usage
"""

import subprocess
import sys
from datetime import datetime

def log(msg: str) -> None:
    print(f"[{datetime.now().strftime('%H:%M:%S')}] {msg}")

def get_top_consumers(namespace: str, limit: int = 5) -> list[dict]:
    """Return top memory-consuming pods in namespace."""
    result = subprocess.run(
        ["kubectl", "top", "pods", "-n", namespace, "--sort-by=memory"],
        capture_output=True, text=True, check=True
    )
    lines = result.stdout.strip().split("\n")[1:limit+1]  # skip header
    pods = []
    for line in lines:
        parts = line.split()
        pods.append({"name": parts[0], "cpu": parts[1], "memory": parts[2]})
    return pods

def restart_pod(namespace: str, pod_name: str) -> bool:
    """Delete pod — controller will recreate it."""
    log(f"Restarting pod: {pod_name}")
    result = subprocess.run(
        ["kubectl", "delete", "pod", pod_name, "-n", namespace],
        capture_output=True, text=True
    )
    return result.returncode == 0

def main(namespace: str = "production") -> None:
    log(f"Starting memory runbook for namespace: {namespace}")

    # Step 1: Identify top consumers
    log("Step 1: Identifying top memory consumers...")
    pods = get_top_consumers(namespace)
    for pod in pods:
        log(f"  {pod['name']}: {pod['memory']} memory, {pod['cpu']} CPU")

    # Step 2: Check if any are above threshold
    # (simplified: assumes kubectl reports memory in Mi, e.g. "1536Mi")
    high_mem = [p for p in pods
                if p["memory"].endswith("Mi") and int(p["memory"][:-2]) >= 1024]

    if not high_mem:
        log("No pods above 1Gi threshold. Runbook complete — no action needed.")
        return

    # Step 3: Restart highest consumer if safe
    target = high_mem[0]
    log(f"Step 3: Restarting {target['name']}...")
    if restart_pod(namespace, target["name"]):
        log(f"Successfully restarted {target['name']}. Monitor for recovery.")
    else:
        log("ERROR: Restart failed. Escalate to on-call lead.")
        sys.exit(1)

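if __name__ == "__main__":
    # The first positional argument is the namespace; flags such as --auto are skipped
    positional = [arg for arg in sys.argv[1:] if not arg.startswith("-")]
    main(positional[0] if positional else "production")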
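
The threshold check in Step 2 is deliberately simplified. A minimal sketch of the parsing it alludes to follows, assuming memory quantities carry the binary suffixes Ki, Mi, or Gi. The helper names `memory_to_mib` and `above_threshold` and the 1024 MiB default are illustrative, not part of the original script.

```python
# Hypothetical helpers: convert Kubernetes-style memory quantities to MiB.
_UNIT_TO_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}

def memory_to_mib(quantity: str) -> float:
    """Parse strings such as '512Mi' or '2Gi' into a number of MiB."""
    for suffix, factor in _UNIT_TO_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"Unrecognized memory quantity: {quantity!r}")

def above_threshold(pods: list, threshold_mib: float = 1024) -> list:
    """Keep only pods whose reported memory meets or exceeds the threshold."""
    return [p for p in pods if memory_to_mib(p["memory"]) >= threshold_mib]
```

Raising on an unrecognized suffix is a deliberate choice here: a runbook that silently treats unparseable values as "fine" would hide exactly the pods it is meant to find.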

Making It Safe

Raw automation scripts that touch production are dangerous. Layer in safety:

Dry-Run Mode

import os

DRY_RUN = os.getenv("DRY_RUN", "false").lower() == "true"

def restart_pod(namespace: str, pod_name: str) -> bool:
    if DRY_RUN:
        log(f"[DRY RUN] Would delete pod: {pod_name}")
        return True
    # ... actual restart

Always test with DRY_RUN=true first. Make this the default in non-production environments.
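
One way to make dry-run the default outside production is to derive it from the environment name whenever DRY_RUN is not set explicitly. This is a sketch under assumptions: the `ENVIRONMENT` variable and the `dry_run_enabled` helper are hypothetical names, and your deployment may expose its environment differently.

```python
import os

def dry_run_enabled(environ=os.environ) -> bool:
    """Decide whether the runbook should run in dry-run mode.

    An explicit DRY_RUN value always wins. Without one, only an
    environment named "production" (hypothetical ENVIRONMENT variable)
    performs real actions; everything else defaults to dry-run.
    """
    explicit = environ.get("DRY_RUN")
    if explicit is not None:
        return explicit.strip().lower() == "true"
    return environ.get("ENVIRONMENT", "development").lower() != "production"
```

Passing the mapping as a parameter keeps the decision easy to test without touching the real process environment.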

Confirmation Gates

For destructive actions, require explicit confirmation:

def confirm(action: str) -> bool:
    response = input(f"About to: {action}\nProceed? [yes/no]: ")
    return response.strip().lower() == "yes"

In fully automated mode there is no operator to answer the prompt, so skip the confirmation, but log the action prominently and send a Slack notification.
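
Both modes can share one gate. The sketch below assumes an `auto` flag set by whatever launches the script, plus a `notify` callback standing in for the Slack notification. Both names are hypothetical, and the actual Slack call is left out on purpose.

```python
from typing import Callable

def gate(action: str, auto: bool = False,
         notify: Callable[[str], None] = print,
         ask: Callable[[str], str] = input) -> bool:
    """Return True if the action may proceed.

    Interactive mode asks the operator. Automated mode never blocks on
    input and announces the action through `notify` instead.
    """
    if auto:
        notify(f"[AUTO] Proceeding without confirmation: {action}")
        return True
    response = ask(f"About to: {action}\nProceed? [yes/no]: ")
    return response.strip().lower() == "yes"
```

Injecting `ask` and `notify` means the gate can be exercised in tests without a terminal or a chat integration.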

Idempotency

Design every step to be safe to run twice:

def ensure_config_set(key: str, value: str) -> None:
    """Set config value — safe to call multiple times."""
    current = get_config(key)
    if current == value:
        log(f"Config {key} already set to {value}, skipping.")
        return
    set_config(key, value)
    log(f"Set {key} = {value}")

Connecting to Your Alert System

Runbooks save the most time when they run automatically as soon as the alert fires. With PagerDuty webhooks:

# PagerDuty webhook handler (Flask)
import subprocess

from flask import Flask, request
app = Flask(__name__)

RUNBOOK_MAP = {
    "high-mem-usage": "runbooks/high_memory.py",
    "disk-usage-critical": "runbooks/disk_cleanup.py",
}

@app.route("/webhook/pagerduty", methods=["POST"])
def pagerduty_webhook():
    event = request.get_json(silent=True) or {}  # tolerate missing or non-JSON bodies
    alert_name = event.get("alert", {}).get("summary", "")

    if alert_name in RUNBOOK_MAP:
        script = RUNBOOK_MAP[alert_name]
        subprocess.Popen(["python3", script, "--auto"])
        return {"status": "runbook_triggered", "script": script}, 200

    return {"status": "no_runbook_found"}, 200
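
An endpoint that launches scripts against production should confirm that a request really came from the alerting system before acting on it. The sketch below assumes an HMAC-SHA256 signature over the raw request body, delivered in a header as one or more comma-separated `v1=<hex>` values. That is the scheme PagerDuty describes for its v3 webhooks, but check the header name and format against the current documentation. `signature_is_valid` is a hypothetical helper name.

```python
import hashlib
import hmac

def signature_is_valid(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check a signature header of the form 'v1=<hex>[, v1=<hex> ...]'."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    expected = f"v1={digest}".encode()
    candidates = [part.strip().encode() for part in header_value.split(",")]
    # compare_digest avoids leaking timing information about the match.
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

In the Flask handler, this check would run on `request.get_data()` and the signature header before the alert name is looked up, returning a 401 when it fails.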

Audit Trails Are Mandatory

Every automated action in production must be logged somewhere humans can see:

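import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

AUDIT_LOG = Path("/var/log/runbooks/audit.jsonl")

def audit(action: str, target: str, outcome: str, metadata: Optional[dict] = None) -> None:
    entry = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "outcome": outcome,
        "triggered_by": os.getenv("TRIGGERED_BY", "manual"),
        **(metadata or {}),  # None default avoids a shared mutable default argument
    }
    # Create the log directory on first use
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")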

This lets you answer "what ran last night at 3 AM and why?" without waking anyone up.
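
Answering that question amounts to filtering the JSONL file by time window. A minimal sketch, assuming each line is a JSON object with an ISO 8601 `timestamp` field as written above, and treating timestamps without an offset as UTC. `entries_between` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def entries_between(log_path: Path, start: datetime, end: datetime) -> list:
    """Return audit entries whose timestamp falls in [start, end).

    `start` and `end` must be timezone-aware datetimes.
    """
    matches = []
    with log_path.open() as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            ts = datetime.fromisoformat(entry["timestamp"])
            if ts.tzinfo is None:
                ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is stored
            if start <= ts < end:
                matches.append(entry)
    return matches
```

Calling it with a window from 03:00 to 04:00 UTC returns every action recorded in that hour, including the `triggered_by` field that explains why each one ran.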

Start Small

The worst mistake is trying to automate everything at once. Pick one runbook — ideally the one that fires most often — and automate just that. Run it in dry-run mode for a week. Then enable it in non-production. Then production.

Automation earns trust the same way people do: by being reliable in small things before big ones.