Automating Runbooks with Python
There's a moment every SRE knows too well: 3 AM, pager screaming, half-asleep, staring at a runbook that says "Step 4: SSH into db-primary and run SELECT pg_stat_activity...". You do it. You document it. Two weeks later, the same alert fires and you do it again.
That's waste. And waste at 3 AM is expensive.
What Makes a Good Automation Candidate
Not every runbook step should be automated. The right candidates share a few properties:
- Deterministic: same inputs always produce same outputs
- Low blast radius if wrong: restarting a service is safer than deleting data
- High frequency: if it happens once a year, manual is fine
- Well-understood: automate what you understand, not what you're still learning
Start with the steps you've personally executed more than five times.
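If you want to make that checklist explicit, here is a minimal sketch. The `RunbookStep` class and `unmet_criteria` function are hypothetical names invented for illustration, and the `min_runs` default simply encodes the "more than five times" rule of thumb above as a stand-in for "high frequency".

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    """Hypothetical record describing one manual runbook step."""
    name: str
    deterministic: bool
    low_blast_radius: bool
    well_understood: bool
    times_executed: int  # how often you have run this step by hand


def unmet_criteria(step: RunbookStep, min_runs: int = 5) -> list[str]:
    """Return the criteria the step fails; an empty list means it is a candidate."""
    checks = {
        "deterministic": step.deterministic,
        "low blast radius": step.low_blast_radius,
        "high frequency": step.times_executed > min_runs,
        "well-understood": step.well_understood,
    }
    return [name for name, ok in checks.items() if not ok]


restart = RunbookStep("restart app pod", True, True, True, times_executed=12)
purge = RunbookStep("purge old rows", True, False, True, times_executed=2)
print(unmet_criteria(restart))  # → []
print(unmet_criteria(purge))    # → ['low blast radius', 'high frequency']
```

Returning the failed criteria instead of a bare boolean tells you what to fix before a step is ready.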
The Anatomy of an Automated Runbook
A runbook script should read like the original runbook — clear steps, explicit checks, human-readable output:
#!/usr/bin/env python3
"""
Runbook: High memory usage on application pods
Triggered by: alert/high-mem-usage
"""
import subprocess
import sys
from datetime import datetime


def log(msg: str) -> None:
    print(f"[{datetime.now().strftime('%H:%M:%S')}] {msg}")


def get_top_consumers(namespace: str, limit: int = 5) -> list[dict]:
    """Return top memory-consuming pods in namespace."""
    result = subprocess.run(
        ["kubectl", "top", "pods", "-n", namespace, "--sort-by=memory"],
        capture_output=True, text=True, check=True
    )
    lines = result.stdout.strip().split("\n")[1:limit + 1]  # skip header
    pods = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # ignore blank or malformed rows
        pods.append({"name": parts[0], "cpu": parts[1], "memory": parts[2]})
    return pods


def restart_pod(namespace: str, pod_name: str) -> bool:
    """Delete the pod; its controller will recreate it."""
    log(f"Restarting pod: {pod_name}")
    result = subprocess.run(
        ["kubectl", "delete", "pod", pod_name, "-n", namespace],
        capture_output=True, text=True
    )
    return result.returncode == 0


def main(namespace: str = "production") -> None:
    log(f"Starting memory runbook for namespace: {namespace}")

    # Step 1: Identify top consumers
    log("Step 1: Identifying top memory consumers...")
    pods = get_top_consumers(namespace)
    for pod in pods:
        log(f"  {pod['name']}: {pod['memory']} memory, {pod['cpu']} CPU")

    # Step 2: Check if any are above threshold
    # (simplified: a real implementation parses Mi/Gi values)
    high_mem = [p for p in pods if "Gi" in p["memory"]]
    if not high_mem:
        log("No pods above 1Gi threshold. Runbook complete, no action needed.")
        return

    # Step 3: Restart highest consumer if safe
    target = high_mem[0]
    log(f"Step 3: Restarting {target['name']}...")
    if restart_pod(namespace, target["name"]):
        log(f"Successfully restarted {target['name']}. Monitor for recovery.")
    else:
        log("ERROR: Restart failed. Escalate to on-call lead.")
        sys.exit(1)


if __name__ == "__main__":
    # Skip flags such as --auto so they are not mistaken for the namespace.
    positional = [arg for arg in sys.argv[1:] if not arg.startswith("--")]
    main(positional[0] if positional else "production")
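The threshold check in Step 2 is a substring match, which is fragile: `kubectl top` often reports memory in `Mi`, so a pod at `2048Mi` would slip through. Here is a rough sketch of what parsing the values could look like. `parse_memory` and `pods_above_threshold` are hypothetical helpers, and only the binary suffixes `Ki` through `Ti` are handled.

```python
# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
_UNIT_BYTES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

THRESHOLD_BYTES = 1024**3  # 1Gi, the threshold named in the runbook's log message


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '2Gi' to bytes."""
    for suffix, factor in _UNIT_BYTES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no recognized suffix: assume plain bytes


def pods_above_threshold(pods: list[dict], threshold: int = THRESHOLD_BYTES) -> list[dict]:
    """Keep pods whose 'memory' field exceeds the threshold, preserving order."""
    return [p for p in pods if parse_memory(p["memory"]) > threshold]
```

With this in place, the list comprehension in Step 2 could become `high_mem = pods_above_threshold(pods)`.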
Making It Safe
Raw automation scripts that touch production are dangerous. Layer in safety:
Dry-Run Mode
import os

DRY_RUN = os.getenv("DRY_RUN", "false").lower() == "true"


def restart_pod(namespace: str, pod_name: str) -> bool:
    if DRY_RUN:
        log(f"[DRY RUN] Would delete pod: {pod_name}")
        return True
    # ... actual restart
Always test with DRY_RUN=true first. Make this the default in non-production environments.
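One way to get that default is to derive it from the environment the script runs in. This is only a sketch: `ENVIRONMENT` is a hypothetical variable name, and `dry_run_enabled` is an illustrative helper. An explicit `DRY_RUN` always wins. Without it, anything not clearly marked as production stays in dry-run.

```python
import os
from typing import Mapping, Optional


def dry_run_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Explicit DRY_RUN wins; otherwise dry-run unless ENVIRONMENT is 'production'."""
    env = os.environ if env is None else env
    explicit = env.get("DRY_RUN")
    if explicit is not None:
        return explicit.strip().lower() == "true"
    # ENVIRONMENT is a hypothetical variable name used for illustration.
    # An unset or unknown value fails safe: the script stays in dry-run.
    return env.get("ENVIRONMENT", "").strip().lower() != "production"
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.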
Confirmation Gates
For destructive actions, require explicit confirmation:
def confirm(action: str) -> bool:
    response = input(f"About to: {action}\nProceed? [yes/no]: ")
    return response.strip().lower() == "yes"
Or, in fully automated mode, skip the confirmation but log prominently and send a Slack notification.
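Both modes can sit behind a single gate. The sketch below is an assumption about how you might wire it, not a prescribed design: `gate` is a hypothetical name, `notify` stands in for whatever function posts to your Slack channel, and `ask` is injectable only so the prompt can be tested.

```python
from typing import Callable


def gate(action: str, auto: bool, notify: Callable[[str], None],
         ask: Callable[[str], str] = input) -> bool:
    """Return True if the action may proceed."""
    if auto:
        # Unattended run: nobody is there to ask, so announce loudly instead.
        message = f"[AUTO] Proceeding without confirmation: {action}"
        print(message)
        notify(message)  # hypothetical hook, e.g. a function that posts to Slack
        return True
    response = ask(f"About to: {action}\nProceed? [yes/no]: ")
    return response.strip().lower() == "yes"
```

A call might look like `gate("delete pod web-1", auto=False, notify=post_to_slack)`, where `post_to_slack` is yours to supply.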
Idempotency
Design every step to be safe to run twice:
# get_config and set_config are placeholders for your own config store's read/write calls.
def ensure_config_set(key: str, value: str) -> None:
    """Set config value; safe to call multiple times."""
    current = get_config(key)
    if current == value:
        log(f"Config {key} already set to {value}, skipping.")
        return
    set_config(key, value)
    log(f"Set {key} = {value}")
Connecting to Your Alert System
Runbooks only save time if they run automatically when the alert fires. With PagerDuty webhooks:
# PagerDuty webhook handler (Flask)
import subprocess

from flask import Flask, request

app = Flask(__name__)

RUNBOOK_MAP = {
    "high-mem-usage": "runbooks/high_memory.py",
    "disk-usage-critical": "runbooks/disk_cleanup.py",
}


@app.route("/webhook/pagerduty", methods=["POST"])
def pagerduty_webhook():
    # Tolerate a missing or non-JSON body instead of failing on None.
    event = request.get_json(silent=True) or {}
    # Payload shape is simplified; adjust this lookup to the webhook version you use.
    alert_name = event.get("alert", {}).get("summary", "")
    if alert_name in RUNBOOK_MAP:
        script = RUNBOOK_MAP[alert_name]
        # Fire and forget: the runbook runs in its own process.
        subprocess.Popen(["python3", script, "--auto"])
        return {"status": "runbook_triggered", "script": script}, 200
    return {"status": "no_runbook_found"}, 200
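The handler launches each script with `--auto`, so the runbook has to recognize that flag instead of reading it as a namespace. A minimal sketch using `argparse` from the standard library follows. The `parse_args` helper and the help strings are illustrative, and what the script does differently in auto mode is up to you.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the runbook's command line: an optional namespace plus --auto."""
    parser = argparse.ArgumentParser(description="High memory runbook")
    parser.add_argument("namespace", nargs="?", default="production",
                        help="Kubernetes namespace to inspect")
    parser.add_argument("--auto", action="store_true",
                        help="unattended run triggered by an alert; skip prompts")
    return parser.parse_args(argv)
```

In the script's entry point you would call `args = parse_args()` and pass `args.namespace` to `main`, using `args.auto` to decide whether to prompt.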
Audit Trails Are Mandatory
Every automated action in production must be logged somewhere humans can see:
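import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

AUDIT_LOG = Path("/var/log/runbooks/audit.jsonl")


def audit(action: str, target: str, outcome: str, metadata: Optional[dict] = None) -> None:
    entry = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "outcome": outcome,
        "triggered_by": os.getenv("TRIGGERED_BY", "manual"),
        # A None default avoids sharing one mutable dict across calls.
        **(metadata or {}),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")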
This lets you answer "what ran last night at 3 AM and why?" without waking anyone up.
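Answering that question is then a matter of filtering the file by time window. Here is a small sketch, assuming each line is a JSON object with an ISO 8601 `timestamp` field as written by `audit`. `entries_between` is a hypothetical helper, and timestamps without an offset are assumed to be UTC.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator


def entries_between(path: Path, start: datetime, end: datetime) -> Iterator[dict]:
    """Yield audit entries with start <= timestamp < end (start and end timezone-aware)."""
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            ts = datetime.fromisoformat(entry["timestamp"])
            if ts.tzinfo is None:
                # Assumption: timestamps written without an offset are UTC.
                ts = ts.replace(tzinfo=timezone.utc)
            if start <= ts < end:
                yield entry
```

For example, passing a `start` of 03:00 UTC and an `end` of 04:00 UTC on the night in question lists everything that ran in that hour.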
Start Small
The worst mistake is trying to automate everything at once. Pick one runbook — ideally the one that fires most often — and automate just that. Run it in dry-run mode for a week. Then enable it in non-production. Then production.
Automation earns trust the same way people do: by being reliable in small things before big ones.