# Incident Response — Quick Reference

## Severity Levels

| SEV | Impact | Response Time |
|-----|--------|----------|
| 1 | Critical: full outage, data loss | Immediate (~5 min) |
| 2 | Major: key feature broken | 15–30 min |
| 3 | Minor: limited impact, workaround available | Within hours |
| 4 | Low: cosmetic, edge case | Next business day |
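
For alert routing or paging tools, the table above can be encoded as a lookup. The following is a minimal sketch, not a definitive implementation. The names `Severity`, `RESPONSE_TARGET`, and `is_response_overdue` are hypothetical, and the SEV3 (4 hours) and SEV4 (1 day) values are assumed stand-ins for "Hours" and "Next business day".

```python
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: full outage, data loss
    SEV2 = 2  # Major: key feature broken
    SEV3 = 3  # Minor: limited impact, workaround
    SEV4 = 4  # Low: cosmetic, edge case


# Upper bound of each response window from the table.
# SEV3 and SEV4 durations are assumptions; tune them to local policy.
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(minutes=5),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=4),
    Severity.SEV4: timedelta(days=1),
}


def is_response_overdue(severity: Severity, elapsed: timedelta) -> bool:
    """Return True when the elapsed time exceeds the response target."""
    return elapsed > RESPONSE_TARGET[severity]
```

For example, `is_response_overdue(Severity.SEV1, timedelta(minutes=6))` is true, while a SEV2 that has waited 20 minutes is still within its window.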

## Roles

| Role | Responsibility |
|------|----------------|
| Incident Commander | Coordinate, decide, delegate |
| Comms Lead | Stakeholder updates, status page |
| Responders | Debug, rollback, fix |

## Lifecycle

1. **Detect** — Alert, user report
2. **Triage** — Severity, scope, assign IC
3. **Mitigate** — Rollback, fix, workaround
4. **Resolve** — Verify, all-clear
5. **Postmortem** — Blameless, 5 whys
6. **Action Items** — Track, prevent recurrence
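
If incident state is tracked in tooling, the six phases above can be modelled as an ordered enum. This sketch assumes a strictly linear flow, which is a simplification: real incidents may return to an earlier phase, for example when a fix fails verification. `Phase` and `next_phase` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5
    ACTION_ITEMS = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.ACTION_ITEMS:
        return None
    return Phase(current.value + 1)
```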

## Communication Template

**Internal:** Severity, impact, status, IC, ETA  
**External:** "We're aware of X. Impact: Y. ETA: Z."  
**Cadence:** Every 15–30 min for SEV1/2
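
The templates above can be filled in by small helpers so that every update has the same shape. This is a sketch only. The function names and the `|` separator in the internal format are assumptions. The external wording is taken from the template.

```python
def internal_update(severity: int, impact: str, status: str, ic: str, eta: str) -> str:
    """Format an internal update with the five fields from the template."""
    return f"SEV{severity} | Impact: {impact} | Status: {status} | IC: {ic} | ETA: {eta}"


def external_update(issue: str, impact: str, eta: str) -> str:
    """Format a customer-facing update using the external template wording."""
    return f"We're aware of {issue}. Impact: {impact}. ETA: {eta}."
```

With illustrative values, `external_update("elevated checkout errors", "some payments fail", "30 minutes")` yields "We're aware of elevated checkout errors. Impact: some payments fail. ETA: 30 minutes."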

## Blameless Postmortem

- Summary, Timeline, Root Cause (5 whys), Impact, Action Items, Lessons
- No blame: focus on systems and process
- Action items: owner, due date

## On-Call Best Practices

- Runbooks for common incidents
- Escalation path: primary → secondary → manager
- Rotation: fair, no back-to-back primary
- Post-incident rest for IC
- Compensation / flexibility
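
The escalation path above (primary, then secondary, then manager) can be sketched as a time-based lookup. The 5-minute step between tiers is an assumed value, and `ESCALATION_PATH` and `who_to_page` are hypothetical names. A real paging system would supply its own policy.

```python
from datetime import timedelta

ESCALATION_PATH = ["primary", "secondary", "manager"]


def who_to_page(unacked_for: timedelta, step: timedelta = timedelta(minutes=5)) -> str:
    """Pick the escalation tier from how long the page has gone unacknowledged."""
    tier = int(unacked_for / step)  # whole steps elapsed so far
    # Stay on the last tier once the path is exhausted.
    return ESCALATION_PATH[min(tier, len(ESCALATION_PATH) - 1)]
```

Under these assumptions, a page unacknowledged for 7 minutes goes to the secondary, and anything past 10 minutes goes to the manager.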

## Error Budgets

- SLO: e.g., 99.9% success
- Error budget: allowed failure (0.1% of a 30-day month ≈ 43 min of downtime)
- Incidents consume budget; postmortems guide investment in reliability
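
The arithmetic behind the budget can be sketched as follows. It assumes a time-based availability SLO over a 30-day period. A request-based SLO would count failed requests instead of minutes. The function names are hypothetical.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, period_days)
    return 1.0 - downtime_minutes / budget
```

For a 99.9% SLO, `error_budget_minutes(0.999)` is about 43.2 minutes, so a single 21.6-minute outage consumes roughly half of the month's budget.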
