# Incident Response — From Alert to Postmortem

<!-- hint:slides topic="Incident response lifecycle: severity levels, roles (IC, Comms Lead), triage, mitigation, resolution, and blameless postmortem" slides="6" -->

## Incident Severity Levels

Define severity so everyone knows how to respond. Common tiers:

| Severity | Impact | Example | Response Time |
|----------|--------|---------|---------------|
| **SEV1** | Critical: full outage, data loss, security breach | Site down, payment broken | Immediate (e.g., 5 min) |
| **SEV2** | Major: significant degradation, key feature broken | Search down, 50% errors | Within 15–30 min |
| **SEV3** | Minor: limited impact, workaround exists | One region slow, non-critical bug | Within hours |
| **SEV4** | Low: cosmetic or edge case | UI glitch, rare error | Next business day |

Customize thresholds for your system. Document them in your runbooks so on-call knows when to escalate.
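As one way to make the table concrete, here is a minimal sketch that maps a few coarse signals to a tier. The function name `classify`, its parameters, and the 0.5 and 0.01 error-rate cutoffs are illustrative assumptions, not values prescribed above; replace them with your own thresholds.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify(full_outage: bool, data_at_risk: bool,
             key_feature_broken: bool, error_rate: float) -> Severity:
    """Map coarse impact signals to a severity tier.

    error_rate is a fraction between 0 and 1. The cutoffs below are
    placeholder assumptions and should be tuned per system.
    """
    if full_outage or data_at_risk:
        return Severity.SEV1          # critical: outage, data loss, breach
    if key_feature_broken or error_rate >= 0.5:
        return Severity.SEV2          # major degradation
    if error_rate >= 0.01:
        return Severity.SEV3          # limited impact
    return Severity.SEV4              # cosmetic or edge case
```

Keeping the rules in one small function makes them easy to review alongside the runbook and to change when the thresholds change.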

## Roles During Incidents

| Role | Responsibility |
|------|----------------|
| **Incident Commander (IC)** | Owns the response; coordinates, decides, delegates. Does not fix—orchestrates. |
| **Comms Lead** | Updates stakeholders, status page, and internal channels. Frees IC to focus on resolution. |
| **Responders** | Engineers debugging, rolling back, or applying fixes. Report progress to IC. |

IC and Comms Lead should be different people when possible. IC stays focused on technical decisions; Comms keeps everyone informed.

## The Incident Lifecycle

```mermaid
flowchart LR
    A[Detect] --> B[Triage]
    B --> C[Mitigate]
    C --> D[Resolve]
    D --> E[Postmortem]
    E --> F[Action Items]
```

```mermaid
flowchart TD
  A[Detect] --> B[Triage]
  B --> C[Mitigate]
  C --> D[Resolve]
  D --> E[Postmortem]
  E --> F[Action Items]

  A -.->|Alert, user report| A
  B -.->|Severity, scope| B
  C -.->|Rollback, fix, workaround| C
  D -.->|Verify, all-clear| D
  E -.->|Blameless, 5 whys| E
  F -.->|Track, prevent recurrence| F
```

1. **Detect** — Alert fires, user reports, or monitoring catches the issue.
2. **Triage** — Assess severity, scope, and impact. Assign IC and responders.
3. **Mitigate** — Roll back, apply fix, or implement workaround. Goal: reduce impact.
4. **Resolve** — Verify the fix, confirm stability, declare incident closed.
5. **Postmortem** — Blameless analysis: what happened, why, what we'll do differently.
6. **Action Items** — Track improvements (runbooks, monitoring, code changes).
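The six stages above can be sketched as a small state machine that only moves forward one stage at a time. The `Stage` and `Incident` names are hypothetical, and real incidents may loop back (for example, from Resolve to Mitigate if a fix does not hold), which this sketch deliberately leaves out.

```python
from enum import Enum


class Stage(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5
    ACTION_ITEMS = 6


class Incident:
    """Tracks which lifecycle stage an incident is in."""

    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self) -> Stage:
        """Move to the next stage; refuse to go past the final one."""
        if self.stage is Stage.ACTION_ITEMS:
            raise ValueError("incident is already at the final stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage
```

Recording `history` gives a head start on the postmortem timeline, since each transition is a key event.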

## Communication Templates

**Internal (Slack, email):**
> **Incident: [Brief title]**  
> **Severity:** SEV[1–4]  
> **Impact:** [Who/what is affected]  
> **Status:** Investigating / Mitigating / Resolved  
> **ETA:** [If known]  
> **Incident Commander:** [Name]

**External (status page, customers):**
> **We're aware of [issue].** Impact: [description]. We're investigating and will update within [time]. ETA: [if known].

**Update cadence:** Every 15–30 min for SEV1/2; don't leave people guessing. "No update" is an update—say "Still investigating, next update in 15 min."
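A minimal sketch of the internal template and the update cadence follows. The helper names are hypothetical, and pinning SEV1 to 15 minutes and SEV2 to 30 minutes is one reading of the "15–30 min" range above, chosen here only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed mapping of severity number to minutes between updates.
# Only SEV1/2 have a cadence in the text, so SEV3/4 are left undefined.
UPDATE_INTERVAL_MIN = {1: 15, 2: 30}

VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


def next_update_due(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is owed, or None if no cadence is defined."""
    interval = UPDATE_INTERVAL_MIN.get(severity)
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


def format_internal_update(title: str, severity: int, impact: str,
                           status: str, commander: str,
                           eta: str = "Unknown") -> str:
    """Render the internal template as plain text."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    return "\n".join([
        f"Incident: {title}",
        f"Severity: SEV{severity}",
        f"Impact: {impact}",
        f"Status: {status}",
        f"ETA: {eta}",
        f"Incident Commander: {commander}",
    ])
```

The Comms Lead can call `next_update_due` after each post to set a reminder, so a "still investigating" message goes out on time even when nothing has changed.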

## Blameless Postmortems

**Goal:** Learn, not blame. Focus on systems and process, not individuals.

**Structure:**
1. **Summary** — What happened, in plain language.
2. **Timeline** — Key events with timestamps.
3. **Root cause** — Use "5 whys" or similar. Go past symptoms to contributing factors.
4. **Impact** — Users affected, duration, business impact.
5. **Action items** — What we'll change (runbooks, monitoring, code, process). Owner and due date.
6. **Lessons learned** — What went well, what didn't.

**Rule:** No names in the "who messed up" sense. "The deploy went out" not "Alice deployed the bad code." Discuss the decision-making and systems that allowed the failure.
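The structure above can be captured as a simple record with a completeness check. This is a sketch under assumptions: the class and field names are hypothetical, and the check only verifies that each section is filled in and that every action item has an owner and a due date. It cannot judge whether the writing is blameless.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    due: Optional[date] = None


@dataclass
class Postmortem:
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""
    impact: str = ""
    action_items: List[ActionItem] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List empty sections and action items lacking an owner or due date."""
        issues = []
        for name in ("summary", "timeline", "root_cause", "impact",
                     "action_items", "lessons_learned"):
            if not getattr(self, name):
                issues.append(f"missing section: {name}")
        for item in self.action_items:
            if not item.owner or item.due is None:
                issues.append(
                    f"action item needs owner and due date: {item.description}"
                )
        return issues
```

An empty list from `problems()` means the document is structurally complete and ready for review.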

## On-Call Best Practices

- **Runbooks** — Step-by-step guides for common incidents. "If X, do Y."
- **Escalation paths** — Who gets paged when: primary → secondary → manager. Define how long each level has to acknowledge a page before it escalates.
- **Rotation** — Fair distribution; avoid burnout. Tools: PagerDuty, Opsgenie, Splunk On-Call (formerly VictorOps).
- **Compensation** — On-call pay, time off, or flexibility. Make it sustainable.
- **Training** — Have new on-call engineers shadow an experienced one before taking primary; run game days (simulated incidents).
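The escalation path in the list above can be sketched as a lookup on how long a page has gone unacknowledged. The function name and the 10-minute acknowledgment window are assumptions for illustration; paging tools implement this as configuration rather than code.

```python
ESCALATION_CHAIN = ["primary", "secondary", "manager"]


def who_to_page(minutes_unacknowledged: float,
                ack_timeout_min: float = 10) -> str:
    """Return the escalation level that should currently be paged.

    Each level gets ack_timeout_min minutes (an assumed value) before
    the page moves to the next level. The last level is never skipped.
    """
    if minutes_unacknowledged < 0 or ack_timeout_min <= 0:
        raise ValueError("times must be non-negative and timeout positive")
    level = int(minutes_unacknowledged // ack_timeout_min)
    return ESCALATION_CHAIN[min(level, len(ESCALATION_CHAIN) - 1)]
```

For example, with the default window a page unacknowledged for 12 minutes goes to the secondary, and anything past 20 minutes stays with the manager.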

## Preventing Incident Fatigue

- Limit consecutive weeks on primary.
- Hold handoff meetings that cover what's in flight and recent changes.
- Post-incident rest: no meetings for IC for a few hours after SEV1/2.
- Track pages: if someone is woken up constantly, fix the alerts or the system.
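Tracking pages can start as a simple count of off-hours pages per person. In this sketch the function names, the 22:00 to 07:00 off-hours window, and the threshold of two pages are all assumptions; timestamps are taken to be in each responder's local time.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple


def is_off_hours(ts: datetime, start_hour: int = 22, end_hour: int = 7) -> bool:
    """True if ts falls in the overnight window (assumed 22:00 to 07:00)."""
    return ts.hour >= start_hour or ts.hour < end_hour


def frequently_woken(pages: Iterable[Tuple[str, datetime]],
                     max_off_hours_pages: int = 2) -> List[str]:
    """Return people paged off-hours more than the allowed number of times."""
    counts = Counter(person for person, ts in pages if is_off_hours(ts))
    return sorted(p for p, n in counts.items() if n > max_off_hours_pages)
```

Anyone this returns is a signal to review the alerts that fired, not the person: either the alert is too noisy or the system needs a fix.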

## Learning from Incidents — SLOs and Error Budgets

**SLO (Service Level Objective):** "99.9% of requests succeed."  
**Error budget:** The amount of failure the SLO allows (e.g., 0.1%, which is about 43 min of full downtime in a 30-day month). While you're within budget, you can keep shipping; once the budget is spent, prioritize reliability work.

Incidents consume the error budget. Use postmortems to decide: was this a one-off or a systemic problem? Invest in fixes that prevent recurrence and protect the budget.
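The arithmetic behind the budget is short enough to show directly. This sketch treats the SLO as a time-based availability target over a 30-day window (both assumptions; a request-based SLO would count failed requests instead), and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability SLO over the period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Budget left after the downtime incurred so far; negative means overspent."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


budget = error_budget_minutes(0.999)          # 0.1% of 43,200 minutes
remaining = budget_remaining(0.999, 30.0)     # after a 30-minute outage
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A single 30-minute SEV1 uses most of a 99.9% monthly budget, which is why postmortem action items that prevent recurrence are worth prioritizing.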

## Incident Lifecycle Flow (Simplified)

```mermaid
flowchart LR
  A[Alert] --> B[Triage]
  B --> C[Mitigate]
  C --> D[Resolve]
  D --> E[Postmortem]
  E --> A
```

Continuous improvement: each incident makes the system more resilient if we learn from it.
