# Incident Response Walkthrough — Learn by Doing

## Before We Begin

Incidents follow a lifecycle: detect, triage, communicate, mitigate, resolve, and learn. How you handle each phase—and who does what—determines whether you fix things quickly and improve, or repeat the same mistakes.

**Diagnostic question:** Have you (or your team) ever had something break in production? What went well? What was chaotic—alerts, communication, or knowing who owned what?

**Checkpoint:** You can name at least two phases of the incident lifecycle and one thing that often goes wrong.

---

## Step 1: Define Severity Levels

<!-- hint:diagram mermaid-type="flowchart" topic="Incident lifecycle from detect to postmortem" -->
<!-- hint:list style="cards" -->

**Task:** For your system (or a hypothetical one), define SEV1–4 with concrete examples. Include: impact description, example scenario, and target response time for each.

**Question:** How would someone know in the moment which severity applies? What's ambiguous?

**Checkpoint:** Each severity has a clear example and response time.
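
As one possible starting point, the severity definitions can be written down as data so that impact, example, and response time sit side by side. The sketch below is illustrative only: the impact wording, the response times, and the `classify` thresholds are assumptions to be replaced with values your team agrees on.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # least severe


@dataclass(frozen=True)
class SeverityDefinition:
    impact: str
    example: str
    response_minutes: int  # target time to first response


# Hypothetical values; none of these numbers come from a standard.
SEVERITY_TABLE = {
    Severity.SEV1: SeverityDefinition(
        "Full outage or data loss", "Checkout fails for all users", 5),
    Severity.SEV2: SeverityDefinition(
        "Major feature degraded", "Login fails for some users", 15),
    Severity.SEV3: SeverityDefinition(
        "Minor feature degraded", "Report export is slow", 60),
    Severity.SEV4: SeverityDefinition(
        "Cosmetic or low impact", "Typo on the settings page", 480),
}


def classify(users_affected_pct: float, core_feature_down: bool) -> Severity:
    """Hypothetical triage rule based on two observable signals."""
    if core_feature_down and users_affected_pct >= 50:
        return Severity.SEV1
    if core_feature_down or users_affected_pct >= 10:
        return Severity.SEV2
    if users_affected_pct >= 1:
        return Severity.SEV3
    return Severity.SEV4
```

Writing the rule as code tends to expose the ambiguous cases quickly, for example a core feature that is down for only 2% of users.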

---

## Step 2: Draft a Communication Template

**Task:** Write two templates: (1) internal incident announcement (Slack/email) and (2) external status page update. Include placeholders for severity, impact, status, and ETA.

**Question:** What do stakeholders need to know immediately vs. what can wait? How often should you update?

**Checkpoint:** Templates are copy-pastable with clear placeholders.
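
If it helps to see the placeholders in action, here is a minimal sketch using Python's standard `string.Template`. The message wording and the `update_interval_minutes` field are assumptions, and leaving severity out of the external message is one possible choice rather than a rule.

```python
from string import Template

# Internal announcement (Slack/email). Wording is illustrative only.
INTERNAL_ANNOUNCEMENT = Template(
    "[$severity] Incident declared: $impact\n"
    "Status: $status\n"
    "ETA: $eta\n"
    "Next update in $update_interval_minutes minutes."
)

# External status page update. Severity is omitted here by assumption;
# add a $severity placeholder if your team publishes it.
STATUS_PAGE_UPDATE = Template(
    "We are investigating an issue affecting $impact. "
    "Current status: $status. Estimated resolution: $eta."
)


def render(template: Template, **fields: str) -> str:
    # substitute() raises KeyError when a placeholder has no value,
    # so a half-filled announcement fails loudly instead of being sent.
    return template.substitute(**fields)
```

A call such as `render(STATUS_PAGE_UPDATE, impact="login", status="Investigating", eta="unknown")` returns the filled-in text ready to paste.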

---

## Step 3: Outline Incident Roles

**Task:** Define the roles for your team's incident response: Incident Commander, Comms Lead, Responders. For each, write 2–3 responsibilities. Who would fill each role in a real incident?

**Question:** What happens if the usual IC is unavailable? How do you rotate?

**Checkpoint:** Roles are defined; you know who does what and who backs up whom.
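
The backup question can be made concrete with a small lookup. This is a sketch under assumptions: the names, the responsibilities listed, and the "first available person in order" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RoleAssignment:
    role: str
    responsibilities: List[str]
    primary: str
    backups: List[str]  # ordered: first backup is tried first


def who_fills(assignment: RoleAssignment, unavailable: Set[str]) -> Optional[str]:
    """Return the first available person for the role, or None if nobody is."""
    for person in [assignment.primary, *assignment.backups]:
        if person not in unavailable:
            return person
    return None


# Hypothetical example assignment.
incident_commander = RoleAssignment(
    role="Incident Commander",
    responsibilities=["Own decisions", "Assign tasks", "Declare resolution"],
    primary="alice",
    backups=["bob", "carol"],
)
```

A `None` result is the interesting case: it means the role has no coverage, which is worth finding out before a real incident.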

---

## Step 4: Write a Postmortem Template

**Task:** Create a blameless postmortem template. Include: Summary, Timeline, Root Cause (5 whys), Impact, Action Items (with owners), and Lessons Learned. Add 1–2 "blameless language" guidelines (how to describe what happened in terms of systems and decisions rather than blaming individuals).

**Question:** How do you make it safe for people to contribute honestly? What would discourage participation?

**Checkpoint:** Template is complete; language guidelines support blameless culture.
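
One way to keep the template consistent across incidents is to generate the skeleton and check that every action item has an owner. The sketch below assumes action items are plain dictionaries with hypothetical `title` and `owner` keys.

```python
from typing import Dict, List

# Section names taken from the task above.
SECTIONS = [
    "Summary",
    "Timeline",
    "Root Cause (5 Whys)",
    "Impact",
    "Action Items",
    "Lessons Learned",
]


def postmortem_skeleton(title: str) -> str:
    """Build an empty Markdown postmortem with one heading per section."""
    lines = [f"# Postmortem: {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)


def unowned_action_items(items: List[Dict[str, str]]) -> List[str]:
    """Return titles of action items whose owner is missing or empty."""
    return [item["title"] for item in items if not item.get("owner")]
```

Running `unowned_action_items` before the review closes is a cheap guard against action items that nobody picks up.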

---

## Step 5: Map the Incident Lifecycle

**Task:** Draw or describe your team's incident flow from Detect → Resolve → Postmortem. Include: how alerts fire, who gets paged, how triage happens, where updates are posted, and when the postmortem is scheduled.

**Question:** Where are the gaps? What would fail in a real 3 a.m. incident?

**Checkpoint:** Flow is documented; at least one gap is identified.
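
If you describe the flow as data rather than a drawing, gaps can be listed mechanically. This sketch assumes each phase records a hypothetical `owner` and `where` (the place updates or work happen); the example entries are placeholders.

```python
from typing import Dict, List

PHASES = ["detect", "triage", "communicate", "mitigate", "resolve", "postmortem"]


def find_gaps(flow: Dict[str, Dict[str, str]]) -> List[str]:
    """Return phases that are missing or lack an owner or a location."""
    gaps = []
    for phase in PHASES:
        entry = flow.get(phase, {})
        if not entry.get("owner") or not entry.get("where"):
            gaps.append(phase)
    return gaps


# Hypothetical flow with two deliberate gaps.
EXAMPLE_FLOW = {
    "detect": {"owner": "monitoring alerts", "where": "pager"},
    "triage": {"owner": "primary on-call", "where": "incident channel"},
    "communicate": {"owner": "", "where": "status page"},  # no owner
    "mitigate": {"owner": "responders", "where": "incident channel"},
    "resolve": {"owner": "incident commander", "where": "incident channel"},
    # "postmortem" is not documented at all
}
```

Here `find_gaps(EXAMPLE_FLOW)` reports `communicate` and `postmortem`, which is the kind of finding the checkpoint asks for.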

---

## Step 6: Design an On-Call Runbook Entry

**Task:** Pick one realistic incident type (e.g., "High error rate on API"). Write a runbook entry: symptoms, how to confirm, steps to mitigate, who to escalate to, and where to document.

**Question:** Could someone unfamiliar with the system follow this at 2 a.m.? What's missing?

**Checkpoint:** Runbook is actionable; an on-call engineer could follow it.
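
For the "High error rate on API" example, a runbook entry might be sketched as structured data plus one concrete confirmation check. Everything specific here is an assumption: the field names, the example steps, and treating status codes of 500 and above as errors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunbookEntry:
    title: str
    symptoms: List[str]
    confirm: List[str]
    mitigate: List[str]
    escalate_to: str
    document_in: str

    def to_markdown(self) -> str:
        """Render the entry as Markdown with numbered action steps."""
        def bullets(items: List[str]) -> List[str]:
            return [f"- {item}" for item in items]

        def numbered(items: List[str]) -> List[str]:
            return [f"{i}. {item}" for i, item in enumerate(items, start=1)]

        lines = [
            f"### {self.title}", "",
            "**Symptoms**", *bullets(self.symptoms), "",
            "**Confirm**", *numbered(self.confirm), "",
            "**Mitigate**", *numbered(self.mitigate), "",
            f"**Escalate to:** {self.escalate_to}",
            f"**Document in:** {self.document_in}",
        ]
        return "\n".join(lines)


def error_rate(status_codes: List[int]) -> float:
    """Share of responses with a 5xx status; 0.0 for an empty sample."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


# Hypothetical entry; replace the steps and contacts with your own.
API_ERRORS = RunbookEntry(
    title="High error rate on API",
    symptoms=["5xx alerts firing", "Users report failed requests"],
    confirm=["Compute the 5xx share of recent requests"],
    mitigate=["Roll back the most recent deploy", "Verify the error rate drops"],
    escalate_to="secondary on-call (placeholder)",
    document_in="incident channel (placeholder)",
)
```

Numbered steps are used for the confirm and mitigate sections because order matters there, while symptoms are an unordered list.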

---

## Step 7: Plan a Post-Incident Review

**Task:** Schedule a blameless postmortem for a past incident (real or hypothetical). Write the agenda: timebox per section, who facilitates, how action items are tracked. Include one "retrospective" question: "What would we do differently in the response itself?"

**Question:** How do you avoid making the postmortem feel punitive? How do you ensure action items get done?

**Checkpoint:** Agenda is timeboxed; facilitation and follow-through are planned.
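
To check that the timeboxes actually fit the meeting, the agenda can be turned into start times. The section titles and durations below are assumptions for a 60-minute review, not a recommended split.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (section title, minutes); hypothetical timeboxes that sum to 60.
AGENDA = [
    ("Summary and impact", 5),
    ("Timeline walkthrough", 15),
    ("Root cause discussion", 15),
    ("What would we do differently in the response itself?", 10),
    ("Action items and owners", 15),
]


def schedule(
    start: datetime, agenda: List[Tuple[str, int]]
) -> Tuple[List[Tuple[str, str]], datetime]:
    """Return (start time, title) pairs and the meeting end time."""
    slots = []
    current = start
    for title, minutes in agenda:
        slots.append((current.strftime("%H:%M"), title))
        current += timedelta(minutes=minutes)
    return slots, current
```

Comparing the returned end time with the booked slot shows immediately whether a section needs trimming.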
