# Incident Response — Exercises

## Exercise 1: Severity Classification

**Task:** Classify these incidents as SEV1–4 and justify: (A) Payment processing returns 500 for 10% of users. (B) Typo on "About" page. (C) Full site returns 502 for all users. (D) Search is slow (3s vs 1s) for one region.

**Validation:**
- [ ] Each incident has a severity
- [ ] Justification references impact and scope
- [ ] Response time is implied or stated

**Hints:**
1. (C) Full site down → SEV1
2. (A) Payment 500 for 10% → SEV2 (major, key feature)
3. (D) Search slow, one region → SEV3 (degradation, workaround: wait or retry)
4. (B) Typo → SEV4
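
The hints can be turned into a small rule table. The following is a minimal Python sketch, not a standard: the `Incident` fields, the `kind` labels, the rule order, and the 0.25 share assumed for "one region" are all illustrative assumptions that you should replace with your own severity policy.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    kind: str                 # "errors", "degraded", or "cosmetic" (hypothetical labels)
    affected_fraction: float  # share of users affected, 0.0 to 1.0
    core_feature: bool        # True for revenue- or access-critical paths


def classify(incident: Incident) -> int:
    """Return a severity from 1 (worst) to 4 using illustrative rules."""
    if incident.kind == "cosmetic":
        return 4  # no functional impact
    if incident.kind == "degraded":
        return 3  # slower but usable; a workaround exists
    # Remaining case: hard errors.
    if incident.affected_fraction >= 1.0:
        return 1  # everyone is affected
    if incident.core_feature:
        return 2  # partial failure of a key feature
    return 3


incidents = [
    Incident("A: payment 500s", "errors", 0.10, True),
    Incident("B: About page typo", "cosmetic", 1.0, False),
    Incident("C: full site 502", "errors", 1.0, True),
    # 0.25 is an assumed share for "one region"; the exercise gives no number.
    Incident("D: slow search, one region", "degraded", 0.25, False),
]
for inc in incidents:
    print(f"{inc.name} -> SEV{classify(inc)}")
```

Writing the rules down this way makes the justification explicit: each branch names the impact (what breaks) and the scope (who is affected).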

---

## Exercise 2: Communication Template in Action

**Task:** A database failover fails at 2 p.m.; the site is down. Write the internal announcement and external status update you'd post within 5 minutes. Then write the "Resolved" update for when you fix it 45 minutes later.

**Validation:**
- [ ] Internal has: severity, impact, status, IC, ETA
- [ ] External is customer-friendly, no jargon
- [ ] Resolved update thanks users and summarizes what happened

**Hints:**
1. Internal: "SEV1, full outage, mitigating, IC: [name], ETA: TBD (still investigating)"
2. External: "We're experiencing an outage. Our team is on it. Next update in 15 min."
3. Resolved: "Service is restored. A database issue caused about 45 minutes of downtime. Thank you for your patience. We'll share a postmortem within 48 hours."
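
Teams often keep these messages as fill-in templates so nothing is forgotten under pressure. A minimal Python sketch follows; the function names, the message wording, and the calendar date are assumptions made for illustration, and the 15-minute update interval comes from the external hint.

```python
from datetime import datetime, timedelta


def internal_announcement(severity, impact, status, ic, eta):
    # Covers the five fields from the validation checklist.
    return (f"[{severity}] Impact: {impact} | Status: {status} | "
            f"IC: {ic} | ETA: {eta}")


def external_update(summary, now, next_update_minutes=15):
    # Promise a concrete time for the next update instead of a vague "soon".
    next_update = now + timedelta(minutes=next_update_minutes)
    return (f"{summary} Our team is working on it. "
            f"Next update by {next_update:%H:%M}.")


def resolved_update(started, resolved):
    minutes = int((resolved - started).total_seconds() // 60)
    return (f"Service is restored after about {minutes} minutes of downtime. "
            "Thank you for your patience. "
            "We'll share a postmortem within 48 hours.")


start = datetime(2024, 1, 1, 14, 0)  # arbitrary date; 2 p.m. is from the task
print(internal_announcement("SEV1", "full site outage", "mitigating",
                            "[name]", "TBD"))
print(external_update("We're experiencing an outage.",
                      start + timedelta(minutes=5)))
print(resolved_update(start, start + timedelta(minutes=45)))
```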

---

## Exercise 3: Blameless Postmortem Draft

**Task:** A deploy at 10 a.m. introduced a bug; it wasn't caught by CI. Incident resolved at 11:30 a.m. Draft the "Root Cause" section using 5 whys. Use blameless language—focus on process, not people.

**Validation:**
- [ ] 5 whys go from symptom to systemic cause
- [ ] No blame ("the process allowed" not "Alice didn't test")
- [ ] At least one action item is implied

**Hints:**
1. Why did the bug reach prod? → CI didn't catch it
2. Why didn't CI catch it? → No test for this case
3. Why no test? → Gap in test coverage / requirements
4. Keep asking until you reach a process-level cause, then ask: how do we close the gap?
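
One rough way to self-check the "no blame" item is to flag sentences that name a teammate so you can reword them. The sketch below is a crude heuristic under stated assumptions: the roster and the helper name `sentences_naming_people` are hypothetical, and a name match only means the sentence deserves a second look.

```python
import re


def sentences_naming_people(text, names):
    """Return sentences that mention any name from the roster."""
    if not names:
        return []
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]


draft = ("The process allowed the change to ship without a test. "
         "Alice didn't test the edge case.")
roster = ["Alice", "Bob"]  # hypothetical team roster
for sentence in sentences_naming_people(draft, roster):
    print("Review for blame:", sentence)
```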

---

## Exercise 4: Runbook Entry

**Task:** Write a runbook entry for "Database replica lag exceeds 30 seconds." Include: how to detect, how to confirm, mitigation steps (2–4), escalation path, and where to log the incident.

**Validation:**
- [ ] Detection is specific (metric, dashboard, alert)
- [ ] Mitigation steps are ordered and actionable
- [ ] Escalation path is clear

**Hints:**
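1. Detect: Dashboard graph of replica lag (e.g., CloudWatch), PagerDuty alert at the 30-second threshold
2. Confirm: Check the lag metric on the replica itself, verify user impact (e.g., stale reads)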
3. Mitigate: Identify slow query, kill if safe, scale read replicas, failover if critical
4. Escalate: DB team, IC if no progress in 15 min
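
The detect, confirm, and escalate steps can be expressed as a small decision function. This is a sketch under assumptions: the 30-second threshold and the 15-minute escalation window come from the exercise, while the three-consecutive-samples confirmation rule and the function names are illustrative choices. Fetching the lag samples from your monitoring system is left out.

```python
LAG_THRESHOLD_S = 30     # from the runbook title
ESCALATE_AFTER_MIN = 15  # from the escalation hint


def lag_confirmed(samples, needed=3):
    """True if the last `needed` samples all exceed the threshold.

    Requiring consecutive samples is an assumed guard against a single spike.
    """
    recent = samples[-needed:]
    return len(recent) == needed and all(s > LAG_THRESHOLD_S for s in recent)


def next_action(samples, minutes_since_alert):
    """Map recent lag samples and elapsed time to a runbook step."""
    if not lag_confirmed(samples):
        return "monitor"
    if minutes_since_alert >= ESCALATE_AFTER_MIN:
        return "escalate"
    return "mitigate"


print(next_action([5, 40, 45, 50], minutes_since_alert=2))
print(next_action([40, 45, 50], minutes_since_alert=15))
```

Keeping the thresholds as named constants at the top makes the runbook and the alert rule easy to keep in sync.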

---

## Exercise 5: On-Call Rotation Design

**Task:** Design an on-call rotation for a 5-person team. Include: primary and secondary, rotation cadence (e.g., weekly), handoff process, and one policy to prevent fatigue (e.g., no back-to-back primaries).

**Validation:**
- [ ] Primary and secondary defined
- [ ] Cadence and handoff are specified
- [ ] At least one fatigue-prevention policy

**Hints:**
1. Weekly primary, weekly secondary (different person)
2. Handoff: 15-min call, share recent incidents, in-flight work
3. Policy: No one does primary 2 weeks in a row; comp time after SEV1
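
A simple round-robin satisfies the hints for a 5-person team. The sketch below is one possible design, not a recommendation: the names are hypothetical, and making this week's secondary next week's primary is an assumption chosen for handoff continuity.

```python
def rotation(team, weeks):
    """Yield (week, primary, secondary) tuples for a weekly rotation."""
    n = len(team)
    for week in range(weeks):
        primary = team[week % n]
        secondary = team[(week + 1) % n]  # becomes primary the following week
        yield week + 1, primary, secondary


team = ["Ana", "Ben", "Chen", "Dee", "Eli"]  # hypothetical names
schedule = list(rotation(team, 10))
for week, primary, secondary in schedule:
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

With this layout nobody is primary two weeks in a row, but each person is on call in some role for two consecutive weeks. If that is too tiring, offsetting the secondary by two or more positions is an alternative.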