games:
  - type: scenario
    title: "Incident Commander"
    startHealth: 5
    steps:
      - id: start
        situation: "It's 2 AM. PagerDuty wakes you — the payment service is down. Alerts are firing: 503s on checkout, payment gateway timeouts. You're the incident commander. What do you do first?"
        choices:
          - text: "Declare SEV1 immediately and page the full team."
            consequence: "You acted decisively. SEV1 is correct for customer-facing payment outages. The team is assembling."
            health: 1
            next: triage
          - text: "Wait 5 minutes to see if it self-recovers before paging anyone."
            consequence: "Every minute of payment downtime costs revenue and trust. SEV1 incidents need immediate response."
            health: -2
            next: triage
          - text: "Post in Slack and ask if anyone has seen this before."
            consequence: "An informal Slack question is too slow for a payment outage. You need to formally declare severity and page the on-call team. Time is critical."
            health: -1
            next: triage
      - id: triage
        situation: "The team is joining. You confirm: checkout is returning 503, payment gateway is timing out. One engineer says the gateway provider's status page shows 'degraded.' What's your next move?"
        choices:
          - text: "Create a shared doc, assign roles (comms, technical lead, scribe), and have the technical lead investigate the root cause."
            consequence: "Good incident hygiene. Clear roles prevent chaos. The technical lead can focus on debugging while you coordinate."
            health: 1
            next: communicate
          - text: "Have everyone start debugging in parallel."
            consequence: "Too many cooks. Without a single technical lead and clear assignments, people duplicate work and step on each other."
            health: -1
            next: communicate
          - text: "Pause and run a full architecture review before touching anything."
            consequence: "Architecture reviews are for postmortems. During an incident, you need fast triage and mitigation, not deep analysis."
            health: -2
            next: communicate
      - id: communicate
        situation: "Stakeholders are asking for updates. Support is getting flooded with tickets. The technical lead is still investigating. What do you prioritize?"
        choices:
          - text: "Send a brief status to stakeholders ('We're investigating payment gateway issues; next update in 15 minutes'), then let the tech lead work."
            consequence: "Right balance. Stakeholders get reassurance; the team isn't distracted. Time-boxed updates manage expectations."
            health: 1
            next: debug_vs_comms
          - text: "Ignore stakeholders until you have a root cause."
            consequence: "Stakeholders need to know you're on it. Radio silence increases anxiety and can trigger escalations."
            health: -1
            next: debug_vs_comms
          - text: "Pull the technical lead off investigation to draft a detailed customer-facing post."
            consequence: "Investigation should take priority. Detailed posts can wait. Brief internal updates are enough for now."
            health: -2
            next: debug_vs_comms
      - id: debug_vs_comms
        situation: "The tech lead suspects a bad deploy from 2 hours ago. Rolling back would take ~20 minutes. The payment gateway provider's status still says 'degraded.' Do you roll back or wait for more data?"
        choices:
          - text: "Roll back now. Payment is critical; the deploy is the most likely culprit and we can't afford to wait."
            consequence: "Pragmatic. For payment, speed matters. If the rollback fixes it, you're done. If not, you've ruled out a major variable."
            health: 1
            next: rollback_result
          - text: "Wait for the gateway provider to confirm their status before deciding."
            consequence: "External status pages can lag. Your own deploy is something you control. Delaying the rollback extends customer impact."
            health: -1
            next: rollback_result
          - text: "Run A/B tests to confirm the deploy is the cause."
            consequence: "A/B tests take time. During an incident, you need fast, reversible actions. Rollback is low risk and high signal."
            health: -2
            next: rollback_result
      - id: rollback_result
        situation: "You roll back, and checkout recovers. The incident is resolved. Now the CEO asks for a postmortem by end of week. How do you approach it?"
        choices:
          - text: "Schedule a blameless postmortem with everyone involved. Focus on what we'll change (process, tooling, checks), not who screwed up."
            consequence: "Blameless postmortems build learning culture. People share more, and you get better preventative measures."
            health: 1
            next: postmortem
          - text: "Assign it to the engineer who pushed the deploy."
            consequence: "That creates blame and discourages future transparency. Postmortems should be collaborative and blameless."
            health: -2
            next: postmortem
          - text: "Skip the postmortem — we fixed it, let's move on."
            consequence: "Every incident is a learning opportunity. Skipping postmortems means the same failures repeat."
            health: -1
            next: postmortem
      - id: postmortem
        situation: "In the postmortem, the team identifies two gaps: the payment service has no canary deployment, and the deploy went out without a staging gate. What should you document?"
        choices:
          - text: "Write action items: add a canary deployment, require staging validation for the payment service, set up alerts for gateway timeouts."
            consequence: "Well done, Incident Commander. Clear, actionable items. You triaged severity, coordinated the team, communicated with stakeholders, made a timely rollback decision, and ran a blameless postmortem."
            health: 1
            next: end
          - text: "Document that 'we'll be more careful next time.'"
            consequence: "Vague. 'Be more careful' doesn't change systems. You need concrete process or tooling changes."
            health: -1
            next: end
          - text: "Blame the engineer who merged without sufficient review."
            consequence: "Blameless means focusing on system failures, not individuals. Blame stops people from participating honestly."
            health: -2
            next: end

  - type: classify
    title: "Severity Sorter"
    categories:
      - name: "SEV1 - Critical"
        color: "#f85149"
      - name: "SEV2 - Major"
        color: "#d29922"
      - name: "SEV3 - Minor"
        color: "#58a6ff"
      - name: "SEV4 - Low"
        color: "#8b949e"
    items:
      - text: "All checkouts failing, payment service returning 503. Revenue impact."
        category: "SEV1 - Critical"
      - text: "Authentication service down — no one can log in."
        category: "SEV1 - Critical"
      - text: "Database primary unreachable; replicas serving read-only. Writes failing."
        category: "SEV1 - Critical"
      - text: "Search is slow (2–3s) but returning results. Some users complaining."
        category: "SEV2 - Major"
      - text: "CDN cache miss rate up 20%. Page loads slightly slower."
        category: "SEV2 - Major"
      - text: "Admin dashboard export fails for reports over 10k rows."
        category: "SEV2 - Major"
      - text: "Non-critical feature flag not applying for a small segment."
        category: "SEV3 - Minor"
      - text: "Email notifications delayed by 5–10 minutes."
        category: "SEV3 - Minor"
      - text: "Minor UI glitch on settings page in Safari only."
        category: "SEV3 - Minor"
      - text: "Spelling error in FAQ page."
        category: "SEV4 - Low"
      - text: "Deprecated API endpoint returns 410 as expected; one client not updated."
        category: "SEV4 - Low"
      - text: "Analytics pipeline backlog of 1 hour; dashboards slightly stale."
        category: "SEV4 - Low"
