games:
  - type: scenario
    title: "Scale the System"
    startHealth: 5
    steps:
      - id: start
        situation: "You're designing a URL shortener (like bit.ly). It launches and goes viral — traffic spikes 100x in a week. You need a database to store short codes and long URLs. Which do you choose?"
        choices:
          - text: "PostgreSQL — ACID guarantees, easy schema, good for relational data."
            consequence: "PostgreSQL is solid, but a single primary can become a bottleneck under a 100x spike. A shortener is read-heavy, so you might need caching or horizontal scaling soon."
            health: 0
            next: database
          - text: "Redis — blazing fast, simple key-value. Short code → long URL fits perfectly."
            consequence: "Redis is great as a cache, but persisting to disk and guaranteeing durability are trickier. You'd typically use Redis as a cache, not as the only primary store."
            health: -1
            next: database
          - text: "PostgreSQL for durability + Redis for hot cache — best of both."
            consequence: "Smart. Durable storage plus fast reads. You can scale reads with Redis and keep writes in PostgreSQL."
            health: 1
            next: database
      - id: database
        situation: "Traffic keeps growing. The Redis cache entry for a popular link expires, and suddenly 10,000 requests for that link hit the database at once. Cache stampede! How do you handle it?"
        choices:
          - text: "Let them all hit the database — it can handle it."
            consequence: "A stampede can overwhelm the DB. You need to prevent the thundering herd."
            health: -2
            next: stampede
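          - text: "Use probabilistic early expiration: each request may refresh the entry shortly before its TTL runs out."
            consequence: "Probabilistic early expiration (e.g. 'recompute before expire') refreshes a hot entry before it expires, with a probability that rises as expiry nears, so requests rarely all miss at once. Good pattern."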
            health: 1
            next: stampede
          - text: "Implement request coalescing — only one request fetches; others wait for it."
            consequence: "Request coalescing (or 'single-flight') ensures only one request recomputes the value. The others block and share the result. Also valid."
            health: 1
            next: stampede
      - id: stampede
        situation: "Cache stampede is under control. But global users complain about latency — requests from Asia take 200ms to reach your US datacenter. What next?"
        choices:
          - text: "Add more database replicas in the same region."
            consequence: "Replicas help DB load but don't fix network latency. Users far from the server still wait."
            health: -1
            next: cdn
          - text: "Put a CDN in front — cache short URLs at edge locations worldwide."
            consequence: "CDN caches at the edge. Most requests never hit your origin. Latency drops dramatically."
            health: 1
            next: cdn
          - text: "Move the entire system to a region closer to most users."
            consequence: "Migrating to a single new region is disruptive and only shifts the latency to users elsewhere. A CDN gives you a global edge without moving everything."
            health: 0
            next: cdn
      - id: cdn
        situation: "CDN is live. But one influencer posts a link — suddenly one short code gets 80% of all traffic. Hot key! The cache node for that key is overwhelmed. How do you deal with it?"
        choices:
          - text: "Shard the cache by short code — spread hot keys across nodes."
            consequence: "Sharding by short code maps each key to exactly one node, so a single hot key still lands on one node. Hot keys need different handling."
            health: -1
            next: hot_key
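          - text: "Replicate the hot key to multiple cache nodes and spread reads across the copies."
            consequence: "Replicating hot data spreads reads for one key across several nodes. Consistent hashing with virtual nodes balances keys across nodes but still maps each key to a single node, so on its own it does not fix a hot key. Good approach."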
            health: 1
            next: hot_key
          - text: "Ignore it — CDN should absorb most of it."
            consequence: "The CDN helps, but when one key is hot enough to overwhelm a single cache node, you need an explicit strategy."
            health: -1
            next: hot_key
      - id: hot_key
        situation: "Hot key is mitigated. Leadership says: prepare for 10x growth in 6 months. What do you prioritize?"
        choices:
          - text: "Rewrite everything in a faster language."
            consequence: "Premature optimization. Architecture and bottlenecks matter more than language for most scale problems."
            health: -2
            next: growth
          - text: "Add read replicas, consider write partitioning (e.g. by short code range), and do capacity planning."
            consequence: "Solid. Scale reads, plan for write partitioning if needed, and model capacity. Pragmatic."
            health: 1
            next: growth
          - text: "Build a real-time analytics dashboard first."
            consequence: "Analytics are nice but not critical for handling 10x. Core scalability comes first."
            health: -1
            next: growth
      - id: growth
        situation: "You've scaled the shortener: PostgreSQL + Redis cache, CDN, hot-key handling, and a growth plan. A new requirement: track click analytics (timestamp, geo, device) for each short code. How do you add it?"
        choices:
          - text: "Add analytics columns to the main URLs table."
            consequence: "Mixing transactional and analytical data slows both. High-volume analytics bloat the core table."
            health: -2
            next: end
          - text: "Write analytics to a separate high-throughput store (e.g. Kafka + data lake, or append-only store)."
            consequence: "Right. Separate the analytics pipeline from the core URL lookup. An event stream or append-only store keeps the shortener fast."
            health: 1
            next: end
          - text: "Store analytics in Redis and batch-flush to PostgreSQL."
            consequence: "Redis can buffer events, but sustained analytics volume may overwhelm it. A dedicated pipeline (e.g. Kafka) is more robust for high volume."
            health: 0
            next: end