---
command: devops:incident-response
description: "devops:incident-response"
---

# devops:incident-response

Guided incident response workflow for production outages with automated diagnostics, intelligent decision support, and remediation recommendations.

## Description

Execute comprehensive incident response workflows with automated diagnostics, severity-based escalation, and intelligent remediation recommendations.

This command provides:
- **Severity Classification**: P0/P1/P2/P3 workflows with automated escalation
- **Automated Diagnostics**: Service health, deployment analysis, log correlation, metrics anomaly detection
- **Decision Intelligence**: Rollback vs hotfix recommendations with confidence scoring
- **Timeline Documentation**: Auto-generated incident timeline with key events
- **Communication Templates**: Stakeholder status updates and post-mortem reports
- **Remediation Automation**: Guided or automated rollback with impact assessment

## Required Documentation Access

**MANDATORY:** Before incident response, query Context7 for best practices:

**Documentation Queries:**
- `mcp://context7/observability/incident-management` - Incident response frameworks and SRE practices
- `mcp://context7/kubernetes/troubleshooting` - K8s debugging, pod diagnostics, and log analysis
- `mcp://context7/prometheus/alerting` - Alert investigation and metrics correlation
- `mcp://context7/distributed-tracing/debugging` - Tracing-based root cause analysis
- `mcp://context7/devops/runbooks` - Runbook automation and decision tree patterns

**Why This is Required:**
- Ensures incident response follows SRE best practices and incident command structure
- Applies proven diagnostic workflows for Kubernetes and microservices architectures
- Validates rollback/remediation decisions against production safety standards
- Prevents premature or incorrect remediation actions during high-pressure incidents
- Implements proper communication patterns for stakeholder updates

## Usage

```bash
/devops:incident-response [options]
```

## Options

- `--severity ` - Incident severity classification (required)
  - `critical` (P0) - Production down, auto-page on-call, immediate action
  - `high` (P1) - Major degradation, escalation path, guided investigation
  - `medium` (P2) - Partial impact, standard investigation workflow
  - `low` (P3) - Minor issue, document and schedule fix
- `--service ` - Affected service name (required)
  - Examples: `payment-api`, `auth-service`, `user-database`
- `--symptoms ` - Incident symptoms (optional)
  - Examples: "500 errors", "high latency", "pod crashes"
- `--namespace ` - Kubernetes namespace (default: production)
- `--analyze-deployment ` - Analyze specific deployment version
  - Checks deployment correlation with incident timeline
- `--rollback` - Execute automated rollback
  - Requires `--target` version to be specified
  - Performs safety checks before rollback
- `--target ` - Target version for rollback
  - Examples: `v2.3.0`, `latest-stable`, `previous`
- `--timeframe ` - Analysis timeframe (default: 4h)
  - Examples: `1h`, `4h`, `24h`
- `--post-mortem` - Generate post-mortem report after resolution
  - Creates RCA document with timeline and findings
- `--dry-run` - Preview diagnostics without executing actions

## Examples

### Critical P0 Incident - Service Down

```bash
/devops:incident-response --severity critical --service payment-api
```

**Output:**
```
🚨 INCIDENT RESPONSE - P0 CRITICAL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Incident Details
  Service: payment-api
  Severity: P0 (Critical)
  Namespace: production
  Started: 2025-10-21 14:23:15 UTC

📊 Service Health
  ❌ payment-api: 3/5 pods failing (CrashLoopBackOff)
  ⚠️ Database: Connection pool exhausted (95/100)
  ✓ Redis cache: Healthy
  ✓ Auth service: Healthy

🕐 Timeline
  14:23:15 - Deployment v2.3.1 started
  14:23:45 - First pod failed health check
  14:24:10 - Error rate spiked 0.1% → 45%
  14:24:30 - Database connections maxed out
  14:25:00 - Auto-scaling triggered (5 → 10 pods)
  14:25:15 - All new pods failing startup

📈 Metrics Anomaly Detection
  ⚠️ Error rate: 45% (baseline: 0.1%) - 450x increase
  🔴 Latency P95: 8.5s (baseline: 150ms) - 57x increase
  🔴 Database query time: 2.3s (baseline: 50ms) - 46x increase
  ⚠️ Memory usage: 85% (baseline: 60%) - gradual increase
  ⚠️ CPU usage: 40% (baseline: 25%) - moderate increase

🔍 Recent Deployments
  ✓ v2.3.1 deployed 2 minutes ago by @devops-bot
    - Changed: payment processor logic
    - Added: new database query in payment validation
    - 🔴 Correlation: 95% confidence deployment caused incident

📝 Log Analysis (Last 15min)
  ERROR: Database connection timeout (823 occurrences)
  ERROR: Payment validation failed (412 occurrences)
  ERROR: Health check failed /health (156 occurrences)
  WARN: Circuit breaker opened for database (45 occurrences)

🔗 Distributed Tracing
  ⚠️ Trace analysis shows bottleneck in payment validation
  🔴 Database query: SELECT * FROM transactions WHERE user_id = ? (missing index)
  ⚠️ Average trace duration: 8.2s (baseline: 120ms)

💡 Recommended Action
  ➜ IMMEDIATE ROLLBACK to v2.3.0 (last stable)
  Confidence: 95%
  Reasoning:
    - Deployment correlation: Strong (2 min before incident)
    - Blast radius: High (45% error rate)
    - Known good state: v2.3.0 stable for 3 days
    - Root cause: Missing index on transactions.user_id

📋 Investigation Findings
  - Recent code change in payment processor
  - Database query time increased 46x (50ms → 2.3s)
  - Missing index on transactions.user_id causing full table scan
  - All pods failing due to database connection pool exhaustion
  - Suggested fix: Add index on transactions.user_id

🎯 Impact Assessment
  Affected users: ~2,400 active sessions
  Failed transactions: 1,247 in last 5 minutes
  Revenue impact: $18,500 estimated (based on avg transaction value)

🚀 Automated Actions Available
  [ ] Execute rollback to v2.3.0
  [ ] Scale down to stop new pod creation
  [ ] Enable maintenance mode
  [ ] Page on-call engineer (auto-paged)

🔄 Next Steps
  [ ] Execute rollback (recommended)
  [ ] Notify stakeholders via #incidents channel
  [ ] Prepare hotfix with database index
  [ ] Schedule post-mortem for tomorrow
  [ ] Update runbook with lessons learned

💬 Communication Template
  Subject: [P0] Payment API Outage - Rollback in Progress

  Status: Active incident - Rollback initiated
  Service: Payment API (payment-api)
  Impact: 45% error rate, payment processing degraded
  Action: Rolling back deployment v2.3.1 → v2.3.0
  ETA: 3-5 minutes to full restoration
  Root Cause: Missing database index causing query timeout
  Next Update: In 10 minutes
```

### High P1 Incident with Custom Symptoms

```bash
/devops:incident-response --severity high --service auth-service --symptoms "500 errors, high latency"
```

**Output:**
```
⚠️ INCIDENT RESPONSE - P1 HIGH
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Incident Details
  Service: auth-service
  Severity: P1 (High)
  Symptoms: 500 errors, high latency
  Namespace: production

📊 Service Health
  ⚠️ auth-service: 4/5 pods healthy (1 pod in Warning state)
  ✓ Database: Healthy
  ⚠️ Redis cache: High memory usage (88%)
  ✓ API gateway: Healthy

🔍 Diagnostic Decision Tree
  Symptom: 500 errors + high latency
  ├─ Recent deployment? NO (last deployment 2 days ago)
  ├─ External dependency issue? Checking...
  │  └─ Redis memory: 88% (threshold: 85%)
  └─ Recommended path: INVESTIGATE DEPENDENCIES

📈 Metrics Analysis
  ⚠️ Error rate: 8% (baseline: 0.5%) - 16x increase
  ⚠️ Latency P95: 1.2s (baseline: 80ms) - 15x increase
  🔴 Redis cache miss rate: 45% (baseline: 5%) - cache degradation
  ⚠️ Redis memory: 88% (approaching limit)

💡 Recommended Action
  ➜ INVESTIGATE Redis cache degradation (not deployment-related)
  Confidence: 75%
  Reasoning:
    - No recent deployments
    - Cache miss rate spike correlates with latency
    - Redis memory approaching limit
    - Gradual degradation over 30 minutes

🔧 Suggested Remediation
  1. Increase Redis memory limit (quick fix)
  2. Review cache eviction policy
  3. Analyze cache key distribution
  4. Consider cache warming strategy

🔄 Next Steps
  [ ] Increase Redis memory allocation
  [ ] Analyze cache keys with redis-cli
  [ ] Review auth-service caching logic
  [ ] Monitor error rate for improvement
  [ ] Escalate to P0 if no improvement in 15min
```

### Analyze Specific Deployment

```bash
/devops:incident-response --severity medium --service user-api --analyze-deployment v3.1.5
```

**Output:**
```
📊 DEPLOYMENT ANALYSIS - P2 MEDIUM
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔍 Deployment: v3.1.5
  Service: user-api
  Deployed: 2025-10-21 10:15:30 UTC (4h 12m ago)
  By: @engineer-alice

📈 Performance Comparison (v3.1.4 → v3.1.5)

  Metrics Before (v3.1.4):
    - Error rate: 0.2%
    - Latency P95: 95ms
    - Request rate: 1,200 req/s
    - Memory usage: 45%

  Metrics After (v3.1.5):
    - Error rate: 0.8% (+300% ⚠️)
    - Latency P95: 145ms (+52% ⚠️)
    - Request rate: 1,180 req/s (-1.7% ✓)
    - Memory usage: 52% (+15% ⚠️)

🔗 Deployment Correlation
  ⚠️ Medium confidence (65%) that deployment contributed to degradation

  Evidence:
    - Error rate increased 30min after deployment
    - Latency degradation is gradual, not immediate
    - Memory usage trending up slowly

💡 Recommendation
  ➜ MONITOR for additional 2 hours before rollback decision

  Reasoning:
    - Degradation is moderate (not critical)
    - No customer impact reports yet
    - Possible memory leak pattern (gradual increase)
    - Recommendation: Set up automated rollback if error rate > 2%

🔧 Monitoring Plan
  [ ] Set alert: Error rate > 2% → auto-rollback
  [ ] Set alert: Memory > 80% → investigate
  [ ] Review v3.1.5 code changes for memory leaks
  [ ] Prepare rollback plan to v3.1.4
```

### Execute Automated Rollback

```bash
/devops:incident-response --severity critical --service payment-api --rollback --target v2.3.0
```

**Output:**
```
🔄 AUTOMATED ROLLBACK - P0 CRITICAL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Rollback Plan
  Service: payment-api
  Current: v2.3.1 (failing)
  Target: v2.3.0 (last stable)
  Strategy: Rolling update (zero-downtime)

✓ Pre-Rollback Safety Checks
  ✓ Target version v2.3.0 exists in registry
  ✓ v2.3.0 was stable for 72 hours
  ✓ Database schema compatible
  ✓ No breaking API changes detected
  ✓ Rollback permission granted (auto-approved for P0)

🚀 Executing Rollback
  [████████████████████████░░░░░░] 70% - Rolling back pods (3/5)

  14:26:10 - Rollback initiated
  14:26:15 - Scaled down v2.3.1: 5 → 4 pods
  14:26:25 - Scaled up v2.3.0: 0 → 1 pod (healthy)
  14:26:35 - Scaled down v2.3.1: 4 → 3 pods
  14:26:40 - Scaled up v2.3.0: 1 → 2 pods (healthy)
  14:26:50 - Scaled down v2.3.1: 3 → 2 pods
  14:26:55 - Scaled up v2.3.0: 2 → 3 pods (healthy)

  Estimated time remaining: 1m 30s

✓ Rollback Complete
  Final state:
  ✓ payment-api v2.3.0: 5/5 pods healthy
  ✓ Error rate: 0.1% (back to baseline)
  ✓ Latency P95: 140ms (back to baseline)
  ✓ Database connections: 25/100 (normalized)

📊 Incident Resolution
  Total duration: 4 minutes 15 seconds
  MTTR achieved: Excellent (target: <10min for P0)

🔄 Next Steps
  [ ] Notify stakeholders of resolution
  [ ] Prepare hotfix for v2.3.1 issues
  [ ] Add database index for user_id
  [ ] Test hotfix in staging
  [ ] Schedule post-mortem meeting
```

### Generate Post-Mortem Report

```bash
/devops:incident-response --severity critical --service payment-api --post-mortem
```

**Output:**
```
📋 POST-MORTEM REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Incident: Payment API Outage (P0)
Date: 2025-10-21
Service: payment-api
Duration: 4 minutes 15 seconds
Incident Commander: @devops-bot (automated)

═══════════════════════════════════════════════════

## Executive Summary

The payment-api service experienced a critical (P0) outage on October 21, 2025, from 14:23 to 14:27 UTC (4m 15s duration). The incident was caused by a missing database index in deployment v2.3.1, resulting in 45% error rate and affecting approximately 2,400 active users with $18,500 estimated revenue impact. Automated rollback to v2.3.0 resolved the incident within target MTTR (<10min).

## Timeline

14:23:15 - Deployment v2.3.1 started
14:23:45 - First pod health check failure detected
14:24:10 - Error rate spiked from 0.1% to 45%
14:24:30 - Database connection pool exhausted (95/100 connections)
14:25:00 - Kubernetes auto-scaling triggered (5 → 10 pods)
14:25:15 - All new pods failing startup probes
14:25:30 - Incident classified as P0 Critical
14:25:45 - On-call engineer paged automatically
14:26:00 - Decision: Rollback to v2.3.0
14:26:10 - Automated rollback initiated
14:27:25 - Rollback complete, service restored
14:27:40 - Incident resolved

## Root Cause Analysis

**Primary Cause:**
Deployment v2.3.1 introduced a new database query in payment validation logic that performed a full table scan due to missing index on transactions.user_id column. This caused query execution times to increase from 50ms to 2.3s.

**Contributing Factors:**
1. Database index not created in migration scripts
2. Staging database had smaller dataset, did not expose performance issue
3. Load testing did not include realistic transaction volume
4. Database query performance not monitored in CI/CD pipeline

**Detection:**
Automated monitoring detected error rate spike within 55 seconds of incident start. Incident response system correctly identified deployment correlation (95% confidence).

## Impact Assessment

**User Impact:**
- Affected users: ~2,400 active sessions
- Failed transactions: 1,247 over 4-minute period
- Error rate: 45% (baseline: 0.1%)
- Latency P95: 8.5s (baseline: 150ms)

**Business Impact:**
- Estimated revenue impact: $18,500 (based on failed transactions)
- Customer support tickets: 12 opened during incident
- No customer data lost or corrupted
- SLA breach: No (resolved within 10min P0 target)

## What Went Well

✓ Automated detection within 55 seconds
✓ Deployment correlation identified correctly (95% confidence)
✓ Automated rollback executed without manual intervention
✓ MTTR achieved target (<10min for P0)
✓ Zero-downtime rollback strategy worked correctly
✓ Communication templates sent automatically

## What Went Wrong

✗ Missing index not detected in code review
✗ Staging environment data volume too small to catch issue
✗ Load testing did not include database performance validation
✗ No query performance monitoring in CI/CD

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add database index on transactions.user_id | @db-team | P0 | 2025-10-22 | Pending |
| Update migration to include index | @backend-team | P0 | 2025-10-22 | Pending |
| Add query performance monitoring to CI/CD | @devops-team | P1 | 2025-10-25 | Pending |
| Increase staging database to production-like volume | @infra-team | P1 | 2025-10-28 | Pending |
| Add database query analyzer to code review checklist | @engineering-manager | P2 | 2025-10-30 | Pending |
| Document lessons learned in runbook | @incident-commander | P2 | 2025-10-23 | Pending |

## Lessons Learned

1. **Database Performance Testing:** Always test with production-like data volumes
2. **Index Analysis:** Automated detection of missing indexes in CI/CD
3. **Query Monitoring:** Add database query performance to observability stack
4. **Code Review:** Enhanced checklist for database query changes
5. **Automated Response:** Incident response automation reduced MTTR significantly

## Recommendations

**Immediate (This Week):**
- Deploy hotfix v2.3.2 with database index
- Update staging environment with realistic data volume
- Add query performance gates to CI/CD pipeline

**Short-term (This Month):**
- Implement database query analyzer in static analysis
- Enhance load testing to include database performance scenarios
- Add slow query monitoring to Prometheus/Grafana

**Long-term (This Quarter):**
- Evaluate database query optimization tools
- Consider implementing query caching layer
- Review all services for similar index optimization opportunities

═══════════════════════════════════════════════════

Report generated: 2025-10-21 14:30:00 UTC
Generated by: @devops-bot (incident-response automation)
Report saved to: incidents/2025-10-21-payment-api-outage.md
```

## Implementation

This command uses the **@observability-engineer**, **@kubernetes-orchestrator**, and **@docker-containerization-expert** agents for comprehensive incident response.

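As a rough orientation for the implementation, the mapping from `--severity` values to workflows can be pictured as a small lookup table. This is a minimal sketch under stated assumptions, not the command's actual implementation: `SEVERITY_WORKFLOWS`, `resolveWorkflow`, and the field names are hypothetical, and the values are copied from the Options list and the per-severity workflows in this document.

```javascript
// Hypothetical lookup table; values mirror the severity levels,
// update cadences, and MTTR targets stated in this document.
const SEVERITY_WORKFLOWS = {
  critical: { level: "P0", onCall: "page", autoRollback: true, updateIntervalMin: 10, mttrTargetMin: 10 },
  high: { level: "P1", onCall: "notify", autoRollback: false, updateIntervalMin: 30, mttrTargetMin: 60 },
  medium: { level: "P2", onCall: null, autoRollback: false, updateIntervalMin: null, mttrTargetMin: 240 },
  low: { level: "P3", onCall: null, autoRollback: false, updateIntervalMin: null, mttrTargetMin: null },
};

// Resolve a --severity value to its workflow, rejecting unknown values.
function resolveWorkflow(severity) {
  if (!Object.prototype.hasOwnProperty.call(SEVERITY_WORKFLOWS, severity)) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  return SEVERITY_WORKFLOWS[severity];
}

const workflow = resolveWorkflow("critical");
console.log(`${workflow.level}: updates every ${workflow.updateIntervalMin} min`);
```

Keeping the table as data rather than branching logic makes it easy to review the escalation rules in one place and to reject a mistyped severity before any diagnostics run.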
### Step 1: Query Context7 for Best Practices

Before execution, the command queries Context7 for:
- Incident management frameworks (ITIL, SRE practices)
- Kubernetes troubleshooting patterns
- Alert correlation and metrics analysis
- Distributed tracing debugging techniques
- Runbook automation and decision trees

### Step 2: Severity Classification and Workflow Selection

**P0 (Critical) Workflow:**
```yaml
Severity: P0 - Production Down
Actions:
  - Auto-page on-call engineer
  - Escalate to incident commander
  - Enable war room / incident bridge
  - Automated rollback option available
  - Real-time stakeholder updates every 10min
  - MTTR target: <10 minutes
```

**P1 (High) Workflow:**
```yaml
Severity: P1 - Major Degradation
Actions:
  - Notify on-call engineer
  - Escalation path: Manager → Director
  - Guided investigation workflow
  - Manual rollback decision required
  - Stakeholder updates every 30min
  - MTTR target: <1 hour
```

**P2 (Medium) Workflow:**
```yaml
Severity: P2 - Partial Impact
Actions:
  - Create incident ticket
  - Standard investigation workflow
  - Monitor for escalation to P1
  - Stakeholder notification optional
  - MTTR target: <4 hours
```

**P3 (Low) Workflow:**
```yaml
Severity: P3 - Minor Issue
Actions:
  - Document incident
  - Schedule fix in next sprint
  - No immediate action required
  - Track as tech debt
```

### Step 3: Automated Diagnostics

**Service Health Checks (Kubernetes):**
```bash
# Uses @kubernetes-orchestrator agent

# Check pod status
kubectl get pods -n $NAMESPACE -l app=$SERVICE -o wide

# Check deployment status
kubectl describe deployment $SERVICE -n $NAMESPACE

# Check recent events
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Check pod logs (last 1 hour)
kubectl logs --since=1h -l app=$SERVICE -n $NAMESPACE --tail=1000

# Check resource usage
kubectl top pods -n $NAMESPACE -l app=$SERVICE
```

**Deployment Analysis (Last 4 Hours):**
```bash
# Check recent deployments via git log
git log --since="4 hours ago" --oneline --decorate

# Check Kubernetes deployment history
kubectl rollout history deployment/$SERVICE -n $NAMESPACE

# Check deployment events
kubectl describe deployment $SERVICE -n $NAMESPACE | grep -A 10 Events
```

**Error Log Correlation:**
```bash
# Aggregate errors from all pods
kubectl logs -l app=$SERVICE -n $NAMESPACE --since=15m | grep -i "error\|exception\|fatal" | sort | uniq -c | sort -rn

# Pattern detection (numbers are normalized so similar messages group together)
kubectl logs -l app=$SERVICE -n $NAMESPACE --since=15m | awk '/ERROR/ {print $0}' | sed 's/[0-9]\{1,\}/NUM/g' | sort | uniq -c | sort -rn
```

**Metrics Anomaly Detection (Prometheus Queries):**
```promql
# Error rate spike detection (current 5xx ratio vs. the same ratio one day earlier)
# sum() aggregates across label sets so the 5xx series divide by all requests
(
  sum(rate(http_requests_total{job="$SERVICE",status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="$SERVICE"}[5m]))
)
/
(
  sum(rate(http_requests_total{job="$SERVICE",status=~"5.."}[1h] offset 1d))
  /
  sum(rate(http_requests_total{job="$SERVICE"}[1h] offset 1d))
)

# Latency anomaly (P95 vs baseline)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="$SERVICE"}[5m])))
/
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="$SERVICE"}[1h] offset 1d)))

# Memory usage spike
(
  container_memory_usage_bytes{pod=~"$SERVICE.*"}
  /
  container_spec_memory_limit_bytes{pod=~"$SERVICE.*"}
)
/
(
  avg_over_time(container_memory_usage_bytes{pod=~"$SERVICE.*"}[1d] offset 1d)
  /
  container_spec_memory_limit_bytes{pod=~"$SERVICE.*"}
)
```

**Distributed Tracing Analysis (Jaeger):**
```bash
# Query Jaeger for error traces
# (-G with --data-urlencode keeps curl from treating {} in the tags filter as URL globbing)
curl -sG "http://jaeger:16686/api/traces" \
  --data-urlencode "service=$SERVICE" \
  --data-urlencode 'tags={"error":"true"}' \
  --data-urlencode "lookback=1h" \
  --data-urlencode "limit=100"

# Find slow traces (>1s); minDuration takes a duration string such as 1s
curl -s "http://jaeger:16686/api/traces?service=$SERVICE&minDuration=1s&lookback=1h"

# Analyze trace dependencies
curl -s "http://jaeger:16686/api/dependencies?service=$SERVICE&endTs=$(date +%s)000&lookback=3600000"
```

**Dependency Health Checks:**
```bash
# Check database connectivity
kubectl exec -it deployment/$SERVICE -n $NAMESPACE -- nc -zv $DB_HOST $DB_PORT

# Check Redis connectivity
kubectl exec -it deployment/$SERVICE -n $NAMESPACE -- redis-cli -h $REDIS_HOST ping

# Check external API dependencies
kubectl exec -it deployment/$SERVICE -n $NAMESPACE -- curl -s -o /dev/null -w "%{http_code}" https://api.external.com/health
```

### Step 4: Decision Tree Logic

**Troubleshooting Decision Tree:**
```yaml
Decision Tree:
  - Symptom: High error rate (>5%)
    ├─ Recent deployment? (last 4 hours)
    │  └─ YES → Confidence: 85% → Recommend: ROLLBACK
    │     - Check: Deployment timestamp vs error spike timestamp
    │     - Check: Error patterns match new code paths
    │     - Action: Automated rollback to last stable version
    │  └─ NO → Confidence: 60% → Recommend: INVESTIGATE DEPENDENCIES
    │     - Check: External API health
    │     - Check: Database performance
    │     - Action: Analyze dependency latency and error rates

  - Symptom: High latency (P95 >500ms)
    ├─ Gradual degradation? (over 30+ minutes)
    │  └─ YES → Suspect: Resource leak (memory/CPU)
    │     - Check: Memory trend over last 4 hours
    │     - Check: CPU usage trend
    │     - Action: Restart pods or scale horizontally
    │  └─ NO → Suspect: Sudden load spike or deployment
    │     - Check: Request rate change
    │     - Check: Recent deployment
    │     - Action: Scale up or rollback

  - Symptom: Pod crashes (CrashLoopBackOff)
    └─ Check: Startup probe failures
       ├─ Database connectivity issue?
       │  └─ Check: Database connection pool
       │     └─ Action: Increase connection limit or fix connection leak
       ├─ Memory limit exceeded?
       │  └─ Check: OOMKilled events
       │     └─ Action: Increase memory limit
       └─ Application startup error?
          └─ Check: Pod logs for stack traces
             └─ Action: Rollback or apply hotfix
```

### Step 5: Rollback Recommendation Engine

**Rollback Confidence Scoring:**
```javascript
// Sketch of the rollback recommendation logic.
// Durations are plain numbers so comparisons are numeric:
// deploymentAgeMinutes (minutes) and targetVersion.stableForHours (hours).
function calculateRollbackConfidence(diagnostics) {
  let confidence = 0;

  // Deployment correlation (max 40 points)
  if (diagnostics.deploymentAgeMinutes < 15) {
    confidence += 40;
  } else if (diagnostics.deploymentAgeMinutes < 60) {
    confidence += 30;
  } else if (diagnostics.deploymentAgeMinutes < 240) {
    confidence += 15;
  }

  // Error rate increase (max 30 points)
  const errorRateMultiple = diagnostics.errorRate / diagnostics.baselineErrorRate;
  if (errorRateMultiple > 10) {
    confidence += 30;
  } else if (errorRateMultiple > 5) {
    confidence += 20;
  } else if (errorRateMultiple > 2) {
    confidence += 10;
  }

  // Latency increase (max 15 points)
  const latencyMultiple = diagnostics.latencyP95 / diagnostics.baselineLatency;
  if (latencyMultiple > 5) {
    confidence += 15;
  } else if (latencyMultiple > 3) {
    confidence += 10;
  } else if (latencyMultiple > 2) {
    confidence += 5;
  }

  // Known stable version (max 15 points)
  if (diagnostics.targetVersion.stableForHours > 72) {
    confidence += 15;
  } else if (diagnostics.targetVersion.stableForHours > 24) {
    confidence += 10;
  } else if (diagnostics.targetVersion.stableForHours > 6) {
    confidence += 5;
  }

  // Total: 0-100%
  return confidence;
}

// Recommendation based on confidence
function recommendAction(confidence) {
  if (confidence >= 80) {
    return "IMMEDIATE ROLLBACK (High Confidence)";
  } else if (confidence >= 60) {
    return "ROLLBACK RECOMMENDED (Medium Confidence)";
  } else if (confidence >= 40) {
    return "MONITOR AND PREPARE ROLLBACK (Low Confidence)";
  } else {
    return "INVESTIGATE FURTHER (Deployment likely not the cause)";
  }
}
```

### Step 6: Impact Assessment

**User Impact Calculation:**
```bash
# Active sessions affected
# (no -it: a TTY is not available inside command substitution)
active_sessions=$(kubectl exec deployment/$SERVICE -n $NAMESPACE -- curl -s http://localhost:8080/metrics | grep '^active_sessions' | awk '{print $2}')

# Failed requests in last 5 minutes
# (-G with --data-urlencode avoids curl globbing on {} and [] in the query)
failed_requests=$(curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode "query=sum(increase(http_requests_total{job=\"$SERVICE\",status=~\"5..\"}[5m]))" | jq -r '.data.result[0].value[1]')

# Estimated revenue impact (example)
avg_transaction_value=15.00
revenue_impact=$(echo "$failed_requests * $avg_transaction_value" | bc)
```

### Step 7: Automated Rollback Execution

**Rolling Rollback Strategy (Zero-Downtime):**
```bash
# Uses @kubernetes-orchestrator agent
# Note: --to-revision expects a numeric rollout revision (see `kubectl rollout history`),
# so $TARGET_VERSION must hold a revision number here, not an image tag such as v2.3.0.

# Step 1: Verify target revision exists in the rollout history
if ! kubectl rollout history deployment/$SERVICE -n $NAMESPACE --revision=$TARGET_VERSION > /dev/null 2>&1; then
  echo "Error: Target version $TARGET_VERSION not found"
  exit 1
fi

# Step 2: Safety checks
echo "Running pre-rollback safety checks..."
# Check database schema compatibility
# Check API compatibility
# Check no breaking changes

# Step 3: Execute rolling rollback
kubectl rollout undo deployment/$SERVICE -n $NAMESPACE --to-revision=$TARGET_VERSION

# Step 4: Monitor rollback progress
kubectl rollout status deployment/$SERVICE -n $NAMESPACE --timeout=5m

# Step 5: Verify service health
kubectl get pods -n $NAMESPACE -l app=$SERVICE

# Step 6: Validate metrics returned to baseline
# Check error rate, latency, etc.
```

### Step 8: Communication Templates

**P0 Stakeholder Update Template:**
```markdown
Subject: [P0] {{ SERVICE }} Outage - {{ STATUS }}

Status: {{ ACTIVE | INVESTIGATING | RESOLVED }}
Service: {{ SERVICE_NAME }}
Impact: {{ ERROR_RATE }}% error rate, {{ AFFECTED_USERS }} users affected
Started: {{ INCIDENT_START_TIME }}
Duration: {{ INCIDENT_DURATION }}

Current Action: {{ CURRENT_ACTION }}
ETA: {{ ESTIMATED_RESOLUTION_TIME }}

Root Cause: {{ PRELIMINARY_ROOT_CAUSE }}

Next Update: In {{ UPDATE_INTERVAL }} minutes

Incident Commander: {{ COMMANDER }}
War Room: {{ WAR_ROOM_LINK }}
```

### Step 9: Post-Mortem Generation

**Automated RCA Document:**
```markdown
# Post-Mortem Template (Auto-Generated)

## Incident Summary
- **Incident ID**: {{ INCIDENT_ID }}
- **Severity**: {{ SEVERITY }}
- **Duration**: {{ DURATION }}
- **Service**: {{ SERVICE }}
- **Date**: {{ DATE }}

## Timeline
{{ AUTO_GENERATED_TIMELINE }}

## Root Cause Analysis
{{ DIAGNOSTIC_FINDINGS }}

## Impact Assessment
- **Affected Users**: {{ USER_COUNT }}
- **Failed Transactions**: {{ TRANSACTION_COUNT }}
- **Revenue Impact**: {{ REVENUE_IMPACT }}
- **SLA Breach**: {{ SLA_STATUS }}

## What Went Well
{{ POSITIVE_FINDINGS }}

## What Went Wrong
{{ NEGATIVE_FINDINGS }}

## Action Items
{{ AUTO_GENERATED_ACTION_ITEMS }}

## Lessons Learned
{{ LESSONS_LEARNED }}
```

## Output Format

```
🚨 INCIDENT RESPONSE - P0 CRITICAL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Incident Details
📊 Service Health
🕐 Timeline
📈 Metrics Anomaly Detection
🔍 Recent Deployments
📝 Log Analysis
🔗 Distributed Tracing
💡 Recommended Action
📋 Investigation Findings
🎯 Impact Assessment
🚀 Automated Actions Available
🔄 Next Steps
💬 Communication Template
```

## Best Practices Applied

Based on Context7 documentation and SRE incident management practices:

1. **Incident Command Structure** - Clear roles, escalation paths, war room coordination
2. **Automated Detection** - Sub-minute detection with deployment correlation
3. **Decision Intelligence** - Confidence-scored recommendations, not binary decisions
4. **Zero-Downtime Rollback** - Rolling update strategy preserves availability
5. **Communication Cadence** - Regular updates aligned with severity (P0: 10min, P1: 30min)
6. **Blameless Post-Mortems** - Focus on systems and processes, not individuals
7. **Runbook Automation** - Decision trees codified for consistent response
8. **Observability Integration** - Metrics, logs, traces correlated automatically
9. **Impact Quantification** - User count, revenue impact, SLA breach tracking
10. **Learning Culture** - Action items tracked, lessons documented

## Related Commands

- `/devops:monitoring-setup` - Setup comprehensive monitoring stack
- `/devops:alerting-configure` - Configure intelligent alerting rules
- `/devops:runbook-create` - Create automated runbooks for common incidents
- `/k8s:rollback` - Manual Kubernetes rollback operations

## Related Agents

- **@observability-engineer** - Primary agent for diagnostics and monitoring analysis
- **@kubernetes-orchestrator** - Kubernetes diagnostics, pod health, rollback execution
- **@docker-containerization-expert** - Container diagnostics and Docker troubleshooting
- **@sre-specialist** - SRE best practices, incident command, post-mortem facilitation

## Troubleshooting

### False Positive Rollback Recommendations

If rollback confidence is high even though the incident is unrelated to the deployment:
```bash
# Override with manual investigation
/devops:incident-response --severity high --service api --symptoms "latency spike" --analyze-deployment v2.1.0 --dry-run

# Review decision logic
# Check for deployment correlation false positives
# Adjust confidence scoring thresholds if needed
```

### Rollback Fails Due to Schema Incompatibility

```bash
# Check schema compatibility before rollback
kubectl exec -it deployment/$SERVICE -- ./scripts/check-schema-compatibility.sh

# If incompatible, use hotfix instead of rollback
/devops:incident-response --severity critical --service api --symptoms "errors" --rollback=false
```

### Metrics Not Available in Prometheus

```bash
# Verify Prometheus scraping
# (--arg passes the shell variable into jq; "$SERVICE" inside single quotes would be a literal string)
curl -s http://prometheus:9090/api/v1/targets | jq --arg job "$SERVICE" '.data.activeTargets[] | select(.labels.job == $job)'

# Fall back to log-based analysis
/devops:incident-response --severity high --service api --symptoms "errors" --use-logs-only
```

## Security Considerations

1. **Automated Rollback Authorization** - P0 incidents allow auto-rollback; P1 and lower severities require approval
2. **Audit Logging** - All incident actions logged with timestamps and approvers
3. **RBAC** - Incident response permissions aligned with on-call roles
4. **Sensitive Data** - Post-mortems scrubbed for PII, secrets, customer data
5. **Communication Channels** - Stakeholder updates sent only to authorized channels
6. **Rollback Safety** - Pre-rollback checks prevent incompatible version deployment

## Version History

- v2.0.0 - Initial Schema v2.0 release with Context7 integration
  - Complete incident response automation with P0/P1/P2/P3 workflows
  - Automated diagnostics: health checks, deployment analysis, log correlation
  - Decision intelligence: rollback confidence scoring, decision trees
  - Post-mortem automation with RCA templates
  - Communication template generation
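
To make the rollback authorization rule under Security Considerations concrete, here is a minimal sketch of a guard that could run before any rollback. `authorizeRollback` and its argument shape are hypothetical and not part of the command; the rules it encodes are the ones stated in this document: `--rollback` requires `--target`, `--dry-run` executes no actions, and only P0 (`critical`) incidents skip manual approval.

```javascript
// Hypothetical pre-rollback guard; returns a decision plus a reason for the audit log.
function authorizeRollback({ severity, target, dryRun = false, approvedBy = null }) {
  if (!target) {
    return { allowed: false, reason: "--rollback requires --target" };
  }
  if (dryRun) {
    return { allowed: false, reason: "dry run: no actions executed" };
  }
  if (severity === "critical") {
    return { allowed: true, reason: "auto-approved for P0" };
  }
  if (approvedBy) {
    return { allowed: true, reason: `approved by ${approvedBy}` };
  }
  return { allowed: false, reason: "manual approval required" };
}

console.log(authorizeRollback({ severity: "high", target: "v2.3.0" }).reason);
```

Returning a reason alongside the decision is one way to support the audit logging requirement, since every allow or deny outcome carries the rule that produced it.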