# AI-Assisted Development Metrics & Issue Classification System

## Technical Specification

**Version:** 1.0.0  
**Status:** Draft  
**Related PRD:** `ai-prd.md`  
**Target Release:** Q1 2025

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│  Roadcrew Commands (User Interface)                         │
│  /scope-release, /analyze-epic, /implement-*                │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│  Classification Engine (New)                                │
│  - Issue scoring (1-10)                                     │
│  - Confidence calculation                                   │
│  - Risk factor analysis                                     │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Repo     │  │ Issue    │  │Historical│
│ Analysis │  │ Content  │  │  Data    │
│(Enhanced)│  │ Parser   │  │ (Metrics)│
└──────────┘  └──────────┘  └──────────┘
        │            │            │
        └────────────┼────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Metrics Storage & Calibration                              │
│  - Implementation metrics (JSON)                            │
│  - Calibration loop (weekly)                                │
│  - Dashboard generation                                     │
└─────────────────────────────────────────────────────────────┘
```

---

## Technology Stack

**Language:** TypeScript (Node.js 18+)  
**Package Manager:** npm  
**Testing:** Jest  
**Schema Validation:** Zod  
**Data Storage:** JSON files (structured)  
**Visualization:** Chart.js (for dashboard)  
**Git Integration:** simple-git library  
**GitHub Integration:** @octokit/rest

---

## Epic Breakdown

### Epic 1: Classification Utilities Foundation

**Goal:** Create the core classification system that scores issues 1-10 and attaches a confidence rating to each score.

#### Issue 1.1: Classification Data Structures

**Files to Create:**
- `scripts/utils/classification-types.ts`

**Type Definitions:**
```typescript
export type ComplexityScore = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10;

export type RiskLevel = 'low' | 'medium' | 'high';

export type RiskZone = 'green' | 'yellow' | 'orange' | 'red';

export interface ScoreAdjustment {
  factor: string;           // "Core model change"
  delta: number;            // +2, -1, etc.
  rationale: string;        // Why this adjustment
}

export interface RiskFactor {
  category: string;         // "Test Coverage", "Dependency"
  severity: RiskLevel;
  description: string;
}

export interface IssueAssessment {
  issueNumber: string;
  score: ComplexityScore;
  confidence: number;       // 0-100
  riskLevel: RiskLevel;
  riskZone: RiskZone;
  
  rationale: string[];      // Human-readable explanations
  adjustments: ScoreAdjustment[];
  riskFactors: RiskFactor[];
  
  estimatedHours: number;
  recommendedAssignee: 'ai' | 'junior' | 'mid' | 'senior' | 'staff';
  
  // Metadata
  assessedAt: Date;
  assessmentVersion: string; // For tracking algorithm changes
}

export interface ClassificationContext {
  issueNumber: string;      // Carried through to IssueAssessment.issueNumber
  
  // From repo analysis
  coreFiles: string[];
  testCoverage: Record<string, number>; // file -> coverage fraction (0-1)
  changeFrequency: Record<string, number>; // file -> changes/month
  
  // From issue content
  affectedFiles: string[];
  estimatedLines: number;
  hasBreakingChanges: boolean;
  requiresSchemas: boolean;
  
  // Historical
  similarIssues: SimilarIssue[];
}

export interface SimilarIssue {
  number: string;
  title: string;
  predictedScore: number;
  actualScore: number;
  actualHours: number;      // Actual effort; blended into ClassificationScorer.estimateHours
  fileOverlap: number; // 0-1 similarity
}
```
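Several call sites narrow a plain `number` to `ComplexityScore` with an `as` cast, which TypeScript does not check at runtime. A small guard like the one below keeps the 1-10 invariant explicit. The names `isComplexityScore` and `toComplexityScore` are hypothetical and not part of this spec; the Zod schemas named in the acceptance criteria could wrap the same check.

```typescript
type ComplexityScore = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10;

/** True when `n` is an integer in the 1-10 range (hypothetical helper). */
function isComplexityScore(n: number): n is ComplexityScore {
  return Number.isInteger(n) && n >= 1 && n <= 10;
}

/**
 * Round and clamp any finite number into the 1-10 range (hypothetical helper).
 * Non-finite input (NaN, Infinity) is rejected rather than silently clamped.
 */
function toComplexityScore(n: number): ComplexityScore {
  if (!Number.isFinite(n)) {
    throw new RangeError(`Cannot convert ${n} to a complexity score`);
  }
  const clamped = Math.max(1, Math.min(10, Math.round(n)));
  if (!isComplexityScore(clamped)) {
    // Unreachable: clamping a rounded number into [1, 10] always yields a valid score.
    throw new RangeError(`Unexpected score ${clamped}`);
  }
  return clamped; // Narrowed to ComplexityScore by the type guard above
}
```

With this in place, `calculateBaseScore` and `scoreIssue` could return `toComplexityScore(score)` instead of casting.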

**Acceptance Criteria:**
- [ ] All types exported from single module
- [ ] Types validated with Zod schemas
- [ ] JSDoc comments on all public interfaces
- [ ] Examples in comments

---

#### Issue 1.2: Base Scoring Algorithm

**Files to Create:**
- `scripts/utils/classification-scorer.ts`

**Core Algorithm:**
```typescript
import {
  ClassificationContext,
  ComplexityScore,
  IssueAssessment,
  RiskLevel,
  RiskZone,
  ScoreAdjustment
} from './classification-types';

export class ClassificationScorer {
  
  /**
   * Calculate initial base score from issue characteristics
   */
  private calculateBaseScore(context: ClassificationContext): number {
    let score = 5; // Start at medium
    
    // File count adjustment
    const fileCount = context.affectedFiles.length;
    if (fileCount === 1) score -= 1;
    if (fileCount >= 3) score += 1;
    if (fileCount >= 5) score += 1;
    
    // Lines changed adjustment
    const lines = context.estimatedLines;
    if (lines < 10) score -= 1;
    if (lines >= 50) score += 1;
    if (lines >= 100) score += 2;
    
    // Core files adjustment
    const touchesCoreFile = context.affectedFiles.some(f => 
      context.coreFiles.includes(f)
    );
    if (touchesCoreFile) score += 2;
    
    // Breaking changes
    if (context.hasBreakingChanges) score += 2;
    
    // Schema changes
    if (context.requiresSchemas) score += 1;
    
    // Clamp to 1-10
    return Math.max(1, Math.min(10, Math.round(score)));
  }
  
  /**
   * Apply test coverage adjustments
   */
  private adjustForTestCoverage(
    baseScore: number,
    context: ClassificationContext
  ): { score: number; adjustment: ScoreAdjustment | null } {
    
    const avgCoverage = this.calculateAverageCoverage(
      context.affectedFiles,
      context.testCoverage
    );
    
    if (avgCoverage < 0.5) {
      return {
        score: baseScore + 1,
        adjustment: {
          factor: "Low Test Coverage",
          delta: +1,
          rationale: `Affected files have ${Math.round(avgCoverage * 100)}% test coverage`
        }
      };
    }
    
    if (avgCoverage > 0.9) {
      return {
        score: baseScore - 1,
        adjustment: {
          factor: "High Test Coverage",
          delta: -1,
          rationale: `Affected files have ${Math.round(avgCoverage * 100)}% test coverage`
        }
      };
    }
    
    return { score: baseScore, adjustment: null };
  }
  
  /**
   * Apply change frequency adjustments (volatile code is riskier)
   */
  private adjustForChangeFrequency(
    baseScore: number,
    context: ClassificationContext
  ): { score: number; adjustment: ScoreAdjustment | null } {
    
    const avgFrequency = this.calculateAverageChangeFrequency(
      context.affectedFiles,
      context.changeFrequency
    );
    
    // High change frequency = code is volatile
    if (avgFrequency > 10) { // >10 changes/month
      return {
        score: baseScore + 1,
        adjustment: {
          factor: "High Change Frequency",
          delta: +1,
          rationale: `Files change ${avgFrequency.toFixed(1)} times/month (volatile)`
        }
      };
    }
    
    return { score: baseScore, adjustment: null };
  }
  
  /**
   * Calculate confidence based on historical accuracy
   */
  private calculateConfidence(
    score: number,
    context: ClassificationContext
  ): number {
    
    // Start with base confidence
    let confidence = 75;
    
    // Adjust based on similar issues
    if (context.similarIssues.length >= 3) {
      const avgVariance = this.calculateAverageVariance(
        context.similarIssues
      );
      
      // Low variance = high confidence
      if (avgVariance < 1.5) confidence += 15;
      if (avgVariance > 3) confidence -= 15;
    } else {
      // Few similar issues = less confident
      confidence -= 10;
    }
    
    // Clamp to 0-100
    return Math.max(0, Math.min(100, confidence));
  }
  
  /**
   * Main scoring function
   */
  public scoreIssue(context: ClassificationContext): IssueAssessment {
    // Calculate base score
    const baseScore = this.calculateBaseScore(context);
    
    // Apply adjustments
    const adjustments: ScoreAdjustment[] = [];
    let currentScore = baseScore;
    
    const coverageResult = this.adjustForTestCoverage(currentScore, context);
    currentScore = coverageResult.score;
    if (coverageResult.adjustment) {
      adjustments.push(coverageResult.adjustment);
    }
    
    const frequencyResult = this.adjustForChangeFrequency(currentScore, context);
    currentScore = frequencyResult.score;
    if (frequencyResult.adjustment) {
      adjustments.push(frequencyResult.adjustment);
    }
    
    // Clamp final score
    const finalScore = Math.max(1, Math.min(10, currentScore)) as ComplexityScore;
    
    // Calculate confidence
    const confidence = this.calculateConfidence(finalScore, context);
    
    // Determine risk level and zone
    const { riskLevel, riskZone } = this.determineRiskLevel(finalScore);
    
    // Generate rationale
    const rationale = this.generateRationale(
      baseScore,
      finalScore,
      adjustments,
      context
    );
    
    // Extract risk factors
    const riskFactors = this.extractRiskFactors(context);
    
    // Estimate hours
    const estimatedHours = this.estimateHours(finalScore, context);
    
    // Recommend assignee
    const recommendedAssignee = this.recommendAssignee(finalScore);
    
    return {
      issueNumber: context.issueNumber,
      score: finalScore,
      confidence,
      riskLevel,
      riskZone,
      rationale,
      adjustments,
      riskFactors,
      estimatedHours,
      recommendedAssignee,
      assessedAt: new Date(),
      assessmentVersion: '1.0.0'
    };
  }
  
  /**
   * Map complexity score to risk zone
   * Zones define AI/human collaboration model:
   * - green (1-3): ai-solo - AI-autonomous, human reactive only
   * - yellow (4-6): ai-led - AI-led, human validates  
   * - orange (7-8): ai-assisted - Human-led, AI assists
   * - red (9-10): ai-limited - Human-owned
   * 
   * Assignment rules configured in config/roadcrew.yml
   */
  private determineRiskLevel(score: ComplexityScore): {
    riskLevel: RiskLevel;
    riskZone: RiskZone;
  } {
    if (score <= 3) return { riskLevel: 'low', riskZone: 'green' };      // ai-solo
    if (score <= 6) return { riskLevel: 'medium', riskZone: 'yellow' };  // ai-led
    if (score <= 8) return { riskLevel: 'high', riskZone: 'orange' };    // ai-assisted
    return { riskLevel: 'high', riskZone: 'red' };                        // ai-limited
  }
  
  /**
   * Recommend assignee type based on complexity score
   * Maps to resource types defined in config/roadcrew.yml
   */
  private recommendAssignee(score: ComplexityScore): IssueAssessment['recommendedAssignee'] {
    if (score <= 3) return 'ai';       // 1-3 (green, ai-solo): GitHub Copilot, autonomous
    if (score <= 5) return 'junior';   // 4-5 (yellow, ai-led): Jr. engineer with AI support
    if (score <= 7) return 'mid';      // 6-7 (yellow/orange): Mid-level engineer, AI helps
    if (score <= 9) return 'senior';   // 8-9 (orange/red): Senior engineer
    return 'staff';                    // 10 (red, ai-limited): Staff/expert engineer
  }
  
  private estimateHours(score: ComplexityScore, context: ClassificationContext): number {
    // Base estimate from score
    const baseHours = [0.5, 1, 2, 3, 5, 8, 13, 21, 34, 55][score - 1];
    
    // Adjust based on similar issues
    if (context.similarIssues.length > 0) {
      const avgActualHours = context.similarIssues.reduce((sum, issue) => 
        sum + issue.actualHours, 0
      ) / context.similarIssues.length;
      
      // Blend historical with base estimate (70/30)
      return avgActualHours * 0.7 + baseHours * 0.3;
    }
    
    return baseHours;
  }
}
```
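The scorer calls helpers it does not define. Below is one possible sketch of `calculateAverageCoverage` and `calculateAverageVariance`, written as standalone functions. It assumes that files with no coverage entry are skipped rather than counted as 0%, and that "variance" means the mean absolute gap between predicted and actual scores. Returning `undefined` when no data exists lets the caller skip the coverage adjustment; that matters because `null < 0.5` evaluates to `true` in JavaScript, whereas `undefined < 0.5` is `false`.

```typescript
// Minimal shape of SimilarIssue needed by this sketch
interface SimilarIssueScores {
  predictedScore: number;
  actualScore: number;
}

/**
 * Mean coverage (0-1) over the affected files that have coverage data.
 * Returns undefined when no affected file has an entry (assumption: unknown is not 0%).
 */
function calculateAverageCoverage(
  affectedFiles: string[],
  testCoverage: Record<string, number>
): number | undefined {
  const known = affectedFiles
    .map(file => testCoverage[file])
    .filter((c): c is number => typeof c === 'number' && Number.isFinite(c));
  if (known.length === 0) return undefined;
  return known.reduce((sum, c) => sum + c, 0) / known.length;
}

/**
 * Mean absolute error between predicted and actual scores of similar issues.
 * Returns 0 for an empty list (the scorer only calls it with >= 3 issues).
 */
function calculateAverageVariance(issues: SimilarIssueScores[]): number {
  if (issues.length === 0) return 0;
  const totalError = issues.reduce(
    (sum, issue) => sum + Math.abs(issue.actualScore - issue.predictedScore),
    0
  );
  return totalError / issues.length;
}
```

Under this contract, `adjustForTestCoverage` would return `{ score: baseScore, adjustment: null }` whenever the average is `undefined`, which also covers the "empty file list" edge case in the acceptance criteria.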

**Acceptance Criteria:**
- [ ] Scoring algorithm deterministic (same input = same output)
- [ ] All score ranges covered (1-10)
- [ ] Adjustments clearly explained in output
- [ ] Unit tests for each adjustment function
- [ ] Edge cases handled (empty affected-file lists, missing coverage data, etc.)

---

#### Issue 1.3: Historical Data Integration

**Files to Create:**
- `scripts/utils/classification-history.ts`

**Purpose:** Load and query historical implementation metrics for similarity matching.

**Key Functions:**
```typescript
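import { promises as fs } from 'fs';
import * as path from 'path';
import { SimilarIssue } from './classification-types';
// ImplementationMetrics is the stored per-implementation record;
// this module path is an assumption.
import { ImplementationMetrics } from './implementation-metrics';

export class ClassificationHistory {
  
  private metricsDir = 'config/metrics/implementations';
  private cache: ImplementationMetrics[] | null = null; // In-memory cache of loaded metrics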
  
  /**
   * Find issues similar to current issue
   */
  public async findSimilarIssues(
    title: string,
    affectedFiles: string[],
    limit: number = 5
  ): Promise<SimilarIssue[]> {
    
    const allMetrics = await this.loadAllMetrics();
    
    // Calculate similarity scores
    const scored = allMetrics.map(m => ({
      metric: m,
      similarity: this.calculateSimilarity(title, affectedFiles, m)
    }));
    
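    // Drop issues with no file or title overlap (they would inflate confidence
    // and skew hour estimates), then sort by similarity and take top N
    return scored
      .filter(s => s.similarity > 0)
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, limit)
      .map(s => this.toSimilarIssue(s.metric, s.similarity));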
  }
  
  /**
   * Calculate similarity between current issue and historical metric
   */
  private calculateSimilarity(
    title: string,
    files: string[],
    metric: ImplementationMetrics
  ): number {
    
    // Title similarity (simple word overlap)
    const titleScore = this.calculateTitleSimilarity(title, metric.title);
    
    // File overlap (Jaccard index)
    const fileScore = this.calculateFileOverlap(files, metric.filesModified);
    
    // Weighted combination (60% file, 40% title)
    return fileScore * 0.6 + titleScore * 0.4;
  }
  
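  /**
   * Load all historical metrics (cached after the first successful load).
   * A missing metrics directory is treated as "no history yet".
   */
  private async loadAllMetrics(): Promise<ImplementationMetrics[]> {
    if (this.cache) return this.cache;
    
    let files: string[];
    try {
      files = await fs.readdir(this.metricsDir);
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code === 'ENOENT') return [];
      throw err;
    }
    const jsonFiles = files.filter(f => f.endsWith('.json'));
    
    const metrics = await Promise.all(
      jsonFiles.map(async f => {
        const content = await fs.readFile(
          path.join(this.metricsDir, f),
          'utf8'
        );
        return JSON.parse(content) as ImplementationMetrics;
      })
    );
    
    this.cache = metrics;
    return metrics;
  }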
}
```
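`calculateSimilarity` depends on `calculateTitleSimilarity` and `calculateFileOverlap`, which are not shown. A minimal sketch follows, assuming titles are compared as lowercase word sets (ignoring words shorter than three characters) and files as exact path sets, with the Jaccard index the comment names applied to both. The 0.6/0.4 weighting mirrors the spec; the tokenization rule and the `combinedSimilarity` name are assumptions.

```typescript
/** Jaccard index |A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty. */
function jaccard<T>(a: Set<T>, b: Set<T>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach(item => {
    if (b.has(item)) intersection++;
  });
  const union = a.size + b.size - intersection;
  return intersection / union;
}

/** Lowercase word set, dropping punctuation and short words such as "a" or "to". */
function titleWords(title: string): Set<string> {
  return new Set(
    title
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter(word => word.length >= 3)
  );
}

function calculateTitleSimilarity(a: string, b: string): number {
  return jaccard(titleWords(a), titleWords(b));
}

function calculateFileOverlap(a: string[], b: string[]): number {
  return jaccard(new Set(a), new Set(b));
}

/** Same weighting as calculateSimilarity: 60% file overlap, 40% title overlap. */
function combinedSimilarity(
  title: string,
  files: string[],
  otherTitle: string,
  otherFiles: string[]
): number {
  return calculateFileOverlap(files, otherFiles) * 0.6 +
    calculateTitleSimilarity(title, otherTitle) * 0.4;
}
```

Because each pass over the history is linear in the number of records, this stays well within the "<1 second for 1000+ metrics" target once metrics are cached.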

**Acceptance Criteria:**
- [ ] Finds similar issues in <1 second (even with 1000+ historical metrics)
- [ ] Similarity algorithm validated against human judgment
- [ ] Handles missing/incomplete historical data gracefully
- [ ] Caches loaded metrics in memory for performance

---

### Epic 2: Enhanced Repository Analysis

**Goal:** Extend `/analyze-repo` to capture complexity metrics needed for classification.

#### Issue 2.1: Static Code Metrics Collection

**Files to Modify:**
- `scripts/commands/analyze-repo.ts`

**Files to Create:**
- `scripts/utils/code-metrics.ts`

**New Metrics to Collect:**
```typescript
import { RiskLevel } from './classification-types';

export interface CodeComplexityMetrics {
  coreFiles: string[];           // High-risk files
  peripheralFiles: string[];     // Low-risk files
  
  fileMetrics: Record<string, FileMetrics>;
  directoryMetrics: Record<string, DirectoryMetrics>;
  
  complexityHotspots: ComplexityHotspot[];
}

export interface FileMetrics {
  path: string;
  linesOfCode: number;
  functions: number;
  averageFunctionSize: number;
  cyclomaticComplexity: number;
  dependencies: string[];      // Imports
  dependencyDepth: number;     // Max depth in dep graph
  lastModified: Date;
  changeFrequency: number;     // Changes per month (last 6 months)
  testCoverage: number;        // 0-1
}

export interface DirectoryMetrics {
  path: string;
  totalFiles: number;
  totalLines: number;
  averageComplexity: number;
  changeFrequency: number;
  testCoverage: number;
}

export interface ComplexityHotspot {
  path: string;
  reason: string;              // Why it's a hotspot
  score: number;               // 1-10
  riskLevel: RiskLevel;
}
```

**Implementation Approach:**
```typescript
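import { promises as fs } from 'fs';
import * as path from 'path';
import { simpleGit } from 'simple-git';

export class CodeMetricsCollector {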
  
  /**
   * Analyze codebase and generate metrics
   */
  public async collectMetrics(repoPath: string): Promise<CodeComplexityMetrics> {
    
    // Step 1: Find all source files
    const sourceFiles = await this.findSourceFiles(repoPath);
    
    // Step 2: Analyze each file
    const fileMetrics: Record<string, FileMetrics> = {};
    for (const file of sourceFiles) {
      fileMetrics[file] = await this.analyzeFile(file, repoPath);
    }
    
    // Step 3: Aggregate directory metrics
    const directoryMetrics = this.aggregateDirectoryMetrics(fileMetrics);
    
    // Step 4: Classify core vs. peripheral
    const { coreFiles, peripheralFiles } = this.classifyFiles(fileMetrics);
    
    // Step 5: Identify complexity hotspots
    const complexityHotspots = this.identifyHotspots(fileMetrics);
    
    return {
      coreFiles,
      peripheralFiles,
      fileMetrics,
      directoryMetrics,
      complexityHotspots
    };
  }
  
  /**
   * Analyze single file for complexity metrics
   */
  private async analyzeFile(
    filePath: string,
    repoPath: string
  ): Promise<FileMetrics> {
    
    const fullPath = path.join(repoPath, filePath);
    const content = await fs.readFile(fullPath, 'utf8');
    
    // Count lines (excluding comments/whitespace)
    const linesOfCode = this.countLinesOfCode(content);
    
    // Parse and analyze functions
    const ast = this.parseToAST(content, filePath);
    const functions = this.extractFunctions(ast);
    const avgFunctionSize = this.calculateAvgFunctionSize(functions);
    const cyclomaticComplexity = this.calculateCyclomaticComplexity(ast);
    
    // Extract dependencies (imports)
    const dependencies = this.extractDependencies(ast);
    
    // Get git history metrics
    const { lastModified, changeFrequency } = await this.getGitMetrics(
      filePath,
      repoPath
    );
    
    // Get test coverage (if available)
    const testCoverage = await this.getTestCoverage(filePath, repoPath);
    
    return {
      path: filePath,
      linesOfCode,
      functions: functions.length,
      averageFunctionSize: avgFunctionSize,
      cyclomaticComplexity,
      dependencies,
      dependencyDepth: 0, // Calculated later in dependency graph analysis
      lastModified,
      changeFrequency,
      testCoverage
    };
  }
  
  /**
   * Classify files as core (high-risk) or peripheral (low-risk)
   */
  private classifyFiles(
    fileMetrics: Record<string, FileMetrics>
  ): { coreFiles: string[]; peripheralFiles: string[] } {
    
    const coreFiles: string[] = [];
    const peripheralFiles: string[] = [];
    
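    const criticalDirs = ['auth', 'database', 'prisma', 'payments', 'billing'];
    
    // `filePath` (not `path`) avoids shadowing the imported path module
    for (const [filePath, metrics] of Object.entries(fileMetrics)) {
      // Core file indicators:
      // - In critical directories (auth, database, payments)
      // - High change frequency (>5 changes/month)
      // - Many dependents (other files import it); needs the dependency graph, not checked yet
      // - High cyclomatic complexity (>10)
      
      // Compare whole path segments (split on "/", "\" and ".") so that
      // "src/auth/login.ts" and "schema.prisma" match but "author.ts" does not
      const isCriticalDir = filePath
        .split(/[\\/.]/)
        .some(segment => criticalDirs.includes(segment));
      
      const isHighChangeFreq = metrics.changeFrequency > 5;
      const isHighComplexity = metrics.cyclomaticComplexity > 10;
      
      if (isCriticalDir || isHighChangeFreq || isHighComplexity) {
        coreFiles.push(filePath);
      } else {
        peripheralFiles.push(filePath);
      }
    }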
    
    return { coreFiles, peripheralFiles };
  }
  
  /**
   * Get git history metrics for a file
   */
  private async getGitMetrics(
    filePath: string,
    repoPath: string
  ): Promise<{ lastModified: Date; changeFrequency: number }> {
    
    const git = simpleGit(repoPath);
    
    // Get commit history for file
    const log = await git.log({ file: filePath });
    
    if (log.total === 0) {
      return {
        lastModified: new Date(),
        changeFrequency: 0
      };
    }
    
    // Last modified = most recent commit (`latest` is non-null once total > 0)
    const lastModified = new Date(log.latest!.date);
    
    // Change frequency = commits in last 6 months / 6
    const sixMonthsAgo = new Date();
    sixMonthsAgo.setMonth(sixMonthsAgo.getMonth() - 6);
    
    const recentCommits = log.all.filter(commit => 
      new Date(commit.date) > sixMonthsAgo
    );
    
    const changeFrequency = recentCommits.length / 6; // per month
    
    return { lastModified, changeFrequency };
  }
}
```
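`analyzeFile` relies on `countLinesOfCode`, which the spec describes only as "excluding comments/whitespace". One hedged sketch for C-style languages (TypeScript, JavaScript, Go) is below. It is line-based rather than tokenizer-based, so a `//` or `/*` inside a string literal is misread as a comment, and Python files would need `#` handling instead. Treat it as an approximation, not the final counting rule.

```typescript
/**
 * Count non-blank, non-comment lines in C-style source (TS/JS/Go).
 * Approximation: comment markers inside string literals are not detected.
 */
function countLinesOfCode(content: string): number {
  let inBlockComment = false;
  let count = 0;

  for (const rawLine of content.split(/\r?\n/)) {
    let line = rawLine;
    let hasCode = false;

    while (line.length > 0) {
      if (inBlockComment) {
        const end = line.indexOf('*/');
        if (end === -1) break;               // Rest of the line is still inside /* ... */
        inBlockComment = false;
        line = line.slice(end + 2);
        continue;
      }
      const lineComment = line.indexOf('//');
      const blockStart = line.indexOf('/*');
      // Position of the first comment marker, or end of line if there is none
      const cut = Math.min(
        lineComment === -1 ? line.length : lineComment,
        blockStart === -1 ? line.length : blockStart
      );
      if (line.slice(0, cut).trim().length > 0) hasCode = true;
      if (blockStart !== -1 && blockStart === cut) {
        inBlockComment = true;               // Block comment opens; keep scanning after it
        line = line.slice(blockStart + 2);
        continue;
      }
      break;                                 // Line comment or end of line
    }

    if (hasCode) count++;
  }
  return count;
}
```

A line such as `const b = 2; /* opens` counts as code and also switches the scanner into block-comment mode, so the following comment-only lines are excluded.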

**Acceptance Criteria:**
- [ ] Analyzes TypeScript, JavaScript, Python, and Go files
- [ ] Completes analysis of 1000-file repo in <60 seconds
- [ ] Accurately classifies core vs. peripheral files
- [ ] Change frequency calculation verified against git log
- [ ] Test coverage integration works with Jest, Playwright, pytest

---

#### Issue 2.2: Historical Performance Metrics

**Files to Create:**
- `scripts/utils/historical-performance.ts`

**Purpose:** Analyze past implementations to identify patterns and risks.

**Metrics to Calculate:**
```typescript
import { ComplexityScore } from './classification-types';

export interface HistoricalPerformance {
  velocityByScore: Record<ComplexityScore, VelocityMetrics>;
  commonUnderestimates: UnderestimatePattern[];
  highRiskAreas: HighRiskArea[];
  reworkRateByArea: Record<string, number>; // directory -> rework %
}

export interface VelocityMetrics {
  score: ComplexityScore;
  sampleSize: number;
  avgHours: number;
  stdDeviation: number;
  minHours: number;
  maxHours: number;
}

export interface UnderestimatePattern {
  pattern: string;              // "Database migrations", "Auth changes"
  avgPredictedScore: number;
  avgActualScore: number;
  variance: number;
  occurrences: number;
}

export interface HighRiskArea {
  path: string;
  riskScore: number;           // 1-10
  reasons: string[];
  recentFailures: number;      // Last 3 months
  avgReworkRate: number;       // 0-1
}
```

**Implementation:**
```typescript
export class HistoricalPerformanceAnalyzer {
  
  /**
   * Analyze all historical metrics to generate performance insights
   */
  public async analyze(): Promise<HistoricalPerformance> {
    
    const allMetrics = await this.loadAllMetrics();
    
    return {
      velocityByScore: this.calculateVelocityByScore(allMetrics),
      commonUnderestimates: this.identifyUnderestimates(allMetrics),
      highRiskAreas: this.identifyHighRiskAreas(allMetrics),
      reworkRateByArea: this.calculateReworkRates(allMetrics)
    };
  }
  
  /**
   * Calculate average velocity for each complexity score
   */
  private calculateVelocityByScore(
    metrics: ImplementationMetrics[]
  ): Record<ComplexityScore, VelocityMetrics> {
    
    const byScore: Record<number, ImplementationMetrics[]> = {};
    
    // Group by actual score (falling back to the predicted score when no actual score is recorded)
    for (const m of metrics) {
      const score = m.actualScore || m.predictedScore;
      if (!byScore[score]) byScore[score] = [];
      byScore[score].push(m);
    }
    
    // Calculate stats for each score
    const result = {} as Record<ComplexityScore, VelocityMetrics>; // Filled for all 10 scores below
    
    for (let score = 1; score <= 10; score++) {
      const group = byScore[score] || [];
      
      if (group.length === 0) {
        result[score as ComplexityScore] = {
          score: score as ComplexityScore,
          sampleSize: 0,
          avgHours: 0,
          stdDeviation: 0,
          minHours: 0,
          maxHours: 0
        };
        continue;
      }
      
      const hours = group.map(m => m.actualHours);
      const avgHours = hours.reduce((a, b) => a + b, 0) / hours.length;
      const variance = hours.reduce((sum, h) => 
        sum + Math.pow(h - avgHours, 2), 0
      ) / hours.length;
      const stdDeviation = Math.sqrt(variance);
      
      result[score as ComplexityScore] = {
        score: score as ComplexityScore,
        sampleSize: group.length,
        avgHours,
        stdDeviation,
        minHours: Math.min(...hours),
        maxHours: Math.max(...hours)
      };
    }
    
    return result;
  }
  
  /**
   * Identify patterns where we consistently underestimate
   */
  private identifyUnderestimates(
    metrics: ImplementationMetrics[]
  ): UnderestimatePattern[] {
    
    // Group by keywords in the title
    const patterns = new Map<string, ImplementationMetrics[]>();
    
    const keywords = [
      'migration', 'refactor', 'auth', 'database', 'schema',
      'security', 'payment', 'breaking', 'integration'
    ];
    
    for (const m of metrics) {
      const title = m.title?.toLowerCase() || '';
      
      for (const keyword of keywords) {
        if (title.includes(keyword)) {
          if (!patterns.has(keyword)) patterns.set(keyword, []);
          patterns.get(keyword)!.push(m);
        }
      }
    }
    
    // Calculate variance for each pattern
    const underestimates: UnderestimatePattern[] = [];
    
    for (const [pattern, group] of patterns) {
      if (group.length < 3) continue; // Need at least 3 samples
      
      const avgPredicted = group.reduce((sum, m) => 
        sum + m.predictedScore, 0
      ) / group.length;
      
      const avgActual = group.reduce((sum, m) => 
        sum + (m.actualScore || m.predictedScore), 0
      ) / group.length;
      
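      // Mean signed error (actual - predicted), not a statistical variance;
      // positive values mean we underestimated
      const variance = avgActual - avgPredicted;
      
      // Only include if we consistently underestimate (by more than 1.5 points)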
      if (variance > 1.5) {
        underestimates.push({
          pattern,
          avgPredictedScore: avgPredicted,
          avgActualScore: avgActual,
          variance,
          occurrences: group.length
        });
      }
    }
    
    return underestimates.sort((a, b) => b.variance - a.variance);
  }
}
```
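`analyze()` also calls `calculateReworkRates`, which is not shown. A sketch follows under two assumptions: each metrics record carries `filesModified` (as `calculateSimilarity` already uses) and a hypothetical boolean `requiredRework` flag, and an implementation counts once toward every directory it touched. The rate is reworked implementations divided by all implementations touching that directory, returned as a 0-1 fraction to match `HighRiskArea.avgReworkRate`.

```typescript
// Minimal record shape assumed for this sketch; `requiredRework` is hypothetical.
interface ReworkSample {
  filesModified: string[];
  requiredRework: boolean;
}

/** Directory of a POSIX-style relative path ("." for top-level files). */
function directoryOf(file: string): string {
  const slash = file.lastIndexOf('/');
  return slash === -1 ? '.' : file.slice(0, slash);
}

/**
 * directory -> fraction (0-1) of implementations touching it that needed rework.
 * An implementation touching several files in one directory counts once there.
 */
function calculateReworkRates(metrics: ReworkSample[]): Record<string, number> {
  const touched = new Map<string, number>();
  const reworked = new Map<string, number>();

  for (const m of metrics) {
    const dirs = new Set(m.filesModified.map(directoryOf));
    dirs.forEach(dir => {
      touched.set(dir, (touched.get(dir) ?? 0) + 1);
      if (m.requiredRework) reworked.set(dir, (reworked.get(dir) ?? 0) + 1);
    });
  }

  const rates: Record<string, number> = {};
  touched.forEach((total, dir) => {
    rates[dir] = (reworked.get(dir) ?? 0) / total;
  });
  return rates;
}
```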

**Acceptance Criteria:**
- [ ] Velocity calculations match manual verification
- [ ] Underestimate patterns reported only when backed by at least 3 samples
- [ ] High-risk areas include clear rationale
- [ ] Rework rates calculated per directory
- [ ] Performance analyzer runs in <5 seconds

---

#### Issue 2.3: Update Repo Analysis Output Format

**Files to Modify:**
- `scripts/commands/analyze-repo.ts`
- `templates/repo-analysis.template.md`

**New Sections to Add:**

```markdown
## Complexity Metrics

### Core vs. Peripheral Code
**Core Files (High Risk):** {{coreFiles.length}} files
{{#each coreFiles}}
- {{this}}
{{/each}}

**Peripheral Files (Low Risk):** {{peripheralFiles.length}} files

### Change Frequency (Last 6 Months)
| File/Directory | Changes/Month | Risk Level |
|----------------|---------------|------------|
{{#each topChangeFrequency}}
| {{path}} | {{frequency}} | {{riskLevel}} |
{{/each}}

### Test Coverage by Area
| Directory | Coverage | Files | Risk Level |
|-----------|----------|-------|------------|
{{#each directoryMetrics}}
| {{path}} | {{coverage}}% | {{fileCount}} | {{riskLevel}} |
{{/each}}

### Complexity Hotspots
{{#each complexityHotspots}}
**{{path}}** (Score: {{score}}/10)
- {{reason}}
- Risk Level: {{riskLevel}}
{{/each}}

## Historical Performance

### Velocity by Complexity Score
| Score | Avg Hours | Std Dev | Samples |
|-------|-----------|---------|---------|
{{#each velocityByScore}}
| {{score}} | {{avgHours}} | {{stdDeviation}} | {{sampleSize}} |
{{/each}}

### Common Underestimates
{{#each commonUnderestimates}}
**{{pattern}}** ({{occurrences}} occurrences)
- Predicted: {{avgPredictedScore}}/10
- Actual: {{avgActualScore}}/10
- Variance: +{{variance}}
{{/each}}

### High-Risk Areas
{{#each highRiskAreas}}
**{{path}}** (Risk: {{riskScore}}/10)
- Recent Failures: {{recentFailures}} in last 3 months
- Rework Rate: {{reworkRate}}%
- Reasons:
{{#each reasons}}
  - {{this}}
{{/each}}
{{/each}}
```
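The template iterates over `topChangeFrequency`, which `CodeComplexityMetrics` does not define, so the command has to derive it. A hedged sketch: sort files by `changeFrequency`, keep the top N, and bucket risk using thresholds already in this spec (above 10 changes/month is `high`, matching the scorer's volatility adjustment; above 5 is `medium`, matching the core-file rule). The function names, the default N of 10, and the one-decimal formatting are assumptions.

```typescript
type RiskLevel = 'low' | 'medium' | 'high';

interface ChangeFrequencyRow {
  path: string;
  frequency: string;   // Pre-formatted for the markdown table, e.g. "7.5"
  riskLevel: RiskLevel;
}

/** Bucket changes/month using thresholds borrowed from the scorer and classifier. */
function changeFrequencyRisk(changesPerMonth: number): RiskLevel {
  if (changesPerMonth > 10) return 'high';
  if (changesPerMonth > 5) return 'medium';
  return 'low';
}

/** Build the template's `topChangeFrequency` rows from per-file metrics (hypothetical). */
function buildTopChangeFrequency(
  fileMetrics: Record<string, { changeFrequency: number }>,
  limit = 10
): ChangeFrequencyRow[] {
  return Object.entries(fileMetrics)
    // Highest frequency first; ties broken by path so output is deterministic
    .sort(([pathA, a], [pathB, b]) =>
      b.changeFrequency - a.changeFrequency || pathA.localeCompare(pathB))
    .slice(0, limit)
    .map(([path, m]) => ({
      path,
      frequency: m.changeFrequency.toFixed(1),
      riskLevel: changeFrequencyRisk(m.changeFrequency),
    }));
}
```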

**Acceptance Criteria:**
- [ ] New sections added to repo analysis template
- [ ] All metrics displayed in human-readable format
- [ ] Color coding for risk levels (if the terminal supports it)
- [ ] Summary statistics at top (total files, avg complexity, etc.)
- [ ] JSON output option for programmatic access

---

### Epic 3: AI Self-Assessment Integration

**Goal:** Integrate the classification engine into the issue creation and analysis workflows.

#### Issue 3.1: Issue Content Parser

**Files to Create:**
- `scripts/utils/issue-parser.ts`

**Purpose:** Extract classification context from issue markdown.

```typescript
export interface ParsedIssueContent {
  title: string;
  description: string;
  acceptanceCriteria: string[];
  technicalApproach?: string;
  affectedFiles: string[];
  dependencies: string[];
  labels: string[];
  epicContext?: {
    epicNumber: string;
    epicTitle: string;
  };
}

export class IssueContentParser {
  
  /**
   * Parse issue markdown to extract classification context
   */
  public parseIssue(markdown: string): ParsedIssueContent {
    
    const sections = this.splitIntoSections(markdown);
    
    return {
      title: this.extractTitle(markdown),
      description: this.extractDescription(sections),
      acceptanceCriteria: this.extractAcceptanceCriteria(sections),
      technicalApproach: this.extractTechnicalApproach(sections),
      affectedFiles: this.extractAffectedFiles(sections),
      dependencies: this.extractDependencies(sections),
      labels: this.extractLabels(markdown),
      epicContext: this.extractEpicContext(markdown)
    };
  }
  
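  /**
   * Extract files mentioned in the Technical Implementation,
   * Files to Modify, and Files to Create sections
   */
  private extractAffectedFiles(sections: Map<string, string>): string[] {
    
    const techSection = ['Technical Implementation', 'Files to Modify', 'Files to Create']
      .map(name => sections.get(name) ?? '')
      .join('\n');
    
    // Match file paths (src/*, scripts/*, etc.), including paths wrapped in
    // backticks, quotes, or parentheses. The `m` flag lets ^ match at each line
    // start; extensions must start with a letter (so "v1.0.0" is skipped) and
    // may be up to 6 characters (so "schema.prisma" is found).
    const fileRegex = /(?:^|[\s`'"(])([\w./-]+\.[a-z][a-z0-9]{0,5})\b/gim;
    const matches = techSection.matchAll(fileRegex);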
    
    const files = new Set<string>();
    for (const match of matches) {
      files.add(match[1]);
    }
    
    return Array.from(files);
  }
  
  /**
   * Estimate lines of code from description
   */
  public estimateLinesOfCode(parsed: ParsedIssueContent): number {
    
    const description = parsed.description.toLowerCase();
    const approach = parsed.technicalApproach?.toLowerCase() || '';
    
    // Keywords that indicate size
    const smallKeywords = ['rename', 'delete', 'move', 'comment', 'readme'];
    const mediumKeywords = ['add', 'update', 'modify', 'fix'];
    const largeKeywords = ['refactor', 'migrate', 'implement', 'create'];
    
    const hasSmall = smallKeywords.some(k => 
      description.includes(k) || approach.includes(k)
    );
    const hasLarge = largeKeywords.some(k => 
      description.includes(k) || approach.includes(k)
    );
    
    if (hasSmall) return 10;
    if (hasLarge) return 100;
    return 50; // Medium default
  }
}
```
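`parseIssue` starts from `splitIntoSections`, and `extractAffectedFiles` looks sections up by names such as `Files to Modify`. In this spec those names appear as bold labels (`**Files to Modify:**`) as well as `#` headings, so the sketch below treats both as section starts. The label rule (a line consisting only of `**Label:**` or `**Label**`) is an assumption about the issue template format, and headings inside fenced code blocks are not special-cased.

```typescript
/**
 * Split issue markdown into sections keyed by heading text.
 * Section starts: ATX headings ("## Title") or a line that is only a bold label ("**Title:**").
 * Text before the first section start is stored under the empty key "".
 */
function splitIntoSections(markdown: string): Map<string, string> {
  const sections = new Map<string, string>();
  const headingRe = /^#{1,6}\s+(.+?)(?:\s+#+)?\s*$/;   // Optional closing #s need a space before them
  const boldLabelRe = /^\*\*([^*]+?):?\*\*:?\s*$/;      // "**Label:**" or "**Label**:"

  let current = '';
  let buffer: string[] = [];
  const flush = () => {
    const body = buffer.join('\n').trim();
    // Append if the same heading appears twice rather than overwriting
    const previous = sections.get(current);
    sections.set(current, previous ? `${previous}\n${body}` : body);
    buffer = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    const match = headingRe.exec(line) ?? boldLabelRe.exec(line.trim());
    if (match) {
      flush();
      current = match[1].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
}
```

A line like `**Goal:** some text` is not treated as a section start, because text follows the label.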

**Acceptance Criteria:**
- [ ] Parses all standard issue template sections
- [ ] Extracts file paths with 90%+ accuracy
- [ ] Handles malformed markdown gracefully
- [ ] Line estimation within 2x of actual (validated on 20 past issues)

---

#### Issue 3.2: Classification Engine Integration

**Files to Create:**
- `scripts/utils/classify-issue.ts`

**Purpose:** Main entry point for classifying an issue.

```typescript
import { ClassificationScorer } from './classification-scorer';
import { IssueContentParser, ParsedIssueContent } from './issue-parser';
import { ClassificationHistory } from './classification-history';
import { CodeComplexityMetrics } from './code-metrics';
import { ClassificationContext, IssueAssessment } from './classification-types';

export class IssueClassifier {
  
  private scorer: ClassificationScorer;
  private parser: IssueContentParser;
  private history: ClassificationHistory;
  
  constructor() {
    this.scorer = new ClassificationScorer();
    this.parser = new IssueContentParser();
    this.history = new ClassificationHistory();
  }
  
  /**
   * Classify an issue and return full assessment
   */
  public async classifyIssue(
    issueNumber: string,
    issueMarkdown: string,
    repoMetrics: CodeComplexityMetrics
  ): Promise<IssueAssessment> {
    
    // Step 1: Parse issue content
    const parsed = this.parser.parseIssue(issueMarkdown);
    
    // Step 2: Find similar historical issues
    const similarIssues = await this.history.findSimilarIssues(
      parsed.title,
      parsed.affectedFiles
    );
    
    // Step 3: Build classification context
    const context: ClassificationContext = {
      issueNumber,
      coreFiles: repoMetrics.coreFiles,
      testCoverage: this.extractTestCoverage(
        parsed.affectedFiles,
        repoMetrics
      ),
      changeFrequency: this.extractChangeFrequency(
        parsed.affectedFiles,
        repoMetrics
      ),
      affectedFiles: parsed.affectedFiles,
      estimatedLines: this.parser.estimateLinesOfCode(parsed),
      hasBreakingChanges: this.detectBreakingChanges(parsed),
      requiresSchema: this.detectSchemaChanges(parsed),
      similarIssues
    };
    
    // Step 4: Score the issue
    const assessment = this.scorer.scoreIssue(context);
    
    // Step 5: Store assessment for future calibration
    await this.storeAssessment(assessment);
    
    return assessment;
  }
  
  /**
   * Detect if issue involves breaking changes
   */
  private detectBreakingChanges(parsed: ParsedIssueContent): boolean {
    
    const text = (parsed.description + ' ' + 
      (parsed.technicalApproach || '')).toLowerCase();
    
    const breakingKeywords = [
      'breaking change',
      'breaking api',
      'remove deprecated',
      'major version',
      'incompatible'
    ];
    
    return breakingKeywords.some(k => text.includes(k));
  }
  
  /**
   * Detect if issue involves schema/database changes
   */
  private detectSchemaChanges(parsed: ParsedIssueContent): boolean {
    
    // Check if affects schema files
    const affectsSchema = parsed.affectedFiles.some(f => 
      f.includes('schema.prisma') || 
      f.includes('migration') ||
      f.includes('models/')
    );
    
    if (affectsSchema) return true;
    
    // Check description for schema keywords
    const text = (parsed.description + ' ' + 
      (parsed.technicalApproach || '')).toLowerCase();
    
    const schemaKeywords = [
      'database',
      'schema',
      'migration',
      'prisma',
      'add column',
      'remove field'
    ];
    
    return schemaKeywords.some(k => text.includes(k));
  }
}
```

**Acceptance Criteria:**
- [ ] Classification completes in <5 seconds
- [ ] All context sources integrated (repo, history, issue content)
- [ ] Assessment stored for calibration
- [ ] Handles missing repo metrics gracefully (degraded mode)
- [ ] Unit tests cover all branches
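One way to keep classification under the 5-second budget and degrade gracefully is to bound each slow context source and fall back to an empty result. A minimal sketch; `withTimeout` is a hypothetical helper, and the 2-second budget in the usage note is an arbitrary example:

```typescript
/**
 * Resolve with the task's result, or with `fallback` if the task takes longer than `ms`.
 * The underlying task is not cancelled; its late result is simply ignored.
 * If the task rejects before the timeout, the rejection propagates to the caller.
 */
export async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>(resolve => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([task, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive
    clearTimeout(timer);
  }
}
```

For example, `await withTimeout(this.history.findSimilarIssues(parsed.title, parsed.affectedFiles), 2000, [])` would treat a slow history lookup as "no similar issues" instead of blocking the whole assessment.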

---

#### Issue 3.3: Integrate into `/scope-release`

**Files to Modify:**
- `scripts/commands/scope-release.ts`

**Changes:**
```typescript
// After creating issues, classify each one
for (const issue of createdIssues) {
  const issueContent = await gh.issues.get({
    owner,
    repo,
    issue_number: issue.number
  });
  
  const assessment = await classifier.classifyIssue(
    issue.number.toString(),
    issueContent.data.body || '',
    repoMetrics
  );
  
  // Add classification labels
  await gh.issues.addLabels({
    owner,
    repo,
    issue_number: issue.number,
    labels: [
      `score-${assessment.score}`,
      `risk-${assessment.riskLevel}`,
      assessment.riskZone
    ]
  });
  
  // Add classification comment
  await gh.issues.createComment({
    owner,
    repo,
    issue_number: issue.number,
    body: formatClassificationComment(assessment)
  });
}

function formatClassificationComment(assessment: IssueAssessment): string {
  const emoji = {
    green: '🟢',
    yellow: '🟡',
    orange: '🟠',
    red: '🔴'
  }[assessment.riskZone];
  
  return `## ${emoji} AI Classification

**Complexity Score:** ${assessment.score}/10 (${assessment.confidence}% confident)
**Risk Level:** ${assessment.riskLevel.toUpperCase()}
**Estimated Effort:** ${assessment.estimatedHours} hours
**Recommended Assignee:** ${assessment.recommendedAssignee}

### Rationale
${assessment.rationale.map(r => `- ${r}`).join('\n')}

${assessment.adjustments.length > 0 ? `
### Score Adjustments
${assessment.adjustments.map(a => 
  `- **${a.factor}** (${a.delta > 0 ? '+' : ''}${a.delta}): ${a.rationale}`
).join('\n')}
` : ''}

${assessment.riskFactors.length > 0 ? `
### Risk Factors
${assessment.riskFactors.map(r => 
  `- **${r.category}** (${r.severity}): ${r.description}`
).join('\n')}
` : ''}

---
*Classification Version: ${assessment.assessmentVersion}*
*Assessed: ${assessment.assessedAt.toISOString()}*
`;
}
```

**Acceptance Criteria:**
- [ ] All created issues automatically classified
- [ ] Classification labels applied correctly
- [ ] Classification comment formatted clearly
- [ ] No impact on existing `/scope-release` functionality
- [ ] Works for both current-release and minor-release
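The "no impact on existing `/scope-release` functionality" criterion implies that a classification failure (API error, unparseable issue body) should not abort the loop after the issues have already been created. A minimal sketch of a wrapper that could surround the per-issue classify/label/comment calls; `bestEffort` is a hypothetical helper name:

```typescript
/**
 * Run a best-effort step. On failure, log a warning and return null instead of
 * throwing, so a classification error never aborts issue creation.
 */
export async function bestEffort<T>(
  label: string,
  step: () => Promise<T>,
  warn: (message: string) => void = console.warn
): Promise<T | null> {
  try {
    return await step();
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    warn(`⚠️  ${label} skipped: ${reason}`);
    return null;
  }
}
```

Wrapping each iteration of the loop keeps one bad issue from preventing the rest from being labeled.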

---

### Epic 4: Metrics Tracking Implementation

**Goal:** Capture comprehensive metrics during AI-assisted implementation.

#### Issue 4.1: Metrics Schema & Storage

**Files to Create:**
- `scripts/utils/metrics-schema.ts`
- `scripts/utils/metrics-storage.ts`

**Metrics Schema:**
```typescript
import { z } from 'zod';

export const ImplementationMetricsSchema = z.object({
  // Identity
  issueNumber: z.string(),
  title: z.string(),
  implementedBy: z.enum(['ai', 'human', 'hybrid']),
  
  // Timing (coerced: dates reloaded from JSON arrive as ISO strings)
  startTime: z.coerce.date(),
  endTime: z.coerce.date(),
  durationMinutes: z.number(),
  
  // AI Usage
  tokensUsed: z.number(),
  aiAutonomyLevel: z.number().int().min(1).max(5),
  humanInterventions: z.number(),
  
  // Code Changes
  filesModified: z.array(z.string()),
  linesAdded: z.number(),
  linesDeleted: z.number(),
  linesChanged: z.number(), // added + deleted
  
  // Quality
  testsAdded: z.number(),
  testsPassing: z.boolean(),
  lintErrors: z.number(),
  typeErrors: z.number(),
  
  // Outcome
  firstAttemptSuccess: z.boolean(),
  reworkRequired: z.boolean(),
  scrapped: z.boolean(),
  
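  // Prediction vs. Actual
  predictedScore: z.number().int().min(1).max(10),
  predictedHours: z.number(),
  confidence: z.number().min(0).max(100).optional(), // classifier confidence (%) at prediction time; read by calibration
  actualScore: z.number().int().min(1).max(10).optional(),
  actualHours: z.number(),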
  
  // Metadata
  command: z.string(), // '/implement-issue' or '/implement-epic'
  repoVersion: z.string(), // git commit hash
  classificationVersion: z.string()
});

export type ImplementationMetrics = z.infer<typeof ImplementationMetricsSchema>;
```
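When metrics are reloaded, `startTime` and `endTime` come back from `JSON.parse` as ISO-8601 strings because JSON has no date type, and a schema using plain `z.date()` rejects them. Besides coercing in the schema with Zod's `z.coerce.date()`, the loader can revive them before validation. A minimal sketch of that second option, assuming only those two fields hold dates; `reviveMetricDates` is a hypothetical name:

```typescript
// Fields in ImplementationMetrics that hold Date values.
const DATE_FIELDS = new Set(['startTime', 'endTime']);

/**
 * JSON.parse reviver that turns the known date fields back into Date objects.
 * Strings that do not parse as dates are returned unchanged so validation can reject them.
 */
export function reviveMetricDates(key: string, value: unknown): unknown {
  if (DATE_FIELDS.has(key) && typeof value === 'string') {
    const date = new Date(value);
    return Number.isNaN(date.getTime()) ? value : date;
  }
  return value;
}
```

Usage: `ImplementationMetricsSchema.parse(JSON.parse(content, reviveMetricDates))`.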

**Storage Implementation:**
```typescript
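import { promises as fs } from 'fs';
import * as path from 'path';
import { ImplementationMetricsSchema } from './metrics-schema';
import type { ImplementationMetrics } from './metrics-schema';

export class MetricsStorage {
  
  private metricsDir = 'config/metrics/implementations';
  // Epic rollups live outside metricsDir so loadAllMetrics() never parses them
  // as per-issue metrics (directory name is an assumption)
  private epicDir = 'config/metrics/epics';
  private calibrationDir = 'config/metrics/calibration';
  
  /**
   * Save implementation metrics to JSON file
   */
  public async saveMetrics(metrics: ImplementationMetrics): Promise<void> {
    
    // Validate before touching the filesystem
    const validated = ImplementationMetricsSchema.parse(metrics);
    
    // Ensure directory exists
    await fs.mkdir(this.metricsDir, { recursive: true });
    
    // Generate filename: YYYY-MM-DD-issue-N.json (UTC date of startTime)
    const date = validated.startTime.toISOString().split('T')[0];
    const filename = `${date}-issue-${validated.issueNumber}.json`;
    const filepath = path.join(this.metricsDir, filename);
    
    await fs.writeFile(filepath, JSON.stringify(validated, null, 2), 'utf8');
    
    console.log(`✅ Metrics saved: ${filepath}`);
  }
  
  /**
   * Load the most recent metrics for a specific issue (null if none exist)
   */
  public async loadMetrics(issueNumber: string): Promise<ImplementationMetrics | null> {
    
    const suffix = `-issue-${issueNumber}.json`;
    // Date-prefixed names sort chronologically, so the last match is the latest run
    const matchingFile = (await this.listJsonFiles())
      .filter(f => f.endsWith(suffix))
      .sort()
      .pop();
    
    if (!matchingFile) return null;
    
    const content = await fs.readFile(path.join(this.metricsDir, matchingFile), 'utf8');
    return ImplementationMetricsSchema.parse(JSON.parse(content));
  }
  
  /**
   * Load all metrics (for calibration and analysis)
   */
  public async loadAllMetrics(): Promise<ImplementationMetrics[]> {
    
    const jsonFiles = await this.listJsonFiles();
    
    return Promise.all(
      jsonFiles.map(async f => {
        const content = await fs.readFile(path.join(this.metricsDir, f), 'utf8');
        return ImplementationMetricsSchema.parse(JSON.parse(content));
      })
    );
  }
  
  /**
   * Save an epic rollup (Issue 4.3) as YYYY-MM-DD-epic-N.json
   */
  public async saveEpicMetrics(epicMetrics: { epicNumber: string }): Promise<void> {
    await fs.mkdir(this.epicDir, { recursive: true });
    const date = new Date().toISOString().split('T')[0];
    const filepath = path.join(this.epicDir, `${date}-epic-${epicMetrics.epicNumber}.json`);
    await fs.writeFile(filepath, JSON.stringify(epicMetrics, null, 2), 'utf8');
  }
  
  /**
   * Save a calibration report (Issue 5.2) as config/metrics/calibration/YYYY-MM-DD.json
   */
  public async saveCalibrationReport(report: { generatedAt: Date }): Promise<void> {
    await fs.mkdir(this.calibrationDir, { recursive: true });
    const date = report.generatedAt.toISOString().split('T')[0];
    const filepath = path.join(this.calibrationDir, `${date}.json`);
    await fs.writeFile(filepath, JSON.stringify(report, null, 2), 'utf8');
  }
  
  /**
   * List metrics files; a missing directory means "no metrics yet", not an error
   */
  private async listJsonFiles(): Promise<string[]> {
    try {
      const files = await fs.readdir(this.metricsDir);
      return files.filter(f => f.endsWith('.json'));
    } catch (error) {
      if ((error as NodeJS.ErrnoException).code === 'ENOENT') return [];
      throw error;
    }
  }
}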
```

**Acceptance Criteria:**
- [ ] Schema validates all required fields
- [ ] Metrics files stored in consistent format
- [ ] Load operations handle missing files gracefully
- [ ] Concurrent writes don't corrupt data
- [ ] Schema versioned for future evolution
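The "concurrent writes don't corrupt data" criterion is usually met by never writing the target file in place. A minimal sketch, assuming a POSIX filesystem where `rename` within one directory is atomic; `writeFileAtomic` is a hypothetical helper that `saveMetrics` could call instead of `fs.writeFile`:

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

/**
 * Write a file atomically: write a uniquely named temp file in the same
 * directory, then rename it over the target. Readers see either the old
 * file or the complete new one, never a half-written file.
 */
export async function writeFileAtomic(filepath: string, data: string): Promise<void> {
  const dir = path.dirname(filepath);
  await fs.mkdir(dir, { recursive: true });
  // pid + time + random suffix keeps concurrent writers from sharing a temp file
  const unique = `${process.pid}.${Date.now()}.${Math.random().toString(36).slice(2)}`;
  const tmp = path.join(dir, `.${path.basename(filepath)}.${unique}.tmp`);
  try {
    await fs.writeFile(tmp, data, 'utf8');
    await fs.rename(tmp, filepath);
  } catch (error) {
    // Remove the temp file if anything failed, then surface the error
    await fs.rm(tmp, { force: true });
    throw error;
  }
}
```

This prevents torn files; two writers for the same issue on the same day still resolve as last-writer-wins. The temp name ends in `.tmp`, so a loader that reads only `.json` files never picks it up mid-write.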

---

#### Issue 4.2: Metrics Capture in `/implement-issue`

**Files to Modify:**
- `scripts/commands/implement-issue.ts`

**Changes:**
```typescript
import simpleGit from 'simple-git';
import { MetricsStorage } from '../utils/metrics-storage';
import { IssueClassifier } from '../utils/classify-issue';
import type { ImplementationMetrics } from '../utils/metrics-schema';

// Not shown: IssueClassifier.loadAssessment, runTests, runLinter, extractTokenUsage,
// getCurrentCommitHash, and issueTitle (from the existing issue-fetching logic).

export async function implementIssue(issueNumber: string): Promise<void> {
  
  const storage = new MetricsStorage();
  const classifier = new IssueClassifier();
  
  // Load classification (from previous scope-release run); may be missing
  const assessment = await classifier.loadAssessment(issueNumber);
  
  // Start metrics tracking
  const startTime = new Date();
  let tokensUsed = 0;
  let humanInterventions = 0;
  
  // Track token usage: wrap fetch and inspect responses from the Anthropic API
  const originalFetch = globalThis.fetch;
  globalThis.fetch = async (...args: Parameters<typeof fetch>) => {
    const response = await originalFetch(...args);
    const input = args[0];
    const url = typeof input === 'string' ? input
      : input instanceof URL ? input.href
      : input.url; // Request object
    if (url.includes('anthropic')) {
      // Read usage from a clone so the caller can still consume the body
      tokensUsed += await extractTokenUsage(response.clone());
    }
    return response;
  };
  
  try {
    // Implement issue (existing logic)
    // ...
    
    // Track human interventions
    // (increment humanInterventions counter when user input required)
    
    // Get git diff stats for the implementation commit. --numstat gives exact
    // per-file counts; --stat scales its +/- bar, so counting symbols undercounts.
    const git = simpleGit();
    const diff = await git.diff(['--numstat', 'HEAD~1', 'HEAD']);
    const { filesModified, linesAdded, linesDeleted } = parseGitDiff(diff);
    
    // Check test results
    const testResults = await runTests();
    
    // Check linting
    const lintResults = await runLinter();
    
    // Calculate actual hours
    const endTime = new Date();
    const durationMinutes = (endTime.getTime() - startTime.getTime()) / 60000;
    const actualHours = durationMinutes / 60;
    
    // Determine outcome
    const firstAttemptSuccess = humanInterventions === 0 && testResults.passing;
    const reworkRequired = humanInterventions > 2;
    const scrapped = false; // Would be true if we abandoned the implementation
    
    // Build metrics object
    const metrics: ImplementationMetrics = {
      issueNumber,
      title: issueTitle,
      implementedBy: 'ai', // or 'human' or 'hybrid'
      
      startTime,
      endTime,
      durationMinutes,
      
      tokensUsed,
      aiAutonomyLevel: determineAutonomyLevel(humanInterventions),
      humanInterventions,
      
      filesModified,
      linesAdded,
      linesDeleted,
      linesChanged: linesAdded + linesDeleted,
      
      testsAdded: testResults.testsAdded,
      testsPassing: testResults.passing,
      lintErrors: lintResults.errorCount,
      typeErrors: lintResults.typeErrorCount,
      
      firstAttemptSuccess,
      reworkRequired,
      scrapped,
      
      predictedScore: assessment?.score || 5,
      predictedHours: assessment?.estimatedHours || 0,
      actualHours,
      
      command: '/implement-issue',
      repoVersion: await getCurrentCommitHash(),
      classificationVersion: '1.0.0'
    };
    
    // Save metrics
    await storage.saveMetrics(metrics);
    
    // Show metrics summary
    console.log('\n📊 Implementation Metrics:');
    console.log(`   Duration: ${actualHours.toFixed(2)} hours`);
    console.log(`   Tokens Used: ${tokensUsed.toLocaleString()}`);
    console.log(`   Files Modified: ${filesModified.length}`);
    console.log(`   Lines Changed: +${linesAdded}/-${linesDeleted}`);
    console.log(`   Tests Added: ${testResults.testsAdded}`);
    console.log(`   First Attempt Success: ${firstAttemptSuccess ? '✅' : '❌'}`);
  } finally {
    // Restore original fetch even if the implementation throws
    globalThis.fetch = originalFetch;
  }
}

function determineAutonomyLevel(interventions: number): 1 | 2 | 3 | 4 | 5 {
  if (interventions === 0) return 5; // Full autonomy
  if (interventions <= 2) return 4; // High autonomy
  if (interventions <= 5) return 3; // Medium autonomy
  if (interventions <= 10) return 2; // Low autonomy
  return 1; // Minimal autonomy (human-driven)
}

function parseGitDiff(diffOutput: string): {
  filesModified: string[];
  linesAdded: number;
  linesDeleted: number;
} {
  const filesModified: string[] = [];
  let linesAdded = 0;
  let linesDeleted = 0;
  
  for (const line of diffOutput.split('\n')) {
    // Match --numstat format: "<added>\t<deleted>\t<path>"; binary files show "-\t-\t<path>"
    const match = line.match(/^(\d+|-)\t(\d+|-)\t(.+)$/);
    if (match) {
      const [, added, deleted, file] = match;
      filesModified.push(file);
      linesAdded += added === '-' ? 0 : Number(added);
      linesDeleted += deleted === '-' ? 0 : Number(deleted);
    }
  }
  
  return { filesModified, linesAdded, linesDeleted };
}
```

**Acceptance Criteria:**
- [ ] All metrics captured accurately
- [ ] Token usage tracked for Anthropic API calls
- [ ] Human intervention count accurate
- [ ] Git diff parsed correctly
- [ ] Metrics saved before command exits
- [ ] Works for both successful and failed implementations
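`extractTokenUsage` is called in the command but not defined. A minimal sketch, assuming non-streaming calls to the Anthropic Messages API, whose JSON response carries a `usage` object with `input_tokens` and `output_tokens`; cache-related token fields are ignored and streaming (`text/event-stream`) responses report 0 here. `tokensFromBody` is a hypothetical helper split out so the parsing can be tested without a network call:

```typescript
/** The part of a Messages API response body this sketch reads. */
interface AnthropicUsage {
  input_tokens?: number;
  output_tokens?: number;
}

/** Sum input and output tokens from a parsed response body; 0 if no usage is present. */
export function tokensFromBody(body: unknown): number {
  if (typeof body !== 'object' || body === null) return 0;
  const usage = (body as { usage?: AnthropicUsage }).usage;
  if (!usage) return 0;
  return (usage.input_tokens ?? 0) + (usage.output_tokens ?? 0);
}

/** Read token usage from a cloned fetch Response; non-JSON bodies are skipped. */
export async function extractTokenUsage(response: Response): Promise<number> {
  const contentType = response.headers.get('content-type') ?? '';
  if (!contentType.includes('application/json')) return 0;
  try {
    return tokensFromBody(await response.json());
  } catch {
    // Malformed body: count nothing rather than failing the implementation
    return 0;
  }
}
```

If the command calls the official Anthropic TypeScript SDK, the returned message object already exposes the same `usage` fields, which avoids inspecting raw responses at all.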

---

#### Issue 4.3: Metrics Capture in `/implement-epic`

**Files to Modify:**
- `scripts/commands/implement-epic.ts`

**Changes:**
```typescript
// Similar to implement-issue, but aggregate metrics across all child issues

export async function implementEpic(epicNumber: string): Promise<void> {
  
  const storage = new MetricsStorage();
  
  // Track epic-level metrics
  const epicStartTime = new Date();
  const childMetrics: ImplementationMetrics[] = [];
  
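  // For each child issue (childIssues and epicTitle come from the existing epic-loading logic, not shown)
  for (const childIssue of childIssues) {
    // GitHub issue numbers are numeric; the metrics helpers take strings
    const childNumber = String(childIssue.number);
    
    // Implement issue (this will save individual metrics)
    await implementIssue(childNumber);
    
    // Load the metrics that were just saved
    const metrics = await storage.loadMetrics(childNumber);
    if (metrics) {
      childMetrics.push(metrics);
    }
  }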
  
  // Calculate aggregated metrics
  const epicEndTime = new Date();
  const epicMetrics = {
    epicNumber,
    epicTitle,
    childIssues: childMetrics.length,
    
    totalDurationMinutes: (epicEndTime.getTime() - epicStartTime.getTime()) / 60000,
    totalTokensUsed: childMetrics.reduce((sum, m) => sum + m.tokensUsed, 0),
    totalHumanInterventions: childMetrics.reduce((sum, m) => sum + m.humanInterventions, 0),
    
    totalFilesModified: new Set(
      childMetrics.flatMap(m => m.filesModified)
    ).size,
    totalLinesAdded: childMetrics.reduce((sum, m) => sum + m.linesAdded, 0),
    totalLinesDeleted: childMetrics.reduce((sum, m) => sum + m.linesDeleted, 0),
    
    totalTestsAdded: childMetrics.reduce((sum, m) => sum + m.testsAdded, 0),
    allTestsPassing: childMetrics.every(m => m.testsPassing),
    
    // Rates fall back to 0 (not NaN) when no child metrics were recorded
    firstAttemptSuccessRate: childMetrics.filter(m => m.firstAttemptSuccess).length / (childMetrics.length || 1),
    reworkRate: childMetrics.filter(m => m.reworkRequired).length / (childMetrics.length || 1),
    scrappedRate: childMetrics.filter(m => m.scrapped).length / (childMetrics.length || 1)
  };
  
  // Save epic metrics
  await storage.saveEpicMetrics(epicMetrics);
  
  // Show epic summary
  console.log('\n📊 Epic Implementation Metrics:');
  console.log(`   Child Issues: ${childMetrics.length}`);
  console.log(`   Total Duration: ${(epicMetrics.totalDurationMinutes / 60).toFixed(2)} hours`);
  console.log(`   Total Tokens: ${epicMetrics.totalTokensUsed.toLocaleString()}`);
  console.log(`   Files Modified: ${epicMetrics.totalFilesModified}`);
  console.log(`   Lines Changed: +${epicMetrics.totalLinesAdded}/-${epicMetrics.totalLinesDeleted}`);
  console.log(`   Tests Added: ${epicMetrics.totalTestsAdded}`);
  console.log(`   First Attempt Success Rate: ${(epicMetrics.firstAttemptSuccessRate * 100).toFixed(1)}%`);
  console.log(`   Rework Rate: ${(epicMetrics.reworkRate * 100).toFixed(1)}%`);
}
```

**Acceptance Criteria:**
- [ ] Epic metrics aggregate all child metrics correctly
- [ ] Epic metrics saved separately from child metrics
- [ ] Summary statistics accurate
- [ ] Works for epics with 1-20 child issues

---

### Epic 5: Calibration & Learning Loop

**Goal:** System learns from completed implementations to improve predictions.

#### Issue 5.1: Calibration Algorithm

**Files to Create:**
- `scripts/utils/calibration.ts`

**Algorithm:**
```typescript
import { MetricsStorage } from './metrics-storage';
import type { ImplementationMetrics } from './metrics-schema';
// ComplexityScore (the 1-10 score type) is assumed to come from the classification engine.

export interface CalibrationReport {
  generatedAt: Date;
  sampleSize: number;
  accuracyByScore: Record<ComplexityScore, AccuracyMetrics>;
  overallAccuracy: number;
  confidenceCalibration: ConfidenceCalibration;
  recommendations: string[];
}

export interface AccuracyMetrics {
  score: ComplexityScore;
  sampleSize: number;
  avgError: number;          // Avg |predicted - actual|
  percentWithin1: number;    // % within ±1 score
  percentWithin2: number;    // % within ±2 scores
}

export interface ConfidenceCalibration {
  // For each confidence bucket, how often the prediction was within ±1 of the actual score
  calibrationCurve: Array<{
    predictedConfidence: number; // 70, 75, 80, etc.
    actualAccuracy: number;      // % correct in this bucket
    sampleSize: number;
  }>;
}

export class Calibrator {
  
  private storage = new MetricsStorage();
  
  /**
   * Calibrate classification system using recent implementations
   */
  public async calibrate(
    lookbackDays: number = 30
  ): Promise<CalibrationReport> {
    
    // Load recent metrics that have an actual score recorded
    const cutoffDate = new Date();
    cutoffDate.setDate(cutoffDate.getDate() - lookbackDays);
    
    const allMetrics = await this.storage.loadAllMetrics();
    const recentMetrics = allMetrics.filter(m => 
      m.startTime > cutoffDate && 
      m.actualScore !== undefined
    );
    
    if (recentMetrics.length < 10) {
      throw new Error('Need at least 10 completed issues with actual scores for calibration');
    }
    
    // Calculate accuracy by score
    const accuracyByScore = this.calculateAccuracyByScore(recentMetrics);
    
    // Calculate overall accuracy
    const overallAccuracy = this.calculateOverallAccuracy(recentMetrics);
    
    // Calibrate confidence
    const confidenceCalibration = this.calibrateConfidence(recentMetrics);
    
    // Generate recommendations
    const recommendations = this.generateRecommendations(
      accuracyByScore,
      confidenceCalibration,
      recentMetrics
    );
    
    return {
      generatedAt: new Date(),
      sampleSize: recentMetrics.length,
      accuracyByScore,
      overallAccuracy,
      confidenceCalibration,
      recommendations
    };
  }
  
  private calculateAccuracyByScore(
    metrics: ImplementationMetrics[]
  ): Record<ComplexityScore, AccuracyMetrics> {
    
    const byScore: Record<number, ImplementationMetrics[]> = {};
    
    // Group by predicted score
    for (const m of metrics) {
      const score = m.predictedScore;
      if (!byScore[score]) byScore[score] = [];
      byScore[score].push(m);
    }
    
    // Calculate accuracy for each score
    const result: Record<ComplexityScore, AccuracyMetrics> = {} as any;
    
    for (let score = 1; score <= 10; score++) {
      const group = byScore[score] || [];
      
      if (group.length === 0) {
        result[score as ComplexityScore] = {
          score: score as ComplexityScore,
          sampleSize: 0,
          avgError: 0,
          percentWithin1: 0,
          percentWithin2: 0
        };
        continue;
      }
      
      const errors = group.map(m => 
        Math.abs(m.predictedScore - (m.actualScore || m.predictedScore))
      );
      
      const avgError = errors.reduce((a, b) => a + b, 0) / errors.length;
      const within1 = errors.filter(e => e <= 1).length / errors.length;
      const within2 = errors.filter(e => e <= 2).length / errors.length;
      
      result[score as ComplexityScore] = {
        score: score as ComplexityScore,
        sampleSize: group.length,
        avgError,
        percentWithin1: within1 * 100,
        percentWithin2: within2 * 100
      };
    }
    
    return result;
  }
  
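  // Exact-match accuracy: % of issues whose predicted score equals the actual score.
  // (The confidence curve below uses the looser "within ±1" definition of correct.)
  private calculateOverallAccuracy(metrics: ImplementationMetrics[]): number {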
    
    const exactMatches = metrics.filter(m => 
      m.predictedScore === m.actualScore
    ).length;
    
    return (exactMatches / metrics.length) * 100;
  }
  
  private calibrateConfidence(
    metrics: ImplementationMetrics[]
  ): ConfidenceCalibration {
    
    // Group by confidence buckets (70, 75, 80, 85, 90, 95, 100)
    const buckets = [70, 75, 80, 85, 90, 95, 100];
    const calibrationCurve: ConfidenceCalibration['calibrationCurve'] = [];
    
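    for (const bucket of buckets) {
      // Metrics in this confidence range, half-open [bucket - 2.5, bucket + 2.5) so a
      // boundary value such as 72.5 is counted in exactly one bucket. Issues with no
      // recorded confidence are skipped rather than defaulted into the 75% bucket.
      const inBucket = metrics.filter(m => {
        if (m.confidence === undefined) return false;
        return m.confidence >= bucket - 2.5 && m.confidence < bucket + 2.5;
      });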
      
      if (inBucket.length === 0) continue;
      
      // Calculate actual accuracy for this bucket
      const correct = inBucket.filter(m => 
        Math.abs(m.predictedScore - (m.actualScore || m.predictedScore)) <= 1
      ).length;
      
      const actualAccuracy = (correct / inBucket.length) * 100;
      
      calibrationCurve.push({
        predictedConfidence: bucket,
        actualAccuracy,
        sampleSize: inBucket.length
      });
    }
    
    return { calibrationCurve };
  }
  
  private generateRecommendations(
    accuracyByScore: Record<ComplexityScore, AccuracyMetrics>,
    confidenceCalibration: ConfidenceCalibration,
    metrics: ImplementationMetrics[]
  ): string[] {
    
    const recommendations: string[] = [];
    
    // Check for scores with low accuracy
    for (const [score, accuracy] of Object.entries(accuracyByScore)) {
      if (accuracy.sampleSize >= 3 && accuracy.percentWithin1 < 60) {
        recommendations.push(
          `⚠️  Score ${score} has low accuracy (${accuracy.percentWithin1.toFixed(1)}% within ±1). ` +
          `Review ${accuracy.sampleSize} past issues at this score and adjust scoring algorithm.`
        );
      }
    }
    
    // Check for overconfidence
    for (const point of confidenceCalibration.calibrationCurve) {
      if (point.sampleSize >= 5) {
        const gap = point.predictedConfidence - point.actualAccuracy;
        if (gap > 15) {
          recommendations.push(
            `⚠️  System is overconfident at ${point.predictedConfidence}% confidence ` +
            `(actual accuracy: ${point.actualAccuracy.toFixed(1)}%). ` +
            `Reduce confidence for similar issues.`
          );
        }
      }
    }
    
    // Check for underconfidence
    for (const point of confidenceCalibration.calibrationCurve) {
      if (point.sampleSize >= 5) {
        const gap = point.actualAccuracy - point.predictedConfidence;
        if (gap > 15) {
          recommendations.push(
            `✅ System is underconfident at ${point.predictedConfidence}% confidence ` +
            `(actual accuracy: ${point.actualAccuracy.toFixed(1)}%). ` +
            `Can increase confidence for similar issues.`
          );
        }
      }
    }
    
    // Check sample size
    if (metrics.length < 20) {
      recommendations.push(
        `ℹ️  Sample size is small (${metrics.length}). ` +
        `Confidence in calibration will improve after 50+ completed issues.`
      );
    }
    
    return recommendations;
  }
}
```

**Acceptance Criteria:**
- [ ] Calibration runs on 10+ issues minimum
- [ ] Accuracy metrics match manual calculation
- [ ] Confidence calibration identifies over/under-confidence
- [ ] Recommendations actionable and specific
- [ ] Calibration report saved to `config/metrics/calibration/`
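For the "match manual calculation" criterion, a hand-checkable fixture helps. A minimal sketch that recomputes the per-score quantities of `AccuracyMetrics` from raw (predicted, actual) pairs; `ScorePair` and `summarizeErrors` are hypothetical test helpers, not part of `Calibrator`:

```typescript
/** One completed issue: the classifier's predicted score and the actual score. */
interface ScorePair {
  predicted: number;
  actual: number;
}

/** Same quantities as AccuracyMetrics (percentages on a 0-100 scale). */
export function summarizeErrors(pairs: ScorePair[]) {
  if (pairs.length === 0) {
    return { sampleSize: 0, avgError: 0, percentWithin1: 0, percentWithin2: 0 };
  }
  const errors = pairs.map(p => Math.abs(p.predicted - p.actual));
  // Percentage of errors that satisfy the given tolerance check
  const share = (ok: (e: number) => boolean) => (errors.filter(ok).length / errors.length) * 100;
  return {
    sampleSize: pairs.length,
    avgError: errors.reduce((a, b) => a + b, 0) / errors.length,
    percentWithin1: share(e => e <= 1),
    percentWithin2: share(e => e <= 2),
  };
}
```

For four issues predicted at 5 with actual scores 5, 6, 8 and 3, the errors are 0, 1, 3 and 2, so `avgError` is 1.5, `percentWithin1` is 50 and `percentWithin2` is 75; the score-5 row of the calibration report should show the same numbers.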

---

#### Issue 5.2: Automated Calibration Workflow

**Files to Create:**
- `scripts/commands/calibrate-classifier.ts`
- `.github/workflows/weekly-calibration.yml`

**Command Implementation:**
```typescript
#!/usr/bin/env node

import { Calibrator } from '../utils/calibration';
import { MetricsStorage } from '../utils/metrics-storage';

async function main() {
  console.log('🔄 Running classification calibration...\n');
  
  const calibrator = new Calibrator();
  
  try {
    // Run calibration on last 30 days
    const report = await calibrator.calibrate(30);
    
    // Display report
    console.log('📊 Calibration Report');
    console.log('═'.repeat(60));
    console.log(`Sample Size: ${report.sampleSize} issues`);
    console.log(`Overall Accuracy: ${report.overallAccuracy.toFixed(1)}%`);
    console.log();
    
    console.log('Accuracy by Score:');
    console.log('─'.repeat(60));
    console.log('Score | Samples | Avg Error | Within ±1 | Within ±2');
    console.log('─'.repeat(60));
    
    for (let score = 1; score <= 10; score++) {
      const acc = report.accuracyByScore[score as ComplexityScore];
      if (acc.sampleSize === 0) continue;
      
      console.log(
        `  ${score}   |   ${acc.sampleSize.toString().padStart(3)}   |   ` +
        `${acc.avgError.toFixed(2).padStart(4)}   |  ` +
        `${acc.percentWithin1.toFixed(1).padStart(5)}%  |  ` +
        `${acc.percentWithin2.toFixed(1).padStart(5)}%`
      );
    }
    
    console.log();
    console.log('Confidence Calibration:');
    console.log('─'.repeat(60));
    console.log('Predicted | Actual | Gap | Samples');
    console.log('─'.repeat(60));
    
    for (const point of report.confidenceCalibration.calibrationCurve) {
      const gap = point.predictedConfidence - point.actualAccuracy;
      const gapStr = gap > 0 ? `+${gap.toFixed(1)}` : gap.toFixed(1);
      
      console.log(
        `  ${point.predictedConfidence}%    |  ${point.actualAccuracy.toFixed(1)}%  | ` +
        `${gapStr.padStart(6)}% |   ${point.sampleSize}`
      );
    }
    
    if (report.recommendations.length > 0) {
      console.log();
      console.log('Recommendations:');
      console.log('─'.repeat(60));
      for (const rec of report.recommendations) {
        console.log(`${rec}\n`);
      }
    }
    
    // Save report
    const storage = new MetricsStorage();
    await storage.saveCalibrationReport(report);
    
    console.log('✅ Calibration complete\n');
    console.log(`Report saved to: config/metrics/calibration/${report.generatedAt.toISOString().split('T')[0]}.json`);
    
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error('❌ Calibration failed:', message);
    process.exit(1);
  }
}

main();
```

**GitHub Workflow:**
```yaml
name: Weekly Calibration

on:
  schedule:
    # Run every Monday at 9 AM UTC
    - cron: '0 9 * * 1'
  workflow_dispatch: # Allow manual triggering

permissions:
  contents: write # needed by the final step's git push

jobs:
  calibrate:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run calibration
        run: npm run calibrate
      
      - name: Commit calibration report
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add config/metrics/calibration/
          git commit -m "chore: weekly calibration report" || echo "No changes to commit"
          git push
```

**Acceptance Criteria:**
- [ ] Calibration command runs successfully
- [ ] Report displays in terminal clearly
- [ ] Report saved to JSON file
- [ ] GitHub workflow runs weekly automatically
- [ ] Workflow can be triggered manually
- [ ] Commit pushed back to repo

---

### Epic 6: AI Productivity Dashboard

**Goal:** Visual dashboard showing AI development metrics and trends.

#### Issue 6.1: Dashboard Data Aggregation

**Files to Create:**
- `scripts/utils/dashboard-data.ts`

**Implementation:**
```typescript
export interface DashboardData {
  summary: SummaryStats;
  aiVsHuman: AIVsHumanStats;
  velocity: VelocityStats;
  quality: QualityStats;
  classification: ClassificationAccuracyStats;
}

export interface SummaryStats {
  totalIssues: number;
  totalHours: number;
  totalTokens: number;
  avgHoursPerIssue: number;
  aiContributionPercent: number;
}

export interface AIVsHumanStats {
  issuesByImplementer: {
    ai: number;
    human: number;
    hybrid: number;
  };
  linesByImplementer: {
    ai: number;
    human: number;
    hybrid: number;
  };
  hoursByImplementer: {
    ai: number;
    human: number;
    hybrid: number;
  };
  issuesByRiskLevel: {
    low: { ai: number; human: number };
    medium: { ai: number; human: number };
    high: { ai: number; human: number };
  };
}

export interface VelocityStats {
  issuesPerWeek: Array<{
    week: string;
    ai: number;
    human: number;
  }>;
  avgHoursByScore: Array<{
    score: ComplexityScore;
    ai: number | null;
    human: number | null;
  }>;
  velocityTrend: 'improving' | 'stable' | 'declining';
}

export interface QualityStats {
  firstTimeSuccessRate: number;
  reworkRate: number;
  scrappedRate: number;
  bugRatePerKLOC: {
    ai: number;
    human: number;
  };
  testCoverage: {
    ai: number;
    human: number;
  };
}

export interface ClassificationAccuracyStats {
  overallAccuracy: number;
  accuracyTrend: Array<{
    date: string;
    accuracy: number;
  }>;
  predictionVariance: Array<{
    score: ComplexityScore;
    avgError: number;
  }>;
  topUnderestimates: string[];
  topOverestimates: string[];
}

export class DashboardAggregator {
  
  public async aggregateData(
    startDate: Date,
    endDate: Date
  ): Promise<DashboardData> {
    
    const metrics = await this.loadMetricsInRange(startDate, endDate);
    
    return {
      summary: this.calculateSummary(metrics),
      aiVsHuman: this.calculateAIVsHuman(metrics),
      velocity: this.calculateVelocity(metrics),
      quality: this.calculateQuality(metrics),
      classification: await this.calculateClassificationAccuracy(metrics)
    };
  }
  
  private calculateSummary(metrics: ImplementationMetrics[]): SummaryStats {
    
    const totalIssues = metrics.length;
    const totalHours = metrics.reduce((sum, m) => sum + m.actualHours, 0);
    const totalTokens = metrics.reduce((sum, m) => sum + m.tokensUsed, 0);
    // Empty date ranges yield 0 instead of NaN
    const avgHoursPerIssue = totalIssues > 0 ? totalHours / totalIssues : 0;
    
    const aiLines = metrics
      .filter(m => m.implementedBy === 'ai')
      .reduce((sum, m) => sum + m.linesChanged, 0);
    
    const totalLines = metrics.reduce((sum, m) => sum + m.linesChanged, 0);
    
    const aiContributionPercent = totalLines > 0 ? (aiLines / totalLines) * 100 : 0;
    
    return {
      totalIssues,
      totalHours,
      totalTokens,
      avgHoursPerIssue,
      aiContributionPercent
    };
  }
  
  private calculateVelocity(metrics: ImplementationMetrics[]): VelocityStats {
    
    // Group by week
    const byWeek = new Map<string, { ai: number; human: number }>();
    
    for (const m of metrics) {
      const week = this.getWeekString(m.startTime);
      if (!byWeek.has(week)) {
        byWeek.set(week, { ai: 0, human: 0 });
      }
      
      const stats = byWeek.get(week)!;
      // 'hybrid' is counted under human: VelocityStats only has ai/human buckets
      if (m.implementedBy === 'ai') {
        stats.ai++;
      } else {
        stats.human++;
      }
    }
    
    const issuesPerWeek = Array.from(byWeek.entries())
      .map(([week, stats]) => ({ week, ...stats }))
      .sort((a, b) => a.week.localeCompare(b.week));
    
    // Calculate avg hours by score
    const byScore = new Map<ComplexityScore, { ai: number[]; human: number[] }>();
    
    for (const m of metrics) {
      const score = m.predictedScore as ComplexityScore;
      if (!byScore.has(score)) {
        byScore.set(score, { ai: [], human: [] });
      }
      
      const hours = byScore.get(score)!;
      if (m.implementedBy === 'ai') {
        hours.ai.push(m.actualHours);
      } else {
        hours.human.push(m.actualHours);
      }
    }
    
    const avgHoursByScore = Array.from(byScore.entries())
      .map(([score, hours]) => ({
        score,
        ai: hours.ai.length > 0 
          ? hours.ai.reduce((a, b) => a + b, 0) / hours.ai.length 
          : null,
        human: hours.human.length > 0 
          ? hours.human.reduce((a, b) => a + b, 0) / hours.human.length 
          : null
      }))
      .sort((a, b) => a.score - b.score);
    
    // Determine velocity trend
    const velocityTrend = this.determineVelocityTrend(issuesPerWeek);
    
    return {
      issuesPerWeek,
      avgHoursByScore,
      velocityTrend
    };
  }
  
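  /**
   * ISO-8601 week label such as "2025-W02". Uses the ISO week-year, which differs
   * from the calendar year near 1 January (2024-12-30 falls in 2025-W01).
   */
  private getWeekString(date: Date): string {
    const { year, week } = this.getIsoWeek(date);
    return `${year}-W${week.toString().padStart(2, '0')}`;
  }
  
  private getIsoWeek(date: Date): { year: number; week: number } {
    // Shift to the Thursday of this week; that Thursday's year is the ISO week-year
    const d = new Date(Date.UTC(date.getFullYear(), date.getMonth(), date.getDate()));
    const dayNum = d.getUTCDay() || 7;
    d.setUTCDate(d.getUTCDate() + 4 - dayNum);
    const yearStart = new Date(Date.UTC(d.getUTCFullYear(), 0, 1));
    const week = Math.ceil((((d.getTime() - yearStart.getTime()) / 86400000) + 1) / 7);
    return { year: d.getUTCFullYear(), week };
  }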
  
  private determineVelocityTrend(
    issuesPerWeek: Array<{ week: string; ai: number; human: number }>
  ): 'improving' | 'stable' | 'declining' {
    
    if (issuesPerWeek.length < 4) return 'stable';
    
    // Compare the last 2 recorded weeks with the 2 before them
    // (weeks with no issues never enter the map, so they are skipped)
    const recent = issuesPerWeek.slice(-2);
    const previous = issuesPerWeek.slice(-4, -2);
    
    const recentTotal = recent.reduce((sum, w) => sum + w.ai + w.human, 0);
    const previousTotal = previous.reduce((sum, w) => sum + w.ai + w.human, 0);
    
    const change = (recentTotal - previousTotal) / previousTotal;
    
    if (change > 0.1) return 'improving';
    if (change < -0.1) return 'declining';
    return 'stable';
  }
}
```
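The week bucketing and trend rule are easiest to check in isolation. Below is a minimal, dependency-free sketch; `isoWeekKey` and `trendFromWeeklyTotals` are hypothetical helper names, not part of the codebase. It produces ISO-8601 week keys and applies the same ±10% comparison of the last two recorded weeks against the two before them.

```typescript
type Trend = 'improving' | 'stable' | 'declining';

/** ISO-8601 week key such as "2025-W03"; the year is taken from the week's Thursday. */
function isoWeekKey(date: Date): string {
  const d = new Date(Date.UTC(date.getFullYear(), date.getMonth(), date.getDate()));
  const dayNum = d.getUTCDay() || 7;          // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - dayNum);  // shift to Thursday of the same week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, '0')}`;
}

/** Compare the sum of the last two weekly totals with the two before them (±10% band). */
function trendFromWeeklyTotals(totals: number[]): Trend {
  if (totals.length < 4) return 'stable';
  const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
  const recent = sum(totals.slice(-2));
  const previous = sum(totals.slice(-4, -2));
  // Avoid dividing by zero if the series ever contains zero-filled weeks
  if (previous === 0) return recent > 0 ? 'improving' : 'stable';
  const change = (recent - previous) / previous;
  if (change > 0.1) return 'improving';
  if (change < -0.1) return 'declining';
  return 'stable';
}
```

One consequence of the map-based grouping: a week with no finished issues never appears in `issuesPerWeek`, so "the last two weeks" means the last two weeks that have data. If calendar weeks matter, zero-count weeks would have to be filled in before comparing.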

**Acceptance Criteria:**
- [ ] Aggregates data across date ranges
- [ ] Handles missing data gracefully
- [ ] Calculations verified against a manual spreadsheet
- [ ] Supports filtering by sprint, release, or custom date range
- [ ] Performance: <1 second for 1000 metrics
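
Range filtering can stay a pure function applied before aggregation, which also keeps it cheap for the 1000-metric budget. A minimal sketch, assuming each metric record exposes a `startTime` that may arrive as an ISO string after `JSON.parse`; the `MetricLike` and `DateRange` types and `filterByDateRange` are hypothetical names. A sprint or release filter would first resolve the sprint or release to a `DateRange`.

```typescript
interface MetricLike {
  issueNumber: number;
  startTime: Date | string; // JSON.parse leaves dates as ISO strings
}

interface DateRange {
  from: Date; // inclusive
  to: Date;   // exclusive, so back-to-back ranges never double-count a metric
}

/** Keep metrics whose startTime falls in [from, to); records with unparseable dates are dropped. */
function filterByDateRange<T extends MetricLike>(metrics: T[], range: DateRange): T[] {
  const from = range.from.getTime();
  const to = range.to.getTime();
  return metrics.filter(m => {
    const t = new Date(m.startTime).getTime(); // NaN for invalid dates
    return Number.isFinite(t) && t >= from && t < to;
  });
}
```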

---

### Epic 7: Automated Workflow Orchestration (`/implement-autopilot`)

**Goal:** Build the `/implement-autopilot` command, which uses classification scores to decide which issues in a release are implemented automatically, which require human approval, and when the run must stop.

#### Issue 7.1: Automation Rules Engine

**Files to Create:**
- `scripts/utils/automation-rules.ts`
- `config/schemas/workflow-config.schema.json`

**Automation Decision Logic:**
```typescript
export type AutomationMode = 'low' | 'medium';

export interface AutomationDecision {
  shouldAutoImplement: boolean;
  requiresHumanApproval: boolean;
  shouldStop: boolean;
  rationale: string;
  score: number;
  confidence: number;
}

export interface WorkflowConfig {
  automation: {
    min_confidence: number;
    
    low_risk: {
      auto_implement_max_score: number;
      batch_approvals: boolean;
    };
    
    medium_risk: {
      auto_implement_max_score: number;
      auto_proceed_if_ci_passes: boolean;
      min_test_coverage: number;
    };
    
    stop_conditions: {
      max_score: number;
      max_aggregate_score: number;
      breaking_changes: boolean;
      low_coverage: boolean;
    };
  };
}

export class AutomationRulesEngine {
  
  private config: WorkflowConfig;
  
  constructor(config: WorkflowConfig) {
    this.config = config;
  }
  
  /**
   * Decide if issue should be auto-implemented
   */
  public decideAutomation(
    issue: IssueAssessment,
    mode: AutomationMode
  ): AutomationDecision {
    
    // STOP CONDITIONS (override everything)
    if (issue.score >= this.config.automation.stop_conditions.max_score) {
      return {
        shouldAutoImplement: false,
        requiresHumanApproval: false,
        shouldStop: true,
        rationale: `Score ${issue.score}/10 meets or exceeds max_score threshold (${this.config.automation.stop_conditions.max_score})`,
        score: issue.score,
        confidence: issue.confidence
      };
    }
    
    if (issue.confidence < this.config.automation.min_confidence) {
      return {
        shouldAutoImplement: false,
        requiresHumanApproval: false,
        shouldStop: true,
        rationale: `Confidence ${issue.confidence}% below minimum (${this.config.automation.min_confidence}%)`,
        score: issue.score,
        confidence: issue.confidence
      };
    }
    
    // Check for breaking changes
    if (this.config.automation.stop_conditions.breaking_changes) {
      const hasBreaking = issue.riskFactors.some(r => 
        r.category.toLowerCase().includes('breaking')
      );
      
      if (hasBreaking) {
        return {
          shouldAutoImplement: false,
          requiresHumanApproval: false,
          shouldStop: true,
          rationale: 'Breaking changes detected - requires human review',
          score: issue.score,
          confidence: issue.confidence
        };
      }
    }
    
    // MODE-SPECIFIC RULES
    if (mode === 'low') {
      const maxScore = this.config.automation.low_risk.auto_implement_max_score;
      
      if (issue.score <= maxScore) {
        return {
          shouldAutoImplement: true,
          requiresHumanApproval: false,
          shouldStop: false,
          rationale: `Low mode: score ${issue.score} ≤ ${maxScore} - auto-implementing`,
          score: issue.score,
          confidence: issue.confidence
        };
      } else {
        return {
          shouldAutoImplement: false,
          requiresHumanApproval: true,
          shouldStop: false,
          rationale: `Low mode: score ${issue.score} > ${maxScore} - requires approval`,
          score: issue.score,
          confidence: issue.confidence
        };
      }
    }
    
    if (mode === 'medium') {
      const maxScore = this.config.automation.medium_risk.auto_implement_max_score;
      
      if (issue.score <= maxScore) {
        return {
          shouldAutoImplement: true,
          requiresHumanApproval: false,
          shouldStop: false,
          rationale: `Medium mode: score ${issue.score} ≤ ${maxScore} - auto-implementing`,
          score: issue.score,
          confidence: issue.confidence
        };
      } else {
        return {
          shouldAutoImplement: false,
          requiresHumanApproval: true,
          shouldStop: false,
          rationale: `Medium mode: score ${issue.score} > ${maxScore} - requires approval`,
          score: issue.score,
          confidence: issue.confidence
        };
      }
    }
    
    // Default: require approval
    return {
      shouldAutoImplement: false,
      requiresHumanApproval: true,
      shouldStop: false,
      rationale: 'Default: requires human approval',
      score: issue.score,
      confidence: issue.confidence
    };
  }
  
  /**
   * Calculate aggregate score for epic
   */
  public calculateAggregateScore(
    assessments: IssueAssessment[]
  ): number {
    
    // Self-weighted mean: sum(s^2) / sum(s), so higher scores carry more weight
    const totalWeight = assessments.reduce((sum, a) => sum + a.score, 0);
    if (totalWeight === 0) return 0; // empty epic: avoid 0/0 = NaN
    const weightedSum = assessments.reduce((sum, a) => 
      sum + (a.score * a.score), 0
    );
    
    return weightedSum / totalWeight;
  }
  
  /**
   * Check if epic should stop autopilot
   */
  public shouldStopEpic(
    assessments: IssueAssessment[]
  ): { shouldStop: boolean; reason?: string } {
    
    // Check aggregate score
    const aggregateScore = this.calculateAggregateScore(assessments);
    if (aggregateScore > this.config.automation.stop_conditions.max_aggregate_score) {
      return {
        shouldStop: true,
        reason: `Aggregate epic score ${aggregateScore.toFixed(1)} exceeds threshold ${this.config.automation.stop_conditions.max_aggregate_score}`
      };
    }
    
    // Check if any individual issue should stop
    for (const assessment of assessments) {
      if (assessment.score >= this.config.automation.stop_conditions.max_score) {
        return {
          shouldStop: true,
          reason: `Issue #${assessment.issueNumber} has score ${assessment.score} (too high for autopilot)`
        };
      }
    }
    
    return { shouldStop: false };
  }
}
```
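
`calculateAggregateScore` is a self-weighted mean, A = Σs² / Σs, so each issue's score is also its weight. For non-negative scores A always lies between the plain mean and the highest score. A standalone sketch (the function names are illustrative only, and it returns 0 for an empty epic) makes the effect visible: for scores 2, 2 and 8 the plain mean is 4.0, but A = (4 + 4 + 64) / 12 = 6.0, so a single hard issue pushes the epic much closer to the `max_aggregate_score` stop threshold.

```typescript
/** Self-weighted mean: sum(s^2) / sum(s). Returns 0 for an empty (or all-zero) epic. */
function aggregateScore(scores: number[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = scores.reduce((sum, s) => sum + s * s, 0);
  return weightedSum / totalWeight;
}

/** Plain arithmetic mean, for comparison. */
function meanScore(scores: number[]): number {
  return scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
}

console.log(meanScore([2, 2, 8]), aggregateScore([2, 2, 8])); // 4 6
```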

**Acceptance Criteria:**
- [ ] Decision logic matches PRD requirements exactly
- [ ] All stop conditions properly enforced
- [ ] Config validation with Zod schema
- [ ] Unit tests cover all decision branches
- [ ] Audit trail logged for every decision
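
For the audit-trail criterion, one low-friction option is to append one JSON object per decision to a JSON Lines file. The sketch below only builds the line; the record shape, the `AuditRecord` and `toAuditLine` names, and the choice of JSON Lines are assumptions, not something the PRD specifies.

```typescript
// Mirrors the AutomationDecision fields defined in Issue 7.1
interface DecisionLike {
  shouldAutoImplement: boolean;
  requiresHumanApproval: boolean;
  shouldStop: boolean;
  rationale: string;
  score: number;
  confidence: number;
}

interface AuditRecord extends DecisionLike {
  timestamp: string; // ISO-8601, UTC
  issueNumber: number;
  mode: 'low' | 'medium';
  outcome: 'auto' | 'approval' | 'stop';
}

/** Serialize one decision as a JSON Lines entry (the JSON object followed by a newline). */
function toAuditLine(
  issueNumber: number,
  mode: 'low' | 'medium',
  decision: DecisionLike,
  at: Date = new Date()
): string {
  // Stop wins over everything, mirroring decideAutomation's precedence
  const outcome: AuditRecord['outcome'] = decision.shouldStop
    ? 'stop'
    : decision.shouldAutoImplement ? 'auto' : 'approval';
  const record: AuditRecord = { timestamp: at.toISOString(), issueNumber, mode, outcome, ...decision };
  return JSON.stringify(record) + '\n';
}
```

Each line could then be written with Node's `fs/promises` `appendFile`, which keeps the log append-only and easy to grep by issue number.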

---

#### Issue 7.2: Autopilot Orchestration Logic

**Files to Create:**
- `scripts/commands/implement-autopilot.ts`

**Core Orchestration:**
```typescript
export async function implementAutopilot(options: AutopilotOptions): Promise<void> {
  
  const mode = options.mode || 'low';
  const dryRun = options.dryRun || false;
  const runStartTime = new Date(); // duration covers the whole run, not just Step 4
  
  console.log(`🤖 Starting autopilot in ${mode} mode${dryRun ? ' (DRY RUN)' : ''}...\n`);
  
  // Load config
  const config = await loadWorkflowConfig();
  const rulesEngine = new AutomationRulesEngine(config);
  
  // STEP 1: Detect & validate release file
  console.log('Step 1: Detecting release file...');
  const releaseFile = await detectReleaseFile(options.release);
  await validateReleaseFile(releaseFile);
  
  // Run analyze-repo for context
  if (!dryRun) {
    await runCommand('/analyze-repo');
  }
  
  // STEP 2: Create GitHub structure
  console.log('\nStep 2: Creating GitHub structure...');
  const createdStructure = dryRun 
    ? await simulateScopeRelease(releaseFile)
    : await runCommand(`/scope-release --file ${releaseFile} --no-batch-confirm`);
  
  const epicNumbers = createdStructure.epics.map(e => e.number);
  console.log(`✅ Created ${epicNumbers.length} epics`);
  
  // STEP 3: Analyze all epics (parallel)
  console.log('\nStep 3: Analyzing epics...');
  const epicAssessments = await Promise.all(
    epicNumbers.map(async epicNum => {
      const issues = await getEpicIssues(epicNum);
      const assessments = await Promise.all(
        issues.map(issue => classifyIssue(issue.number))
      );
      return { epicNum, assessments };
    })
  );
  
  // Check epic-level stop conditions (aggregate score, any issue at max_score)
  for (const epic of epicAssessments) {
    const stopCheck = rulesEngine.shouldStopEpic(epic.assessments);
    if (stopCheck.shouldStop) {
      console.log(`\n❌ AUTOPILOT STOPPED: ${stopCheck.reason}`);
      console.log('Please review and implement manually.');
      return;
    }
  }
  
  // Calculate automation decisions
  const allDecisions = epicAssessments.flatMap(epic =>
    epic.assessments.map(assessment =>
      rulesEngine.decideAutomation(assessment, mode)
    )
  );
  
  const autoCount = allDecisions.filter(d => d.shouldAutoImplement).length;
  const approvalCount = allDecisions.filter(d => d.requiresHumanApproval).length;
  const stoppedDecisions = allDecisions.filter(d => d.shouldStop);
  
  console.log(`\n📋 Automation Plan:`);
  console.log(`   Auto-implement: ${autoCount} issues`);
  console.log(`   Requires approval: ${approvalCount} issues`);
  console.log(`   Stopped: ${stoppedDecisions.length} issues`);
  
  // Issue-level stop conditions (low confidence, breaking changes) override
  // everything; /implement-epic would otherwise implement these issues anyway.
  if (stoppedDecisions.length > 0) {
    console.log('\n❌ AUTOPILOT STOPPED:');
    stoppedDecisions.forEach(d => console.log(`   - ${d.rationale}`));
    console.log('Please review and implement manually.');
    return;
  }
  
  // A dry run ends once the plan is shown: no prompt, no changes
  if (dryRun) {
    console.log('\n✅ Dry run complete - no changes made');
    return;
  }
  
  // Human checkpoint: always in low mode, or whenever any issue needs approval
  if (mode === 'low' || approvalCount > 0) {
    const proceed = await promptUser('\nProceed with implementation? [Y/n]');
    if (!proceed) {
      console.log('❌ Autopilot cancelled by user');
      return;
    }
  }
  
  // STEP 4: Implement epics
  console.log('\nStep 4: Implementing epics...');
  const metrics: AutopilotRunMetrics = {
    runId: generateRunId(),
    startTime: runStartTime,
    endTime: runStartTime, // set in Step 5
    mode,
    epicCount: epicNumbers.length,
    totalIssues: allDecisions.length,
    issuesByScore: calculateScoreDistribution(allDecisions),
    autoImplemented: autoCount,
    humanApproved: approvalCount,
    stopped: 0, // runs with stopped issues exit before this point
    estimatedManualMinutes: 0, // TODO: estimate from issue count
    actualMinutes: 0,
    timeSavedMinutes: 0,
    timeSavedPercent: 0,
    firstTimeSuccess: 0,
    reworkRequired: 0,
    scrapped: 0
  };
  
  for (const epic of epicAssessments) {
    try {
      await runCommand(`/implement-epic ${epic.epicNum} --auto-approve-tests --smart-sequence`);
    } catch (err) {
      // Name the failing epic and how to resume; earlier epics are not rolled back
      const message = err instanceof Error ? err.message : String(err);
      console.error(`\n❌ Epic #${epic.epicNum} failed: ${message}`);
      console.error(`Fix the failure, then re-run /implement-epic for epic #${epic.epicNum} and any later epics.`);
      throw err;
    }
  }
  
  // STEP 5: Report
  metrics.endTime = new Date();
  metrics.actualMinutes = (metrics.endTime.getTime() - metrics.startTime.getTime()) / 60000;
  
  await saveAutopilotMetrics(metrics);
  
  console.log('\n✅ Autopilot complete!');
  console.log(`   Duration: ${metrics.actualMinutes.toFixed(1)} minutes`);
  console.log(`   Epics implemented: ${epicNumbers.length}`);
}
```

**Acceptance Criteria:**
- [ ] Dry-run mode shows plan without executing
- [ ] Stop conditions enforced correctly
- [ ] Metrics captured for every autopilot run
- [ ] Works with both current and minor releases
- [ ] Error handling with clear recovery instructions
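
`timeSavedMinutes` and `timeSavedPercent` are left at 0 in the orchestration code. Assuming they mean "manual estimate minus actual autopilot wall-clock time", and that `estimatedManualMinutes` is filled in elsewhere (the spec only says it is based on issue count), the derivation is small. The `computeTimeSaved` name is hypothetical. Negative values (the run took longer than the estimate) are kept rather than clamped so they remain visible on the dashboard.

```typescript
interface TimeSaved {
  timeSavedMinutes: number;
  timeSavedPercent: number;
}

/**
 * Time saved relative to the manual estimate. A negative result means the run
 * took longer than the estimate; a zero estimate yields 0% rather than NaN.
 */
function computeTimeSaved(estimatedManualMinutes: number, actualMinutes: number): TimeSaved {
  const timeSavedMinutes = estimatedManualMinutes - actualMinutes;
  const timeSavedPercent = estimatedManualMinutes > 0
    ? (timeSavedMinutes / estimatedManualMinutes) * 100
    : 0;
  return { timeSavedMinutes, timeSavedPercent };
}
```

In Step 5 this would run after `actualMinutes` is set, for example `Object.assign(metrics, computeTimeSaved(metrics.estimatedManualMinutes, metrics.actualMinutes))`, before `saveAutopilotMetrics`.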

---

## Testing Strategy

### Unit Tests
- All utilities in `scripts/utils/` must have >80% coverage
- Test files colocated: `*.test.ts` next to `*.ts`
- Use Jest with TypeScript support

### Integration Tests
- Test full classification workflow (parse → score → store)
- Test metrics capture in implement commands
- Test dashboard generation end-to-end
- Mock GitHub API calls

### End-to-End Tests
- Run on 20+ real past issues
- Validate classification accuracy matches manual scoring
- Verify metrics files created correctly
- Confirm dashboard displays accurate data
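
The spec does not define "classification accuracy". One plausible definition, stated here as an assumption, is the share of issues whose predicted score lies within ±1 point of the manual score; exact-match accuracy is the special case `tolerance = 0`. The `ScoredIssue` type and `classificationAccuracy` function are hypothetical.

```typescript
interface ScoredIssue {
  issueNumber: number;
  predictedScore: number; // from the classification engine (1-10)
  manualScore: number;    // from a human reviewer (1-10)
}

/** Fraction (0-1) of issues whose prediction is within `tolerance` points of the manual score. */
function classificationAccuracy(issues: ScoredIssue[], tolerance = 1): number {
  if (issues.length === 0) return 0;
  const hits = issues.filter(
    i => Math.abs(i.predictedScore - i.manualScore) <= tolerance
  ).length;
  return hits / issues.length;
}
```

Under this definition, the MVP bar of >75% on 20 test issues means at least 16 of the 20 predictions must fall within tolerance.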

---

## Deployment Plan

### Phase 1: Foundation (Weeks 1-4)
- Epic 1: Classification utilities
- Epic 2: Enhanced repo analysis
- Deploy to dev, test on historical issues

### Phase 2: Integration (Weeks 5-6)
- Epic 3: AI self-assessment
- Epic 4: Metrics tracking
- Deploy to production, run on new issues

### Phase 3: Calibration (Weeks 7-8)
- Epic 5: Calibration loop
- Epic 6: Dashboard
- Weekly calibration automated

### Phase 4: Automation (Weeks 9-10)
- Epic 7: implement-autopilot
- Test in low mode for 2 weeks
- Enable medium mode after validation

---

## Rollback Plan

If the classification system causes problems:
1. Disable auto-classification in `/scope-release` (feature flag)
2. Continue using manual complexity assessment
3. Keep metrics capture running (data still valuable)
4. Fix classification algorithm offline
5. Re-enable after validation
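
Step 1 relies on a feature flag. A minimal sketch of the parsing side, assuming a hypothetical `ROADCREW_AUTO_CLASSIFY` environment variable (the name and accepted values are assumptions) that defaults to enabled, so rolling back only requires setting the flag:

```typescript
/** Interpret a feature-flag string; unset or unrecognised values fall back to the default. */
function isFlagEnabled(raw: string | undefined, defaultValue = true): boolean {
  if (raw === undefined) return defaultValue;
  const v = raw.trim().toLowerCase();
  if (['0', 'false', 'off', 'no'].includes(v)) return false;
  if (['1', 'true', 'on', 'yes'].includes(v)) return true;
  return defaultValue;
}

// Hypothetical usage inside /scope-release:
// const autoClassify = isFlagEnabled(process.env.ROADCREW_AUTO_CLASSIFY);
```

Falling back to the default on unrecognised values keeps a typo in the flag from silently changing behaviour in the opposite direction.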

---

## Success Criteria

### MVP Complete When:
- [ ] All 7 epics implemented and tested
- [ ] Classification accuracy >75% on 20+ test issues
- [ ] Metrics captured for 20+ implementations
- [ ] Dashboard generates successfully
- [ ] `/implement-autopilot` works in low mode
- [ ] Documentation complete
- [ ] Team trained on system

### Production Ready When:
- [ ] Classification accuracy sustained >80% for 4 weeks
- [ ] Calibration loop running weekly
- [ ] Dashboard used in sprint reviews
- [ ] `/implement-autopilot` proven on 5+ releases
- [ ] No major bugs or regressions

---

**Last Updated:** 2025-01-19  
**Version:** 1.0.0  
**Status:** Draft - Ready for Implementation