---
name: experiment-loop
description: "Autonomous experiment loop: hypothesize > modify > test > evaluate > keep/discard > repeat. Run N experiments automatically with measurable metrics. Works for performance optimization, A/B testing, prompt engineering, and any measurable improvement task."
---

# Experiment Loop

Autonomous, iterative improvement inspired by Karpathy's autoresearch methodology. Define a metric, set a target, and let the loop run until the target is met or the iteration limit is reached.

## The 5-Step Loop

```
1. HYPOTHESIZE  -> Form a specific, falsifiable improvement hypothesis
2. MODIFY       -> Apply the minimal code/config/prompt change
3. TEST         -> Run the measurement suite (benchmarks, tests, evals)
4. EVALUATE     -> Compare result against baseline and previous best
5. DECIDE       -> KEEP if better, DISCARD (git stash pop --index) if worse
      |
   Repeat until target met OR max_iterations reached
```

Each iteration is atomic: one hypothesis, one change, one measurement, one decision.
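
The EVALUATE and DECIDE steps come down to one comparison against the best value so far. The following is a minimal Python sketch, not the skill's actual implementation: `is_better` and `decide` are hypothetical names, and it assumes a tie counts as DISCARD because a change is kept only "if better".

```python
from typing import Optional


def is_better(result: float, best: float, direction: str) -> bool:
    """True when `result` strictly improves on `best` for the given direction."""
    if direction == "minimize":
        return result < best
    if direction == "maximize":
        return result > best
    raise ValueError(f"direction must be 'minimize' or 'maximize', got {direction!r}")


def decide(result: Optional[float], best: float, direction: str) -> str:
    """Return 'KEEP' or 'DISCARD' for one atomic iteration.

    A `result` of None stands for a measurement that produced no value
    (assumption: the caller maps failures to None); it is never kept.
    """
    if result is None:
        return "DISCARD"
    return "KEEP" if is_better(result, best, direction) else "DISCARD"
```

For example, `decide(280, 340, "minimize")` yields `"KEEP"`, while `decide(215, 210, "minimize")` yields `"DISCARD"`.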

## Experiment Definition

Define an experiment in your task or in `thoughts/EXPERIMENTS.md`:

```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize          # minimize | maximize
  max_iterations: 10           # hard cap, never exceed
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"       # JSON key from bench output
  scope: "src/api/"            # files the loop is allowed to touch
```

### Key Fields

| Field | Description |
|-------|-------------|
| `metric` | Human-readable name of what you are measuring |
| `baseline` | Measured value before any changes (run this first) |
| `target` | Success condition -- loop exits when this is met |
| `direction` | `minimize` for latency/size, `maximize` for coverage/score |
| `max_iterations` | Safety cap, default 10, absolute maximum 10 |
| `measurement_cmd` | Shell command that produces JSON with the metric value |
| `measurement_key` | JSON key that holds the metric value in the command's output |
| `scope` | Directories/files the loop is allowed to modify |
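
To make the field constraints concrete, here is a hedged sketch of the definition as a validated Python dataclass. The `Experiment` class and its `target_met` method are illustrative assumptions, not something the skill ships. The sketch also assumes that reaching the target exactly counts as met.

```python
from dataclasses import dataclass

MAX_ITERATIONS_CAP = 10  # absolute maximum from the table above


@dataclass(frozen=True)
class Experiment:
    name: str
    metric: str
    baseline: float
    target: float
    direction: str            # "minimize" | "maximize"
    measurement_cmd: str
    measurement_key: str
    scope: str
    max_iterations: int = 10  # default from the table above

    def __post_init__(self) -> None:
        # Reject definitions the loop could not run safely.
        if self.direction not in ("minimize", "maximize"):
            raise ValueError(f"invalid direction: {self.direction!r}")
        if not 1 <= self.max_iterations <= MAX_ITERATIONS_CAP:
            raise ValueError(
                f"max_iterations must be between 1 and {MAX_ITERATIONS_CAP}"
            )
        if not self.scope:
            raise ValueError("scope must be specified")

    def target_met(self, value: float) -> bool:
        """True once `value` reaches or passes the target (assumed inclusive)."""
        if self.direction == "minimize":
            return value <= self.target
        return value >= self.target
```

Validating at construction time means a bad `direction` or an oversized `max_iterations` fails before the first iteration rather than partway through a run.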

## Safety Protocol

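Wrap every experiment iteration in a stash checkpoint:

```bash
# Snapshot current state (staged, unstaged, untracked), then put it back
# so the experiment runs on top of previously kept changes.
before=$(git rev-parse -q --verify refs/stash || true)
git stash push -u -m "experiment-loop: iteration N baseline"
after=$(git rev-parse -q --verify refs/stash || true)
stashed=false
if [ "$before" != "$after" ]; then stashed=true; git stash apply --index; fi

# Run experiment
# ... apply hypothesis change ...
# ... run measurement, set result_is_better=true on improvement ...

# Decision
if [ "${result_is_better:-false}" = true ]; then
    if $stashed; then git stash drop; fi          # keep changes, discard snapshot
else
    git reset --hard && git clean -fd             # remove the experimental change
    if $stashed; then git stash pop --index; fi   # restore exactly: staged + unstaged
fi
```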

Never skip the stash. Never accumulate multiple iterations without a decision checkpoint. If the measurement command fails or times out, treat it as DISCARD.
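
The "fails or times out means DISCARD" rule is easiest to enforce when the measurement step cannot raise. The sketch below is one way to do that in Python and rests on several assumptions: `measure` is a hypothetical helper, the command is assumed to print its JSON on stdout, `measurement_key` is assumed to allow a dot-separated path for nested values, and the 300-second default timeout is arbitrary.

```python
import json
import subprocess
from typing import Optional


def measure(cmd: str, key: str, timeout_s: float = 300.0) -> Optional[float]:
    """Run `cmd` in a shell and read a numeric metric from its JSON stdout.

    `key` may be a dot-separated path into nested objects (assumption).
    Returns None on any failure so the caller can DISCARD the iteration.
    """
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True,
            timeout=timeout_s, check=True,  # non-zero exit raises CalledProcessError
        )
        value = json.loads(proc.stdout)
        for part in key.split("."):
            value = value[part]
        return float(value)
    except (subprocess.SubprocessError, json.JSONDecodeError,
            KeyError, TypeError, ValueError):
        # Timeout, non-zero exit, invalid JSON, missing key, or non-numeric value.
        return None
```

Because every failure mode collapses to `None`, the decision step only has to distinguish "no value" from "a value that is better or worse".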

## Agent Integration

The experiment loop coordinates vibecosystem agents across three phases:

| Phase | Agent | Role |
|-------|-------|------|
| Hypothesize | `profiler` | Identify bottlenecks, suggest what to change |
| Modify | `spark` | Apply the focused code change |
| Test + Evaluate | `verifier` / `tdd-guide` | Run benchmarks, tests, evals and parse results |

Spawn `profiler` once at the start to build the initial hypothesis queue, then run `spark` and `verifier` in a tight loop, one pass per iteration. Re-run `profiler` only when the loop pauses after 3 consecutive DISCARDs.

## Example Experiments

### Bundle Size Reduction

```yaml
experiment:
  name: "optimize-bundle-size"
  metric: "gzipped bundle size (KB)"
  baseline: 420
  target: 300
  direction: minimize
  max_iterations: 10
  measurement_cmd: "npm run build && node scripts/measure-bundle.js"
  measurement_key: "gzipped_kb"
  scope: "src/"
```

Hypothesis queue to try in order:
1. Add tree-shaking for unused lodash imports (use named imports)
2. Replace `moment` with `date-fns` (smaller footprint)
3. Move large dependencies to dynamic `import()` at route boundaries
4. Enable `optimization.usedExports: true` in the webpack config (Rollup tree-shakes by default)
5. Replace `axios` with native `fetch` wrapper

### API Latency

```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize
  max_iterations: 8
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"
  scope: "src/api/"
```

Hypothesis queue:
1. Add Redis cache for repeated DB reads (TTL 60s)
2. Replace N+1 queries with single JOIN query
3. Add connection pool sizing (`max: 20`)
4. Run independent async validations in parallel (`Promise.all`) instead of sequentially
5. Add response compression (gzip middleware)

### Test Coverage

```yaml
experiment:
  name: "improve-test-coverage"
  metric: "line coverage (%)"
  baseline: 64
  target: 80
  direction: maximize
  max_iterations: 10
  measurement_cmd: "npm test -- --coverage --coverageReporters=json-summary && cat coverage/coverage-summary.json"  # assumes Jest
  measurement_key: "total.lines.pct"
  scope: "src/"
```

### Prompt Engineering (LLM Eval)

```yaml
experiment:
  name: "improve-extraction-accuracy"
  metric: "extraction F1 score"
  baseline: 0.71
  target: 0.85
  direction: maximize
  max_iterations: 10
  measurement_cmd: "python eval/run_evals.py --output eval/results.json"
  measurement_key: "f1"
  scope: "prompts/"
```

## Results Log Format

Append each iteration result to `thoughts/EXPERIMENTS.md`:

```markdown
## Experiment: reduce-api-latency
Started: 2026-04-07T10:00:00Z
Baseline: 340ms | Target: 200ms | Direction: minimize

### Iteration 1
- Hypothesis: Add Redis cache for repeated DB reads
- Change: `src/api/users.ts` lines 45-67 -- wrap DB call with cache layer
- Result: 280ms (improvement: -60ms, -17.6%)
- Decision: KEEP
- Cumulative best: 280ms

### Iteration 2
- Hypothesis: Replace N+1 queries with JOIN
- Change: `src/api/users.ts` lines 89-102 -- rewrite fetchWithPosts()
- Result: 210ms (improvement: -70ms, -25%)
- Decision: KEEP
- Cumulative best: 210ms

### Iteration 3
- Hypothesis: Add connection pool sizing max:20
- Change: `src/api/pool.ts` line 12 -- max: 10 -> 20
- Result: 215ms (regression: +5ms)
- Decision: DISCARD (restored via git stash pop --index)
- Cumulative best: 210ms

### Final Result
- Target: 200ms | Achieved: 210ms | Status: NEAR_MISS (within 5%)
- Iterations: 3 of 10 used
- Total improvement: -38% from baseline
```
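
The per-iteration arithmetic in the log above (delta and percentage relative to the previous cumulative best) can be generated rather than typed by hand. This is a small sketch under stated assumptions: `format_iteration` is a hypothetical helper, the `unit` parameter is illustrative, the restore note on DISCARD lines is omitted, and `previous_best` is assumed to be non-zero.

```python
def format_iteration(n: int, hypothesis: str, change: str, result: float,
                     previous_best: float, direction: str, unit: str = "ms") -> str:
    """Render one iteration entry in the log layout shown above."""
    delta = result - previous_best
    improved = delta < 0 if direction == "minimize" else delta > 0
    if improved:
        # Percentage is relative to the previous cumulative best.
        pct = round(delta / previous_best * 100, 1)
        detail = f"improvement: {delta:+g}{unit}, {pct:+g}%"
        decision, best = "KEEP", result
    else:
        label = "regression" if delta != 0 else "no change"
        detail = f"{label}: {delta:+g}{unit}"
        decision, best = "DISCARD", previous_best
    return "\n".join([
        f"### Iteration {n}",
        f"- Hypothesis: {hypothesis}",
        f"- Change: {change}",
        f"- Result: {result:g}{unit} ({detail})",
        f"- Decision: {decision}",
        f"- Cumulative best: {best:g}{unit}",
    ])
```

Calling it with `result=280` and `previous_best=340` reproduces the `-60ms, -17.6%` figures from Iteration 1.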

## Iteration Limits and Exit Conditions

| Condition | Action |
|-----------|--------|
| Target met | EXIT -- log SUCCESS, keep all accumulated changes |
| max_iterations reached | EXIT -- log PARTIAL, keep best achieved state |
| 3 consecutive DISCARDs | PAUSE -- re-run profiler for new hypothesis queue |
| Measurement command fails | DISCARD current iteration, continue loop |
| Git stash fails | STOP -- do not continue, report error |
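
The table above can be read as a pure function from loop state to an action. The sketch below is illustrative only: `next_action`, its parameters, and the returned labels are hypothetical, and the order in which conditions are checked (stash failure, target, iteration cap, discard streak) is an assumption. A failed measurement is expected to reach this function as one more DISCARD in `consecutive_discards`.

```python
def next_action(best: float, target: float, direction: str,
                iterations_used: int, max_iterations: int,
                consecutive_discards: int, stash_ok: bool = True) -> str:
    """Map the current loop state to an action from the exit-condition table."""
    if not stash_ok:
        return "STOP"            # git stash failed: do not continue
    met = best <= target if direction == "minimize" else best >= target
    if met:
        return "EXIT_SUCCESS"    # target met: keep all accumulated changes
    if iterations_used >= max_iterations:
        return "EXIT_PARTIAL"    # cap reached: keep best achieved state
    if consecutive_discards >= 3:
        return "PAUSE"           # re-run profiler for a new hypothesis queue
    return "CONTINUE"
```

Keeping this logic free of side effects makes each row of the table checkable on its own, for example `next_action(210, 200, "minimize", 5, 10, 3)` gives `"PAUSE"`.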

## Running the Loop

Invoke this skill by describing the experiment:

```
Use experiment-loop to reduce the API p95 latency from 340ms to under 200ms.
Baseline measurement: npm run bench:api
Max iterations: 8
Scope: src/api/
```

The loop will:
1. Read any existing `thoughts/EXPERIMENTS.md` for prior runs on the same metric
2. Ask `profiler` for an ordered hypothesis queue
3. Execute iterations with safety stashing
4. Log each result immediately after measurement
5. Report final state with all changes that were kept

## Hard Limits

- Maximum 10 experiments per invocation (no exceptions)
- Scope must be specified -- loop will not touch files outside scope
- Measurement command must be deterministic (no unbounded network calls)
- Total wall-clock time cap: 30 minutes (prevents runaway loops)
- Never auto-merge to main -- changes stay on current branch
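
Two of these limits, the scope restriction and the wall-clock cap, are simple to check mechanically. The following is a hedged sketch in Python: `in_scope` and `time_remaining` are hypothetical helpers, and paths are assumed to be repo-relative with forward slashes. Comparing path components instead of string prefixes avoids treating `src/apiary/` as part of `src/api/`, and normalizing first catches `..` segments that escape the scope.

```python
import posixpath
import time
from pathlib import PurePosixPath
from typing import Optional

WALL_CLOCK_CAP_S = 30 * 60  # 30-minute cap from the list above


def in_scope(path: str, scope: str) -> bool:
    """True when repo-relative `path` is the scope directory or lies inside it."""
    normalized = PurePosixPath(posixpath.normpath(path))  # collapses ".." segments
    root = PurePosixPath(posixpath.normpath(scope))
    return normalized == root or root in normalized.parents


def time_remaining(started_at: float, now: Optional[float] = None) -> float:
    """Seconds left before the wall-clock cap; 0.0 once the cap is exceeded."""
    now = time.monotonic() if now is None else now
    return max(0.0, WALL_CLOCK_CAP_S - (now - started_at))
```

A loop could call `in_scope` on every file a change touches before measuring, and stop starting new iterations once `time_remaining` returns `0.0`.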