# Evals and release confidence

Pi Troupe evals should prove three things before public launch:

1. **Package health**: install, typecheck, tests, pack smoke test.
2. **Simulation contract**: prompts request structured JSON, the parser tolerates model output that drifts from that structure, and reports stay useful even when the JSON is malformed.
3. **Product usefulness**: templates produce credible reports with assumptions, disagreement, findings, limitations, and next steps.
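The simulation-contract point about tolerating drift and malformed JSON can be sketched as a parser that tries progressively looser extractions and always keeps the raw text. This is a hypothetical sketch, not Pi Troupe's actual parser: `parseModelJson`, `reportLine`, and `ParsedResponse` are illustrative names, and the code only recovers JSON syntax without validating it against any schema.

```typescript
// Hypothetical result type; Pi Troupe's real parser may differ.
interface ParsedResponse {
  ok: boolean;
  data?: unknown;
  raw: string; // always kept so a report can still quote the persona
}

// Matches a fenced code block (optionally tagged json) anywhere in the text.
const FENCE = /`{3}(?:json)?\s*([\s\S]*?)`{3}/i;

// Try strict JSON first, then a fenced code block, then the outermost {...} span.
function parseModelJson(text: string): ParsedResponse {
  const candidates: string[] = [text.trim()];

  const fenced = FENCE.exec(text);
  if (fenced) candidates.push((fenced[1] ?? "").trim());

  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));

  for (const candidate of candidates) {
    try {
      return { ok: true, data: JSON.parse(candidate), raw: text };
    } catch {
      // Not valid JSON; fall through to the next, looser candidate.
    }
  }
  return { ok: false, raw: text };
}

// A malformed response degrades to quoted text instead of dropping the persona.
function reportLine(persona: string, parsed: ParsedResponse): string {
  return parsed.ok
    ? `- ${persona}: structured response parsed`
    : `- ${persona} (unstructured): ${parsed.raw.slice(0, 200)}`;
}
```

Keeping `raw` on every result is what lets a report stay useful: a persona whose output cannot be parsed still contributes a quoted line rather than vanishing.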

## Current automated checks

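```bash
# Package health: runs the `check` script from package.json
npm run check
# Pack smoke test (the `pack:smoke` script)
npm run pack:smoke
# Smoke-import the entry point; loading .ts directly needs Node's built-in type stripping (or a TS loader)
node -e "import('./extensions/pi-troupe/index.ts').then(()=>console.log('extension import ok'))"
```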

## Manual smoke checklist

Start pi in a fresh terminal and run:

```text
/troupe templates
/troupe demo roast-my-readme
/troupe status
/troupe logs
/troupe share
```

Pass criteria:

- Demo starts without setup beyond `pi install`.
- The run produces a Markdown report.
- The report includes assumptions and limitations.
- Logs include `state.json`, `events.jsonl`, `prompt.md`, `report.md`, `manifest.json`, `responses.csv`, `responses.jsonl`.
- Full-export surveys include `survey-manifest.json` and `survey_batch_started` / `survey_batch_finished` events.
- The share card is short enough to paste into a GitHub issue or social post.
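Most of these pass criteria can be checked mechanically once a demo run has finished. The TypeScript sketch below assumes the log directory's file names, `report.md`, `events.jsonl`, and the share-card text have already been read into memory. `checkSmokeRun`, `eventTypes`, `RunArtifacts`, the `type` field on each event line, and the 500-character share-card limit are all hypothetical, since the checklist does not define them; the report check is a plain keyword match, not a structural one.

```typescript
// File names copied from the pass criteria above.
const REQUIRED_RUN_FILES = [
  "state.json",
  "events.jsonl",
  "prompt.md",
  "report.md",
  "manifest.json",
  "responses.csv",
  "responses.jsonl",
];

// Assumed budget: the checklist only says "short enough to paste".
const SHARE_CARD_MAX_CHARS = 500;

interface RunArtifacts {
  files: string[]; // file names found in the run's log directory
  report: string; // contents of report.md
  shareCard: string; // text produced by /troupe share
  eventsJsonl: string; // raw contents of events.jsonl
  fullExportSurvey: boolean;
}

// Assumes each events.jsonl line is a JSON object with a `type` field.
function eventTypes(jsonl: string): string[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => String((JSON.parse(line) as { type?: unknown }).type));
}

// Returns human-readable failures; an empty list means the run passes.
function checkSmokeRun(run: RunArtifacts): string[] {
  const failures: string[] = [];
  const present = new Set(run.files);
  const required = run.fullExportSurvey
    ? [...REQUIRED_RUN_FILES, "survey-manifest.json"]
    : REQUIRED_RUN_FILES;
  for (const file of required) {
    if (!present.has(file)) failures.push(`missing ${file}`);
  }

  // Keyword heuristic: the report must at least mention both topics.
  const report = run.report.toLowerCase();
  for (const keyword of ["assumption", "limitation"]) {
    if (!report.includes(keyword)) failures.push(`report never mentions ${keyword}s`);
  }

  if (run.fullExportSurvey) {
    const types = eventTypes(run.eventsJsonl);
    for (const type of ["survey_batch_started", "survey_batch_finished"]) {
      if (!types.includes(type)) failures.push(`missing ${type} event`);
    }
  }

  if (run.shareCard.length > SHARE_CARD_MAX_CHARS) {
    failures.push(`share card is ${run.shareCard.length} chars (limit ${SHARE_CARD_MAX_CHARS})`);
  }
  return failures;
}
```

Returning a list of failures rather than a boolean makes a failed smoke run easy to triage in CI output.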

## Future eval ideas

- Golden report snapshots for each built-in template.
- Persona adherence rubric: each persona shows distinct goals/constraints.
- Disagreement rubric: focus group includes at least one meaningful tension.
- Survey schema rubric: survey output includes `sampleSize`, aggregates, and a segment breakdown.
- Cost ceiling smoke: demo completes under a documented token budget.
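Two of these ideas, the survey schema rubric and the cost ceiling smoke, reduce to simple predicates. The sketch below is hypothetical: only `sampleSize` comes from the rubric, while `aggregates`, `segments`, `inputTokens`, `outputTokens`, `surveySchemaIssues`, and `withinTokenBudget` are assumed names, and the real survey output and usage accounting may be shaped differently.

```typescript
// Hypothetical survey output shape; only `sampleSize` is named in the rubric.
interface SurveyOutput {
  sampleSize?: number;
  aggregates?: Record<string, number>; // e.g. share of respondents per answer
  segments?: Record<string, Record<string, number>>; // per-segment aggregates
}

// Returns rubric violations; an empty list means the survey output passes.
function surveySchemaIssues(survey: SurveyOutput): string[] {
  const issues: string[] = [];
  const n = survey.sampleSize;
  if (n === undefined || !Number.isInteger(n) || n <= 0) {
    issues.push("sampleSize must be a positive integer");
  }
  if (!survey.aggregates || Object.keys(survey.aggregates).length === 0) {
    issues.push("aggregates are empty");
  }
  if (!survey.segments || Object.keys(survey.segments).length === 0) {
    issues.push("segment breakdown is missing");
  }
  return issues;
}

// Hypothetical per-call usage record for the cost ceiling smoke.
interface UsageRecord {
  inputTokens: number;
  outputTokens: number;
}

// True when the run's total tokens stay at or under the documented budget.
function withinTokenBudget(usage: UsageRecord[], budget: number): boolean {
  const total = usage.reduce((sum, u) => sum + u.inputTokens + u.outputTokens, 0);
  return total <= budget;
}
```

Usage might look like `surveySchemaIssues({ sampleSize: 50, aggregates: { wouldUse: 0.62 }, segments: { devs: { wouldUse: 0.7 } } })`, which returns an empty list.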
