# Model/tool metrics UI proposal

## Goal

Add model-attributed tool metrics without turning `/tool-stats` into a wall of data.
The report should answer:

1. Which models call which tools?
2. Which tool calls succeed and which fail?
3. Where are the likely weak spots, without pretending tool success equals task success?

## Data semantics

The recommended log shape stays backward-compatible with existing records:

```jsonc
// Existing call record, enriched going forward.
{"ts": 1790700000000, "kind": "tool", "name": "edit", "provider": "anthropic", "model": "claude-sonnet-4-5", "modelName": "Claude Sonnet 4.5", "thinking": "high", "toolCallId": "call_abc", "session": "..."}

// New outcome record appended after execution.
{"ts": 1790700001250, "kind": "tool-result", "name": "edit", "provider": "anthropic", "model": "claude-sonnet-4-5", "thinking": "high", "toolCallId": "call_abc", "success": false, "durationMs": 1250, "session": "..."}
```
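
As a rough sketch, the two record shapes could be typed as a discriminated union on `kind`, with the attribution fields left optional so legacy records still parse. The type names, `parseRecord`, and `modelKey` are hypothetical; only the field names come from the sample records above.

```typescript
// Hypothetical type names; field names follow the sample records.
type Attribution = {
  provider?: string; // absent on legacy records
  model?: string;
  modelName?: string;
  thinking?: string;
  toolCallId?: string;
};

type ToolCallRecord = Attribution & {
  ts: number;
  kind: "tool";
  name: string;
  session?: string;
};

type ToolResultRecord = Attribution & {
  ts: number;
  kind: "tool-result";
  name: string;
  success: boolean;
  durationMs?: number;
  session?: string;
};

type MetricRecord = ToolCallRecord | ToolResultRecord;

// Parse one log line; returns null for anything that is not a tool record.
function parseRecord(line: string): MetricRecord | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null; // skip corrupt lines instead of failing the whole report
  }
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.ts !== "number" || typeof r.name !== "string") return null;
  if (r.kind === "tool") return r as unknown as ToolCallRecord;
  if (r.kind === "tool-result" && typeof r.success === "boolean") {
    return r as unknown as ToolResultRecord;
  }
  return null;
}

// Grouping label; records without attribution share one "unknown" bucket.
function modelKey(r: Attribution): string {
  return r.provider && r.model ? `${r.provider}/${r.model}` : "unknown";
}
```

Keeping attribution optional means an old `{"kind": "tool", "name": "read"}` line still counts as an attempt and simply lands in the `unknown` model bucket.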

Implementation notes:

- Record attempts on the `tool_call` event, taking attribution from `ctx.model` and `pi.getThinkingLevel()`.
- Record outcomes on the `tool_result` event, with `success` set to `!event.isError`.
- Correlate with an in-memory `Map<toolCallId, attribution>` so result records keep the same model even in parallel tool mode.
- Cap the pending map and evict old entries, because blocked or cancelled calls may never emit a result.
- Do not record tool arguments, so the current privacy posture stays intact.
- In reports, compute `ok%` as `ok / (ok + fail)`. Show `unknown` separately for legacy, blocked, or missing-result attempts.
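
The correlation and eviction notes above could be sketched roughly as follows. `PendingCalls`, `CallAttribution`, the cap of 256, and the oldest-first eviction policy are all assumptions for illustration, not part of the proposal. Wiring into the real `tool_call` and `tool_result` handlers is left out because their exact event shapes are not specified here.

```typescript
// Hypothetical helper: remembers which model issued each pending tool call.
type CallAttribution = { provider: string; model: string; thinking: string };
type PendingEntry = CallAttribution & { startedAt: number };

class PendingCalls {
  private readonly pending = new Map<string, PendingEntry>();
  private readonly maxPending: number;

  constructor(maxPending = 256) {
    this.maxPending = maxPending; // assumed cap; tune to expected parallelism
  }

  // Call when an attempt is seen (the tool_call side).
  start(toolCallId: string, attribution: CallAttribution, now: number): void {
    this.pending.delete(toolCallId); // re-insert so a reused id becomes newest
    this.pending.set(toolCallId, { ...attribution, startedAt: now });
    // Map iterates in insertion order, so the first key is the oldest entry.
    while (this.pending.size > this.maxPending) {
      const oldest = this.pending.keys().next().value;
      if (oldest === undefined) break;
      this.pending.delete(oldest);
    }
  }

  // Call when an outcome is seen (the tool_result side). Returns undefined if
  // the attempt was evicted or never recorded; the caller can then skip the
  // outcome record and let the report count that attempt as unknown.
  finish(toolCallId: string, now: number) {
    const entry = this.pending.get(toolCallId);
    if (!entry) return undefined;
    this.pending.delete(toolCallId);
    const { startedAt, ...attribution } = entry;
    return { ...attribution, durationMs: now - startedAt };
  }

  get size(): number {
    return this.pending.size;
  }
}
```

Because attribution is captured at call time and looked up by `toolCallId`, a result that arrives after the user switches models is still credited to the model that issued the call.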

## Recommended presentation: brief default + detail subview

### Layer 1: keep `/tool-stats [days]` brief

Add one compact model/tool snapshot near the top of the existing report:

```text
Usage over last 30d  (log: 2,913 events total, 721 in window)

--- MODEL / TOOL SNAPSHOT ---
  model                         calls  ok  fail  unk  ok%  top tools             weak spot
  anthropic/claude-sonnet-4-5     312  286    11   15  96%  read, bash, edit      edit 6 fail
  openai/gpt-5-codex              128  110     9    9  92%  read, bash, write     bash 5 fail
  google/gemini-3-pro              64   55     6    3  90%  read, web_search      edit 3 fail

--- SKILLS ---
  name      used  last
  linear      12  2026-05-29
  grind        4  2026-05-20

...existing sections unchanged...
```

Why this works:

- The default report stays at roughly one screen and still serves the prune-candidate use case.
- Users immediately see whether model/tool behavior looks healthy.
- The `weak spot` column turns the data into an action prompt without overexplaining.
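
One way the numeric columns of the snapshot could be derived from the log is sketched below, under the assumption that attempts and outcomes are joined on `toolCallId`. `snapshotRows` and its types are hypothetical names, and the `top tools` and `weak spot` columns are left out to keep the sketch short.

```typescript
// Hypothetical minimal record shapes; only the fields needed for counting.
type CallRec = { kind: "tool"; name: string; provider?: string; model?: string; toolCallId?: string };
type ResultRec = { kind: "tool-result"; toolCallId?: string; success: boolean };
type SnapshotRow = {
  model: string;
  calls: number;
  ok: number;
  fail: number;
  unknown: number;
  okPct: number | null; // null when no attempt has a recorded outcome
};

function snapshotRows(records: Array<CallRec | ResultRec>): SnapshotRow[] {
  // Index outcomes by toolCallId so each attempt can look up its result.
  const outcomes = new Map<string, boolean>();
  for (const r of records) {
    if (r.kind === "tool-result" && r.toolCallId) outcomes.set(r.toolCallId, r.success);
  }

  const rows = new Map<string, SnapshotRow>();
  for (const r of records) {
    if (r.kind !== "tool") continue;
    const model = r.provider && r.model ? `${r.provider}/${r.model}` : "unknown";
    let row = rows.get(model);
    if (!row) {
      row = { model, calls: 0, ok: 0, fail: 0, unknown: 0, okPct: null };
      rows.set(model, row);
    }
    row.calls += 1;
    // Legacy, blocked, or missing-result attempts have no outcome to look up.
    const success = r.toolCallId ? outcomes.get(r.toolCallId) : undefined;
    if (success === true) row.ok += 1;
    else if (success === false) row.fail += 1;
    else row.unknown += 1;
  }

  const result = Array.from(rows.values());
  for (const row of result) {
    const resolved = row.ok + row.fail; // unknown is excluded from ok%
    row.okPct = resolved > 0 ? Math.round((100 * row.ok) / resolved) : null;
  }
  return result.sort((a, b) => b.calls - a.calls);
}
```

For the first sample row, 286 ok and 11 fail give 286 / 297, which rounds to 96%; the 15 unknown attempts count toward `calls` but not toward `ok%`.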

### Layer 2: add a detail view

Add `/tool-stats model-tools [days]` as the canonical detail view, with `/tool-stats mt [days]` kept as a compatibility alias. This view can be longer because the user asked for model/tool details explicitly.

```text
Model Tool Effectiveness (30d)
Attempts: 504  Results: 477  OK: 451  Fail: 26  Unknown: 27
Note: ok% is tool-result success, not proof of task success.

--- BY MODEL ---
  model                         calls  ok  fail  unk  ok%  tools  top tools
  anthropic/claude-sonnet-4-5     312  286    11   15  96%     12  read 118, bash 91, edit 43
  openai/gpt-5-codex              128  110     9    9  92%      8  read 54, bash 31, write 18
  google/gemini-3-pro              64   55     6    3  90%      7  read 27, web_search 14, bash 10

--- BY TOOL ---
  tool            calls  ok  fail  unk  ok%  best model                    worst model
  read              199 197     0    2 100%  anthropic/claude-sonnet-4-5   -
  bash              136 119    11    6  92%  anthropic/claude-sonnet-4-5   openai/gpt-5-codex 5 fail
  edit               61  49     9    3  84%  anthropic/claude-sonnet-4-5   google/gemini-3-pro 3 fail
  write              28  27     0    1 100%  openai/gpt-5-codex            -

--- FAILURE HOTSPOTS ---
  model                         tool   fail  calls  ok%  likely meaning
  anthropic/claude-sonnet-4-5   edit      6     43  86%  exact-match misses / stale file reads
  openai/gpt-5-codex            bash      5     31  84%  command/environment failures
  google/gemini-3-pro           edit      3      9  67%  edits may need smaller context or reads

--- MODEL × TOP TOOLS ---
  model                         read       bash       edit       write      other
  anthropic/claude-sonnet-4-5   118/118    86/91      37/43      14/14      31/46
  openai/gpt-5-codex             53/54      26/31      11/13      18/18       2/12
  google/gemini-3-pro            27/27       8/10       6/9        0/0       14/18
  cell format: ok/calls
```
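
The FAILURE HOTSPOTS ranking could be computed along these lines. This is a sketch that assumes attempts have already been joined with their outcomes; `failureHotspots`, the `Attempt` shape, the default limit of 3, and the tie-break on lower `ok%` are hypothetical choices. The `likely meaning` column is not derived here, since the proposal does not say how it would be produced.

```typescript
type Outcome = "ok" | "fail" | "unknown";
type Attempt = { model: string; tool: string; outcome: Outcome };
type Hotspot = { model: string; tool: string; fail: number; calls: number; okPct: number };

function failureHotspots(attempts: Attempt[], limit = 3): Hotspot[] {
  // One cell per (model, tool) pair.
  const cells = new Map<string, { model: string; tool: string; ok: number; fail: number; calls: number }>();
  for (const a of attempts) {
    const key = JSON.stringify([a.model, a.tool]);
    let cell = cells.get(key);
    if (!cell) {
      cell = { model: a.model, tool: a.tool, ok: 0, fail: 0, calls: 0 };
      cells.set(key, cell);
    }
    cell.calls += 1;
    if (a.outcome === "ok") cell.ok += 1;
    else if (a.outcome === "fail") cell.fail += 1;
  }

  return Array.from(cells.values())
    .filter((c) => c.fail > 0) // also guarantees a non-zero ok% denominator
    .map((c) => ({
      model: c.model,
      tool: c.tool,
      fail: c.fail,
      calls: c.calls,
      okPct: Math.round((100 * c.ok) / (c.ok + c.fail)),
    }))
    // Most failures first; on a tie, the lower ok% ranks higher.
    .sort((a, b) => b.fail - a.fail || a.okPct - b.okPct)
    .slice(0, limit);
}
```

Ranking by raw failure count keeps high-volume problems at the top; a small-sample pair such as 3 failures in 9 calls still surfaces, but below pairs with more failures.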

## Alternate presentation options

### Option A: a single, larger `/tool-stats`

Add all model/tool tables into the existing default report.

- Pro: one command only.
- Con: mixes pruning with model behavior and can get long quickly.

### Option B: interactive drill-down menu

Use `ctx.ui.select()` as a table of contents:

```text
Usage Stats (30d)
  Summary dashboard
  Model/tool snapshot
  By model
  By tool
  Failure hotspots
  Prune candidates
```

- Pro: shortest screens; feels like a mini dashboard.
- Con: more clicks, harder to copy/share, and more implementation complexity.

### Option C: recommended hybrid

Default `/tool-stats [days]` gets only the compact snapshot; a detail command/subview shows the full model/tool report.

- Pro: brief by default, deep on demand, compatible with the existing scrollable selector.
- Con: introduces one more subcommand/alias to document.
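
Routing between the brief report and the detail view could look roughly like this. The sketch assumes the command handler receives everything after `/tool-stats` as a single string; `parseToolStatsArgs`, the `View` names, and the 30-day fallback (taken from the window shown in the mockups) are hypothetical.

```typescript
type View = "summary" | "model-tools";
type ParsedArgs = { view: View; days: number };

const DEFAULT_DAYS = 30; // assumption: matches the 30d window in the mockups

// Accepts "", "14", "model-tools", "model-tools 14", "mt", or "mt 14".
function parseToolStatsArgs(args: string): ParsedArgs {
  const tokens = args.trim().split(/\s+/).filter((t) => t.length > 0);

  let view: View = "summary";
  if (tokens[0] === "model-tools" || tokens[0] === "mt") {
    view = "model-tools";
    tokens.shift(); // the next token, if any, is the day count
  }

  // Only a positive whole number is accepted; anything else uses the default.
  const raw = tokens[0] ?? "";
  const days = /^[1-9]\d*$/.test(raw) ? Number(raw) : DEFAULT_DAYS;
  return { view, days };
}
```

With this shape, `/tool-stats 7` keeps its existing meaning, while `/tool-stats mt 7` and `/tool-stats model-tools 7` both resolve to the detail view over a 7-day window.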

## Recommendation

Implement Option C first.

It gives model/tool visibility immediately without disrupting the existing prune report. If the detailed report proves too long, we can later wrap it in the interactive drill-down menu without changing the underlying metrics schema.
