# pi-image-subagent

A [Pi](https://github.com/badlogic/pi-mono) extension that gives **non-vision models the ability to analyze images** by delegating to a vision-capable subagent.

If you're running a model that can't see images — a code model, a small local model, anything without image input support — this extension adds an `analyze_image` tool that your agent can call to hand off image analysis to a model that *can* see.

## The Problem

Most coding agents run on text-only models. When you paste a screenshot, a mockup, or any image into the Pi TUI, the model can't do anything with it — it just sees a file path. This extension bridges that gap.

## How It Works

```
  You paste an image or reference a file
                 │
                 ▼
  ┌─────────────────────────────────────┐
  │  Your agent (non-vision model)      │
  │                                     │
  │  "What's in this screenshot?"       │
  │  calls: analyze_image({             │
  │    images: ["/tmp/ss.png"],         │
  │    question: "What's shown?"        │
  │  })                                 │
  └──────────────┬──────────────────────┘
                 │
                 ▼
  ┌─────────────────────────────────────┐
  │  Vision subagent (gemma4:31b-cloud) │
  │                                     │
  │  Has only the `read` tool.          │
  │  1. Reads each image file           │
  │  2. Sees the image content          │
  │  3. Answers the question            │
  │  4. Returns plain text description  │
  └──────────────┬──────────────────────┘
                 │
                 ▼
  Your agent gets back a text description
  and continues as if it saw the image
```

- The subagent runs in its own isolated `pi` process — no shared context, no session pollution.
- It only has the `read` tool — it can't modify files, run commands, or do anything beyond looking at the images you specified.
- Each call is completely stateless.

## Installation

### Symlink (recommended)

With a symlink, edits to the source file take effect as soon as you run `/reload`:

```bash
mkdir -p ~/.pi/agent/extensions/analyze-image
ln -sf "$(pwd)/analyze-image/index.ts" ~/.pi/agent/extensions/analyze-image/index.ts
```

### Copy

```bash
mkdir -p ~/.pi/agent/extensions/analyze-image
cp analyze-image/index.ts ~/.pi/agent/extensions/analyze-image/index.ts
```

### Verify

Restart Pi (or run `/reload`). You should see `analyze-image` listed under Extensions in the startup output. The `analyze_image` tool will be available to the model.

## Configuration

The extension works out of the box with defaults. To customize, create a config file:

```
~/.pi/agent/extensions/analyze-image/config.json
```

A starter config is provided at `analyze-image/config.example.json`; copy it and adjust:

```bash
cp analyze-image/config.example.json ~/.pi/agent/extensions/analyze-image/config.json
```

### Options

| Field | Type | Default | Description |
|---|---|---|---|
| `defaultModel` | `string` | `"gemma4:31b-cloud"` | The vision-capable model the subagent uses |
| `systemPrompt` | `string` | *(see source)* | System prompt sent to the subagent |
| `maxImagesPerCall` | `number` | `10` | Maximum number of images per single call |
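
Putting these together, a complete config overriding all three fields might look like the following sketch (the prompt text and limit here are illustrative, not the shipped defaults):

```json
{
  "defaultModel": "gemma4:31b-cloud",
  "systemPrompt": "You analyze images. Read every image file you are given with the read tool, then answer the question. Do nothing else.",
  "maxImagesPerCall": 5
}
```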

#### Changing the vision model

The default is `gemma4:31b-cloud`. Change it to any model your Pi installation can access that supports image input:

```json
{
  "defaultModel": "anthropic/claude-sonnet-4"
}
```

```json
{
  "defaultModel": "google/gemini-2.0-flash"
}
```

Run `pi --list-models` to see what's available. The model must support image input (`input: ["text", "image"]`).

#### Tuning the system prompt

The default system prompt is strict — it tells the subagent to read every image, answer the question, and do nothing else. If you want the subagent to take a different approach (e.g., be more creative, focus on specific aspects, output in a particular format), override it:

```json
{
  "defaultModel": "gemma4:31b-cloud",
  "systemPrompt": "You describe images for a visually impaired user. Be thorough and empathetic. Read every image file before describing it."
}
```

The config is reloaded at the start of each Pi session, so you can edit it while Pi is running; the changes take effect on the next `/new` or restart.

## The Tool

### `analyze_image`

**Parameters:**

| Parameter | Type | Required | Description |
|---|---|---|---|
| `images` | `string[]` | **yes** | One or more local file paths to images |
| `question` | `string` | **yes** | What you want to know about the image(s) |
| `model` | `string` | no | Override the configured default model for this call |

**Supported formats:** PNG, JPG, JPEG, GIF, WebP, BMP

**Returns:** Plain text — the vision model's answer to your question.
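
For reference, a single call that exercises every parameter could look like this (the path and question are placeholders):

```json
{
  "images": ["/tmp/screenshot.png"],
  "question": "What error message is shown?",
  "model": "anthropic/claude-sonnet-4"
}
```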

## Usage

You don't call `analyze_image` yourself. The LLM calls it when it needs to understand an image. Your job is to give the agent a reason to look at an image.

### Pasting from clipboard

1. Copy an image to your clipboard (screenshot, browser, etc.)
2. In the Pi TUI, press **Ctrl+V** — Pi saves the image to a temp file and inserts the path
3. Ask your question:

```
I just pasted a screenshot. What UI components are visible?
```

The agent will call `analyze_image` with the temp file path and your question.
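
Concretely, the call will look something like this (the temp path below is made up; Pi generates its own):

```json
{
  "images": ["/tmp/pi-paste-1.png"],
  "question": "What UI components are visible in this screenshot?"
}
```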

### Referencing a file directly

```
What text is shown in /home/me/Desktop/error-message.png?
```

```
Is there anything unusual about the layout in ./mockup-v3.webp compared to a typical settings page?
```

### Multiple images at once

The agent can batch multiple images in a single call — all images go to the same subagent, so the vision model can compare and cross-reference them:

```
Compare the two screenshots in /tmp/before.png and /tmp/after.png — what changed?
```

The LLM calls:

```json
{
  "images": ["/tmp/before.png", "/tmp/after.png"],
  "question": "What changed between these two screenshots?"
}
```

### Overriding the model for one call

You can tell the agent to use a different model for a specific analysis:

```
Use analyze_image with the model anthropic/claude-sonnet-4 to read the chart in /tmp/revenue-q4.png and tell me the top 3 quarters.
```
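
The resulting call would look something like:

```json
{
  "images": ["/tmp/revenue-q4.png"],
  "question": "Read the chart and tell me the top 3 quarters.",
  "model": "anthropic/claude-sonnet-4"
}
```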

### What the agent sees in the TUI

**During the call:**

```
analyze_image 2 images
  What changed between these two screenshots?
  📄 before.png
  📄 after.png
```

**When the result comes back:**

```
✓ analyze_image (gemma4:31b-cloud) 3.2s
The two screenshots show a settings page. In the "after" version,
the navigation sidebar has been reorganized — the "Account" section
was moved above "Privacy" and a new "Notifications" entry was added...
```

Press **Ctrl+O** to expand the full output if it's long.

## Requirements

- [Pi](https://github.com/badlogic/pi-mono) coding agent installed and working
- At least one vision-capable model configured in Pi (the default is `gemma4:31b-cloud` — change this if you don't have it)
- The vision model's API key must be set (e.g., `GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)

## Limitations

- **Local files only** — the tool accepts file paths on your machine, not URLs. If you need to analyze a remote image, download it first.
- **No resize** — images are passed through at original resolution. Very large images may hit context limits in the vision model.
- **Two-turn subagent** — the subagent first calls `read` on each image, then answers the question. This means it uses two LLM turns per call.
- **The subagent must call `read`** — the vision model needs to be smart enough to follow the system prompt's instruction to read the image files before answering. This works well with most capable vision models, but a very weak model might skip the read step and hallucinate.
- **`--tools read`** — the subagent only has the `read` tool. It cannot run bash commands, edit files, or do anything else. This is a security feature, not a bug.

## Troubleshooting

| Problem | Solution |
|---|---|
| Extension doesn't show in startup | Check the file is at `~/.pi/agent/extensions/analyze-image/index.ts` and run `/reload` |
| `analyze_image` tool not available | Pi only exposes tools to models that support tool calling. Make sure your model does |
| "Image file not found" | Use absolute paths, or paths relative to where you launched `pi` |
| "Subagent failed with no output" | The vision model likely isn't configured. Check your API key and run `pi --list-models` |
| Subagent doesn't read the images | The vision model may be too weak to follow the system prompt. Try a more capable model |
| Analysis takes a long time | Large images + slow model = wait. You can Ctrl+C to abort mid-analysis |

## Project Structure

```
analyze-image/
├── index.ts              # Extension source code
├── config.example.json   # Starter config — copy to ~/.pi/agent/extensions/analyze-image/
└── README.md             # This file
```

## License

MIT