---
title: Scorers
description: How Smithers evaluates task outputs using built-in and custom scorers with live and batch scoring modes.
---

Smithers ships a scoring system that lets you attach evaluation functions to tasks. Scorers run after a task finishes and produce a numeric score between 0 and 1, an optional human-readable reason, and optional metadata.

Scores are persisted in SQLite alongside your run data so you can query, aggregate, and visualize quality over time.

## Core Concepts

### Scorer

A scorer is a named function that takes a `ScorerInput` and returns a `ScoreResult`:

```ts
import { createScorer } from "smithers-orchestrator/scorers";

const myScorer = createScorer({
  id: "length-check",
  name: "Length Check",
  description: "Checks output meets minimum length",
  score: async ({ output }) => {
    const text = String(output);
    const score = Math.min(text.length / 500, 1);
    return { score, reason: `Output is ${text.length} chars` };
  },
});
```

### ScoreResult

Every scorer returns a `ScoreResult`:

| Field    | Type                        | Description                              |
|----------|-----------------------------|------------------------------------------|
| `score`  | `number` (0-1)              | Normalized quality score                 |
| `reason` | `string?`                   | Human-readable explanation               |
| `meta`   | `Record<string, unknown>?`  | Arbitrary metadata for downstream use    |
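
Because `score` is expected to stay between 0 and 1, a custom scorer that computes a raw ratio may want to normalize it before returning. The following is a minimal sketch: `ScoreResult` is redeclared locally from the table above, and `clampScore` is a hypothetical helper, not part of the Smithers API.

```ts
// Local mirror of the ScoreResult shape described in the table above.
type ScoreResult = {
  score: number;
  reason?: string;
  meta?: Record<string, unknown>;
};

// Hypothetical helper: force any number into [0, 1]; treat NaN as 0.
function clampScore(raw: number): number {
  if (Number.isNaN(raw)) return 0;
  return Math.min(1, Math.max(0, raw));
}

const result: ScoreResult = {
  score: clampScore(620 / 500), // raw ratio 1.24 is clamped to 1
  reason: "Output is 620 chars",
  meta: { rawRatio: 620 / 500 }, // keep the unclamped value for later analysis
};
```

Storing the unclamped value in `meta` is one way to keep detail that the normalized score discards.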

### ScorerInput

The input passed to every scorer function:

| Field          | Type              | Description                                |
|----------------|-------------------|--------------------------------------------|
| `input`        | `unknown`         | The original task input/prompt             |
| `output`       | `unknown`         | The task's produced output                 |
| `groundTruth`  | `unknown?`        | Expected output for comparison             |
| `context`      | `unknown?`        | Additional context (e.g. retrieved docs)   |
| `latencyMs`    | `number?`         | How long the task took in milliseconds     |
| `outputSchema` | `ZodObject?`      | The Zod schema the output should match     |
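
As an illustration of how the optional fields might be used, here is a sketch of a score function that compares `output` against `groundTruth`. The types are local mirrors of the tables above (only the fields used here), and `exactMatchScore` and its skip behavior are assumptions for the example, not a built-in.

```ts
// Local mirrors of the documented shapes (subset of fields).
type ScorerInput = {
  input: unknown;
  output: unknown;
  groundTruth?: unknown;
  latencyMs?: number;
};
type ScoreResult = { score: number; reason?: string };

// Hypothetical score function: exact match against groundTruth after trimming.
async function exactMatchScore({ output, groundTruth }: ScorerInput): Promise<ScoreResult> {
  if (groundTruth === undefined) {
    // Nothing to compare against, so skip rather than penalize.
    return { score: 1, reason: "No groundTruth provided; skipped" };
  }
  const matches = String(output).trim() === String(groundTruth).trim();
  return {
    score: matches ? 1 : 0,
    reason: matches ? "Exact match" : "Output differs from groundTruth",
  };
}
```

A function with this shape could be passed as the `score` field of `createScorer`.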

## Attaching Scorers to Tasks

Pass a `scorers` map to any `<Task>` component:

```tsx
import { latencyScorer, schemaAdherenceScorer } from "smithers-orchestrator/scorers";

<Task
  id="analyze"
  agent={claude}
  output={outputs.analysis}
  scorers={{
    latency: { scorer: latencyScorer({ targetMs: 5000, maxMs: 30000 }) },
    schema: { scorer: schemaAdherenceScorer() },
  }}
>
  Analyze the codebase and produce a summary.
</Task>
```

Scorers fire asynchronously after the task finishes. They never block the workflow.

## Sampling

Not every run needs every scorer. Use sampling to control evaluation frequency:

```tsx
scorers={{
  relevancy: {
    scorer: relevancyScorer(judge),
    sampling: { type: "ratio", rate: 0.1 },  // 10% of runs
  },
  schema: {
    scorer: schemaAdherenceScorer(),
    sampling: { type: "all" },  // every run (default)
  },
}}
```

| Sampling Type | Behavior                        |
|---------------|---------------------------------|
| `all`         | Run on every task execution     |
| `ratio`       | Run with probability `rate`     |
| `none`        | Disabled (useful for toggling)  |
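
The sampling decision can be pictured as a small predicate evaluated once per task execution. This is a sketch of the behavior in the table, not the actual Smithers implementation; `shouldRun` and its injectable `random` parameter are hypothetical.

```ts
// Local mirror of the sampling options in the table above.
type Sampling =
  | { type: "all" }
  | { type: "ratio"; rate: number }
  | { type: "none" };

// Hypothetical decision function; `random` is injectable so tests are deterministic.
function shouldRun(
  sampling: Sampling = { type: "all" },
  random: () => number = Math.random,
): boolean {
  switch (sampling.type) {
    case "all":
      return true;
    case "none":
      return false;
    case "ratio":
      // Math.random() is in [0, 1), so rate 0 never runs and rate 1 always runs.
      return random() < sampling.rate;
  }
}
```

With `rate: 0.1`, roughly one in ten executions is scored on average; any individual run is not guaranteed to be.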

## Custom Scorers

### createScorer

Build a scorer from a plain configuration object:

```ts
import { createScorer } from "smithers-orchestrator/scorers";

const myScorer = createScorer({
  id: "word-count",
  name: "Word Count",
  description: "Scores based on output word count",
  score: async ({ output }) => {
    // Trim and drop empty tokens so blank output counts as 0 words, not 1.
    const words = String(output).trim().split(/\s+/).filter(Boolean).length;
    return { score: Math.min(words / 200, 1), reason: `${words} words` };
  },
});
```

### llmJudge

Build an LLM-as-judge scorer that delegates evaluation to an AI agent. The judge receives a prompt constructed from `promptTemplate` and is expected to return JSON with `score` (0–1) and optional `reason`. If the response cannot be parsed, the scorer returns 0 with a diagnostic reason.

```ts
import { llmJudge } from "smithers-orchestrator/scorers";

const toneScorer = llmJudge({
  id: "professional-tone",
  name: "Professional Tone",
  description: "Evaluates if the output maintains a professional tone",
  judge,
  instructions: "You evaluate whether text maintains a professional, business-appropriate tone.",
  promptTemplate: ({ input, output }) =>
    `Rate the professionalism of this response (0-1 JSON).\n\nInput: ${String(input)}\n\nOutput: ${String(output)}`,
});
```

| Field             | Type                            | Description                                                    |
|-------------------|---------------------------------|----------------------------------------------------------------|
| `id`              | `string`                        | Unique scorer identifier                                       |
| `name`            | `string`                        | Human-readable name                                            |
| `description`     | `string`                        | What this scorer evaluates                                     |
| `judge`           | `AgentLike`                     | The agent that performs the evaluation                          |
| `instructions`    | `string`                        | System-level instructions prepended to the prompt              |
| `promptTemplate`  | `(input: ScorerInput) => string`| Builds the prompt from the scorer input                        |

## Built-in Scorers

Smithers includes five built-in scorers:

### Code-based (no LLM needed)

**`schemaAdherenceScorer()`** — Validates that the output conforms to the task's Zod `outputSchema`. Returns 1.0 if `safeParse` succeeds, 0.0 if it fails (with validation issues in the reason). If no `outputSchema` is set, returns 1.0 with a skip note.
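
The three cases above (no schema, parse success, parse failure) can be sketched like this. `SchemaLike` is a minimal stand-in for the part of a Zod schema the sketch relies on, and `schemaScore` and its reason strings are hypothetical.

```ts
type ScoreResult = { score: number; reason?: string };

// Minimal stand-in for a Zod schema: only safeParse and the issue messages are used.
type SchemaLike = {
  safeParse(value: unknown):
    | { success: true }
    | { success: false; error: { issues: { message: string }[] } };
};

function schemaScore(output: unknown, outputSchema?: SchemaLike): ScoreResult {
  if (!outputSchema) return { score: 1, reason: "No outputSchema set; skipped" };
  const parsed = outputSchema.safeParse(output);
  if (parsed.success) return { score: 1, reason: "Output matches schema" };
  // Surface every validation issue so the stored reason is actionable.
  const issues = parsed.error.issues.map((issue) => issue.message).join("; ");
  return { score: 0, reason: `Schema validation failed: ${issues}` };
}
```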

**`latencyScorer({ targetMs, maxMs })`** — Scores based on task execution time. Returns 1.0 at or below `targetMs`, linearly interpolates to 0.0 at `maxMs`, and returns 0.0 above `maxMs`. If no latency data is available, returns 1.0 with a skip note.
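
The interpolation rule can be written out directly. This is a sketch of the described behavior, assuming `targetMs < maxMs`; `latencyScore` and its reason strings are illustrative, not the library's own code.

```ts
type ScoreResult = { score: number; reason?: string };

// Piecewise-linear rule: 1 up to targetMs, falling to 0 at maxMs.
function latencyScore(latencyMs: number | undefined, targetMs: number, maxMs: number): ScoreResult {
  if (latencyMs === undefined) return { score: 1, reason: "No latency data; skipped" };
  if (latencyMs <= targetMs) return { score: 1, reason: `${latencyMs}ms is within target` };
  if (latencyMs >= maxMs) return { score: 0, reason: `${latencyMs}ms is at or above max` };
  // Between the bounds, the score is the remaining fraction of the allowed window.
  const score = (maxMs - latencyMs) / (maxMs - targetMs);
  return { score, reason: `${latencyMs}ms is between target and max` };
}

const halfway = latencyScore(17500, 5000, 30000).score; // 0.5, midway between target and max
```

With `targetMs: 5000` and `maxMs: 30000`, each additional 2.5 seconds past the target costs 0.1 of the score.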

### LLM-based (requires a judge agent)

All three LLM-based scorers accept an `AgentLike` as `judge`. They construct a prompt with evaluation criteria, call `judge.generate()`, and parse the JSON response.

**`relevancyScorer(judge)`** — Evaluates whether the output is relevant to and addresses the input prompt. Considers both direct answers and related context. Scores from 0.0 (completely irrelevant) to 1.0 (perfectly relevant).

**`toxicityScorer(judge)`** — Detects toxic, harmful, offensive, or inappropriate content. Checks for hate speech, harassment, threats, discriminatory language, explicit content, and dangerous instructions. The score represents the *level* of toxicity: 0.0 means clean, 1.0 means highly toxic.

**`faithfulnessScorer(judge)`** — Checks whether the output is faithful to the provided `context` without hallucinations. Every claim in the output should be supported by the context. Scores from 0.0 (entirely fabricated) to 1.0 (completely faithful). If no context is provided, evaluates internal consistency.

## Persistence

All scores are stored in the `_smithers_scorers` table:

| Column         | Type    | Description                           |
|----------------|---------|---------------------------------------|
| `id`           | TEXT    | Unique score row ID                   |
| `run_id`       | TEXT    | Parent run                            |
| `node_id`      | TEXT    | Task that was scored                  |
| `iteration`    | INTEGER | Task iteration                        |
| `attempt`      | INTEGER | Task attempt number                   |
| `scorer_id`    | TEXT    | Scorer identifier                     |
| `scorer_name`  | TEXT    | Human-readable scorer name            |
| `source`       | TEXT    | `live` or `batch`                     |
| `score`        | REAL    | The 0-1 score                         |
| `reason`       | TEXT    | Optional explanation                  |
| `meta_json`    | TEXT    | JSON metadata                         |
| `input_json`   | TEXT    | Serialized scorer input               |
| `output_json`  | TEXT    | Serialized task output                |
| `latency_ms`   | REAL    | Task execution latency                |
| `scored_at_ms` | INTEGER | When the score was computed           |
| `duration_ms`  | REAL    | How long the scorer itself took       |

## Execution Modes

### Async (live scoring)

When scorers are attached to a `<Task>`, they run via `runScorersAsync`, a fire-and-forget path that never blocks the workflow. All scorers for the task start concurrently, with no concurrency limit. Errors are logged but do not fail the task.

### Batch (offline evaluation)

For testing and offline evaluation, call `runScorersBatch` directly. It runs all scorers, waits for completion, and returns a map of key to `ScoreResult | null`:

```ts
import { runScorersBatch, schemaAdherenceScorer } from "smithers-orchestrator/scorers";

const results = await runScorersBatch(
  { schema: { scorer: schemaAdherenceScorer() } },
  { runId: "test", nodeId: "analyze", iteration: 0, attempt: 0, input: "...", output: { summary: "..." } },
  adapter,
);
// e.g. { schema: { score: 1, reason: "..." } } (no outputSchema is passed here, so the schema check is skipped)
```

Both modes persist results to the `_smithers_scorers` table, setting the `source` column to `"live"` or `"batch"` respectively.
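
To make the `ScoreResult | null` return shape concrete, here is a sketch of how a batch runner could collect results. It assumes that `null` stands for a scorer that threw; the source does not spell out what `null` means, and `collectScores` is a hypothetical function that omits persistence entirely.

```ts
type ScoreResult = { score: number; reason?: string };
type ScoreFn = () => Promise<ScoreResult>;

// Hypothetical collector: run every scorer concurrently and map failures to null.
async function collectScores(
  scorers: Record<string, ScoreFn>,
): Promise<Record<string, ScoreResult | null>> {
  const results: Record<string, ScoreResult | null> = {};
  await Promise.all(
    Object.keys(scorers).map(async (key) => {
      try {
        results[key] = await scorers[key]();
      } catch {
        // A throwing scorer is recorded as null instead of failing the whole batch.
        results[key] = null;
      }
    }),
  );
  return results;
}
```

Catching per scorer means one failure cannot hide the results of the others, which matters when a batch mixes cheap code-based checks with LLM judges that can time out.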

## Aggregation

Query aggregate statistics across runs:

```ts
import { aggregateScores } from "smithers-orchestrator/scorers";

const stats = await aggregateScores(adapter, { runId: "run-123" });
```

### Filter Options

| Filter     | Type     | Description                         |
|------------|----------|-------------------------------------|
| `runId`    | `string` | Filter to a specific run            |
| `nodeId`   | `string` | Filter to a specific task node      |
| `scorerId` | `string` | Filter to a specific scorer         |

All filters are optional and can be combined.

### Returned Statistics

Each entry in the returned array contains:

| Field        | Type     | Description                                      |
|--------------|----------|--------------------------------------------------|
| `scorerId`   | `string` | Scorer identifier                                |
| `scorerName` | `string` | Human-readable scorer name                       |
| `count`      | `number` | Total number of scores                           |
| `mean`       | `number` | Average score                                    |
| `min`        | `number` | Lowest score                                     |
| `max`        | `number` | Highest score                                    |
| `p50`        | `number` | Median score (50th percentile)                   |
| `stddev`     | `number` | Standard deviation (population)                  |
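
For reference, the statistics in the table can be computed from a list of raw scores as below. This is a sketch, not the query Smithers runs: it assumes a non-empty array and a median that averages the two middle values when the count is even.

```ts
type Stats = { count: number; mean: number; min: number; max: number; p50: number; stddev: number };

// Sketch only: assumes `scores` is non-empty.
function summarize(scores: number[]): Stats {
  const sorted = [...scores].sort((a, b) => a - b);
  const count = sorted.length;
  const mean = sorted.reduce((sum, s) => sum + s, 0) / count;
  const mid = Math.floor(count / 2);
  // Odd count: the middle value. Even count: the average of the two middle values.
  const p50 = count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation: divide by N, not N - 1.
  const variance = sorted.reduce((sum, s) => sum + (s - mean) ** 2, 0) / count;
  return { count, mean, min: sorted[0], max: sorted[count - 1], p50, stddev: Math.sqrt(variance) };
}
```

Population standard deviation treats the stored scores as the complete data set, so it reads slightly lower than the sample standard deviation on small counts.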

## Events

Three event types are emitted during the scorer lifecycle:

**`ScorerStarted`** — Emitted when a scorer begins evaluation.

| Field        | Type     |
|--------------|----------|
| `scorerId`   | `string` |
| `scorerName` | `string` |
| `nodeId`     | `string` |
| `runId`      | `string` |

**`ScorerFinished`** — Emitted when a scorer completes successfully. Includes the `score` value.

| Field        | Type     |
|--------------|----------|
| `scorerId`   | `string` |
| `scorerName` | `string` |
| `score`      | `number` |
| `nodeId`     | `string` |
| `runId`      | `string` |

**`ScorerFailed`** — Emitted when a scorer throws an error. Includes the `error`.

| Field        | Type      |
|--------------|-----------|
| `scorerId`   | `string`  |
| `scorerName` | `string`  |
| `error`      | `unknown` |
| `nodeId`     | `string`  |
| `runId`      | `string`  |
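
The three payloads fit naturally into a discriminated union. In the sketch below the payload fields come from the tables above, but the `type` discriminator and the `describeEvent` function are assumptions; check the actual event definitions for how events are tagged.

```ts
// Assumed shapes: payload fields are documented above, the `type` tag is a guess.
type ScorerEvent =
  | { type: "ScorerStarted"; scorerId: string; scorerName: string; nodeId: string; runId: string }
  | { type: "ScorerFinished"; scorerId: string; scorerName: string; score: number; nodeId: string; runId: string }
  | { type: "ScorerFailed"; scorerId: string; scorerName: string; error: unknown; nodeId: string; runId: string };

// Render one log line per event; the switch is exhaustive over the union.
function describeEvent(event: ScorerEvent): string {
  const where = `${event.runId}/${event.nodeId}`;
  switch (event.type) {
    case "ScorerStarted":
      return `[${where}] ${event.scorerName} started`;
    case "ScorerFinished":
      return `[${where}] ${event.scorerName} scored ${event.score.toFixed(2)}`;
    case "ScorerFailed":
      return `[${where}] ${event.scorerName} failed: ${String(event.error)}`;
  }
}
```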

## Metrics

Smithers tracks four Effect metrics for scorer execution:

| Metric                          | Type      | Description                                  |
|---------------------------------|-----------|----------------------------------------------|
| `smithers.scorers.started`      | Counter   | Incremented when a scorer begins             |
| `smithers.scorers.finished`     | Counter   | Incremented when a scorer completes          |
| `smithers.scorers.failed`       | Counter   | Incremented when a scorer throws             |
| `smithers.scorer.duration_ms`   | Histogram | Scorer execution time (exponential buckets, ~10ms to ~80s) |

These metrics are available through the standard Effect metric system and can be exported via OTLP. See [Monitoring and Logs](/guides/monitoring-logs).
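
To give a feel for what "exponential buckets, ~10ms to ~80s" could look like, the sketch below generates boundaries that double from 10ms. The start, factor, and count are guesses chosen to span the documented range; the real boundaries may differ.

```ts
// Hypothetical reconstruction of the histogram boundaries (in milliseconds).
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) => start * factor ** i);
}

const boundaries = exponentialBuckets(10, 2, 14);
// First boundary is 10ms; the last is 10 * 2^13 = 81920ms, about 82s.
```

Exponential spacing keeps resolution fine for fast code-based scorers while still covering slow LLM judges in a small number of buckets.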

## CLI

View scores from the command line:

```bash
# Show all scores for a run
smithers scores <run_id>

# Show scores for a specific node
smithers scores <run_id> --node analyze
```
