# Scorer utils

Mastra provides utility functions to help extract and process data from scorer run inputs and outputs. These utilities are particularly useful in the `preprocess` step of custom scorers.

## Import

```typescript
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getSystemMessagesFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
  extractInputMessages,
  extractAgentResponseMessages,
  compareTrajectories,
  createTrajectoryTestRun,
  createTestMessage,
  createAgentTestRun,
  checkTrajectoryEfficiency,
  checkTrajectoryBlacklist,
  analyzeToolFailures,
} from '@mastra/evals/scorers/utils'
```

Trajectory extraction functions are available from `@mastra/core/evals`:

```typescript
import {
  extractTrajectory,
  extractWorkflowTrajectory,
  extractTrajectoryFromTrace,
} from '@mastra/core/evals'
```

## Message extraction

### `getAssistantMessageFromRunOutput`

Extracts the text content from the first assistant message in the run output.

```typescript
const scorer = createScorer({
  id: 'my-scorer',
  description: 'My scorer',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const response = getAssistantMessageFromRunOutput(run.output)
    return { response }
  })
  .generateScore(({ results }) => {
    return results.preprocessStepResult?.response ? 1 : 0
  })
```

**output** (`ScorerRunOutputForAgent`): The scorer run output (array of `MastraDBMessage`)

**Returns:** `string | undefined` - The assistant message text, or undefined if no assistant message is found.

### `getUserMessageFromRunInput`

Extracts the text content from the first user message in the run input.

```typescript
.preprocess(({ run }) => {
  const userMessage = getUserMessageFromRunInput(run.input);
  return { userMessage };
})
```

**input** (`ScorerRunInputForAgent`): The scorer run input containing input messages

**Returns:** `string | undefined` - The user message text, or undefined if no user message is found.

### `extractInputMessages`

Extracts text content from all input messages as an array.

```typescript
.preprocess(({ run }) => {
  const allUserMessages = extractInputMessages(run.input);
  return { conversationHistory: allUserMessages.join("\n") };
})
```

**Returns:** `string[]` - Array of text strings from each input message.

### `extractAgentResponseMessages`

Extracts text content from all assistant response messages as an array.

```typescript
.preprocess(({ run }) => {
  const allResponses = extractAgentResponseMessages(run.output);
  return { allResponses };
})
```

**Returns:** `string[]` - Array of text strings from each assistant message.

## Reasoning extraction

### `getReasoningFromRunOutput`

Extracts reasoning text from the run output. This is particularly useful when evaluating responses from reasoning models like `deepseek-reasoner` that produce chain-of-thought reasoning.

Reasoning can be stored in two places:

1. `content.reasoning` - a string field on the message content
2. `content.parts` - as parts with `type: 'reasoning'` containing `details`
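
The two shapes, and the lookup order between them, can be sketched like this. The types and the `readReasoning` helper below are simplified illustrations, not the full `MastraDBMessage` shape or the library's implementation:

```typescript
// Simplified, illustrative shapes (not the full MastraDBMessage type)
type ReasoningPart = {
  type: 'reasoning'
  details: Array<{ type: 'text'; text: string }>
}

type Content = { reasoning?: string; parts?: ReasoningPart[] }

// Shape 1: reasoning as a string field on the content
const viaField: Content = { reasoning: 'Check the forecast first' }

// Shape 2: reasoning as parts with type 'reasoning' containing details
const viaParts: Content = {
  parts: [
    {
      type: 'reasoning',
      details: [{ type: 'text', text: 'Check the forecast first' }],
    },
  ],
}

// Minimal sketch of such a lookup (not the library implementation)
function readReasoning(content: Content): string | undefined {
  if (content.reasoning) return content.reasoning
  const text = (content.parts ?? [])
    .filter((p) => p.type === 'reasoning')
    .flatMap((p) => p.details)
    .map((d) => d.text)
    .join('\n')
  return text || undefined
}
```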

```typescript
import {
  getReasoningFromRunOutput,
  getAssistantMessageFromRunOutput,
} from '@mastra/evals/scorers/utils'

const reasoningQualityScorer = createScorer({
  id: 'reasoning-quality',
  name: 'Reasoning Quality',
  description: 'Evaluates the quality of model reasoning',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const reasoning = getReasoningFromRunOutput(run.output)
    const response = getAssistantMessageFromRunOutput(run.output)
    return { reasoning, response }
  })
  .analyze(({ results }) => {
    const { reasoning } = results.preprocessStepResult || {}
    return {
      hasReasoning: !!reasoning,
      reasoningLength: reasoning?.length || 0,
      hasStepByStep: reasoning?.includes('step') || false,
    }
  })
  .generateScore(({ results }) => {
    const { hasReasoning, reasoningLength } = results.analyzeStepResult || {}
    if (!hasReasoning || !reasoningLength) return 0
    // Score based on reasoning length (normalized to 0-1)
    return Math.min(reasoningLength / 500, 1)
  })
  .generateReason(({ results, score }) => {
    const { hasReasoning, reasoningLength } = results.analyzeStepResult || {}
    if (!hasReasoning) {
      return 'No reasoning was provided by the model.'
    }
    return `Model provided ${reasoningLength} characters of reasoning. Score: ${score}`
  })
```

**output** (`ScorerRunOutputForAgent`): The scorer run output (array of `MastraDBMessage`)

**Returns:** `string | undefined` - The reasoning text, or undefined if no reasoning is present.

## System message extraction

### `getSystemMessagesFromRunInput`

Extracts all system messages from the run input, including both standard system messages and tagged system messages (specialized prompts like memory instructions).

```typescript
.preprocess(({ run }) => {
  const systemMessages = getSystemMessagesFromRunInput(run.input);
  return {
    systemPromptCount: systemMessages.length,
    systemPrompts: systemMessages
  };
})
```

**Returns:** `string[]` - Array of system message strings.

### `getCombinedSystemPrompt`

Combines all system messages into a single prompt string, joined with double newlines.

```typescript
.preprocess(({ run }) => {
  const fullSystemPrompt = getCombinedSystemPrompt(run.input);
  return { fullSystemPrompt };
})
```

**Returns:** `string` - Combined system prompt string.
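
Because the messages are joined with double newlines, the combined prompt for two system messages can be sketched in plain TypeScript (illustrative of the joining behavior only, not the library code):

```typescript
const systemMessages = [
  'You are a helpful assistant.',
  'Always answer in English.',
]

// Each system message is joined with "\n\n", per the description above
const combined = systemMessages.join('\n\n')
// combined === 'You are a helpful assistant.\n\nAlways answer in English.'
```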

## Tool call extraction

### `extractToolCalls`

Extracts information about all tool calls from the run output, including tool names, call IDs, and their positions in the message array.

```typescript
const toolUsageScorer = createScorer({
  id: 'tool-usage',
  description: 'Evaluates tool usage patterns',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const { tools, toolCallInfos } = extractToolCalls(run.output)
    return {
      toolsUsed: tools,
      toolCount: tools.length,
      toolDetails: toolCallInfos,
    }
  })
  .generateScore(({ results }) => {
    const { toolCount } = results.preprocessStepResult || {}
    // Score based on appropriate tool usage
    return (toolCount ?? 0) > 0 ? 1 : 0
  })
```

**Returns:**

```typescript
{
  tools: string[];           // Array of tool names
  toolCallInfos: ToolCallInfo[];  // Detailed tool call information
}
```

Where `ToolCallInfo` is:

```typescript
type ToolCallInfo = {
  toolName: string // Name of the tool
  toolCallId: string // Unique call identifier
  messageIndex: number // Index in the output array
  invocationIndex: number // Index within message's tool invocations
}
```
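
As one possible use in `preprocess`, the `toolCallInfos` array can be aggregated to spot repeated tool use. The data below is hypothetical, matching the `ToolCallInfo` shape above:

```typescript
type ToolCallInfo = {
  toolName: string
  toolCallId: string
  messageIndex: number
  invocationIndex: number
}

// Hypothetical extraction result for illustration
const toolCallInfos: ToolCallInfo[] = [
  { toolName: 'search', toolCallId: 'call-1', messageIndex: 0, invocationIndex: 0 },
  { toolName: 'search', toolCallId: 'call-2', messageIndex: 1, invocationIndex: 0 },
  { toolName: 'summarize', toolCallId: 'call-3', messageIndex: 2, invocationIndex: 0 },
]

// Count calls per tool, e.g. to penalize repeated use of the same tool
const callsPerTool = toolCallInfos.reduce<Record<string, number>>((acc, info) => {
  acc[info.toolName] = (acc[info.toolName] ?? 0) + 1
  return acc
}, {})
// callsPerTool is { search: 2, summarize: 1 }
```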

## Test utilities

These utilities help create test data for scorer development.

### `createTestMessage`

Creates a `MastraDBMessage` object for testing purposes.

```typescript
import { createTestMessage } from '@mastra/evals/scorers/utils'

const userMessage = createTestMessage({
  content: 'What is the weather?',
  role: 'user',
})

const assistantMessage = createTestMessage({
  content: 'The weather is sunny.',
  role: 'assistant',
  toolInvocations: [
    {
      toolCallId: 'call-1',
      toolName: 'weatherTool',
      args: { location: 'London' },
      result: { temp: 20 },
      state: 'result',
    },
  ],
})
```

### `createAgentTestRun`

Creates a complete test run object for testing scorers.

```typescript
import { createAgentTestRun, createTestMessage } from '@mastra/evals/scorers/utils'

const testRun = createAgentTestRun({
  inputMessages: [createTestMessage({ content: 'Hello', role: 'user' })],
  output: [createTestMessage({ content: 'Hi there!', role: 'assistant' })],
})

// Run your scorer with the test data
const result = await myScorer.run({
  input: testRun.input,
  output: testRun.output,
})
```

## Trajectory utilities

### `extractTrajectory`

Extracts a `Trajectory` from agent output messages (`MastraDBMessage[]`). Converts tool invocations into `ToolCallStep` objects. The `runEvals` pipeline calls this automatically for trajectory scorers — you only need it for direct testing.

Available from `@mastra/core/evals`.

```typescript
import { extractTrajectory } from '@mastra/core/evals'

const trajectory = extractTrajectory(agentOutputMessages)
// trajectory.steps — ToolCallStep[] extracted from toolInvocations
// trajectory.rawOutput — the original MastraDBMessage[] array
```

**Returns:** `Trajectory` — Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawOutput`.

### `extractWorkflowTrajectory`

Extracts a `Trajectory` from workflow step results. Converts `StepResult` records into `WorkflowStepStep` objects, respecting the execution path ordering.

Available from `@mastra/core/evals`.

```typescript
import { extractWorkflowTrajectory } from '@mastra/core/evals'

const trajectory = extractWorkflowTrajectory(
  workflowResult.steps, // Record<string, StepResult>
  workflowResult.stepExecutionPath, // string[] (optional)
)
// trajectory.steps — WorkflowStepStep[] in execution order
```

**Returns:** `Trajectory` — Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawWorkflowResult`.

### `extractTrajectoryFromTrace`

Builds a hierarchical `Trajectory` from observability trace spans (`SpanRecord[]`). Reconstructs the parent-child span tree and maps each span to the appropriate `TrajectoryStep` discriminated union type with nested `children`.

This is the preferred extraction method when storage is available. The `runEvals` pipeline calls this automatically when the target's `Mastra` instance has a configured storage backend. It produces richer trajectories than `extractTrajectory` or `extractWorkflowTrajectory` because it captures the full execution tree, including nested agent runs, tool calls, and model generations.

Available from `@mastra/core/evals`.

```typescript
import { extractTrajectoryFromTrace } from '@mastra/core/evals'

// After fetching a trace from the observability store
const traceData = await observabilityStore.getTrace({ traceId })
const trajectory = extractTrajectoryFromTrace(traceData.spans, rootSpanId)
// trajectory.steps — hierarchical TrajectoryStep[] with children
```

**Parameters:**

- `spans` (`SpanRecord[]`): Array of span records from a trace query.
- `rootSpanId` (`string`, optional): Span ID to use as the starting point. When omitted, uses spans with no parent.

**Returns:** `Trajectory`: Contains `steps: TrajectoryStep[]` with recursive `children` and `totalDurationMs`.

#### Span type mapping

| Span type              | Trajectory step type   | Key fields extracted                                          |
| ---------------------- | ---------------------- | ------------------------------------------------------------- |
| `TOOL_CALL`            | `tool_call`            | `toolArgs`, `toolResult`, `success`                           |
| `MCP_TOOL_CALL`        | `mcp_tool_call`        | `toolArgs`, `toolResult`, `mcpServer`, `success`              |
| `MODEL_GENERATION`     | `model_generation`     | `modelId`, `promptTokens`, `completionTokens`, `finishReason` |
| `AGENT_RUN`            | `agent_run`            | `agentId` (from entity ID)                                    |
| `WORKFLOW_RUN`         | `workflow_run`         | `workflowId` (from entity ID)                                 |
| `WORKFLOW_STEP`        | `workflow_step`        | `output`                                                      |
| `WORKFLOW_CONDITIONAL` | `workflow_conditional` | `conditionCount`, `selectedSteps`                             |
| `WORKFLOW_PARALLEL`    | `workflow_parallel`    | `branchCount`, `parallelSteps`                                |
| `WORKFLOW_LOOP`        | `workflow_loop`        | `loopType`, `totalIterations`                                 |
| `WORKFLOW_SLEEP`       | `workflow_sleep`       | `sleepDurationMs`, `sleepType`                                |
| `WORKFLOW_WAIT_EVENT`  | `workflow_wait_event`  | `eventName`, `eventReceived`                                  |
| `PROCESSOR_RUN`        | `processor_run`        | `processorId`                                                 |

Spans with types `GENERIC`, `MODEL_STEP`, `MODEL_CHUNK`, and `WORKFLOW_CONDITIONAL_EVAL` are skipped as noise.
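
The parent-child reconstruction described above can be sketched generically. The `spanId` / `parentSpanId` field names below are illustrative, not the actual `SpanRecord` shape:

```typescript
type Span = { spanId: string; parentSpanId?: string; name: string }
type TreeNode = { name: string; children: TreeNode[] }

// Group spans by parent, then recursively attach children under each root
function buildTree(spans: Span[]): TreeNode[] {
  const childrenOf = new Map<string | undefined, Span[]>()
  for (const s of spans) {
    const siblings = childrenOf.get(s.parentSpanId) ?? []
    siblings.push(s)
    childrenOf.set(s.parentSpanId, siblings)
  }
  const attach = (s: Span): TreeNode => ({
    name: s.name,
    children: (childrenOf.get(s.spanId) ?? []).map(attach),
  })
  // Spans with no parent become the roots
  return (childrenOf.get(undefined) ?? []).map(attach)
}
```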

### `compareTrajectories`

Compares an actual trajectory against an expected trajectory and returns a detailed comparison result. Used internally by `createTrajectoryAccuracyScorerCode`.

The `expected` parameter accepts either a full `Trajectory` or `{ steps: ExpectedStep[] }`. When using `ExpectedStep[]`, you can match by name only, by name plus `stepType`, or include data fields for comparison. See [Expected steps](https://mastra.ai/reference/evals/trajectory-accuracy) for details.

```typescript
import { compareTrajectories } from '@mastra/evals/scorers/utils'

// Using ExpectedStep[] (recommended for expectations)
// Data fields (e.g. toolArgs) are auto-compared when present on expected steps
const result = compareTrajectories(
  actualTrajectory,
  { steps: [{ name: 'search' }, { name: 'summarize', stepType: 'tool_call' }] },
  { allowRepeatedSteps: true },
)
// result.score — 0.0 to 1.0
// result.missingSteps — step names not found
// result.extraSteps — unexpected step names
// result.outOfOrderSteps — steps found but in wrong order
```

**Returns:** `TrajectoryComparisonResult`

### `createTrajectoryTestRun`

Creates a test run object for trajectory scorers. Wraps a `Trajectory` into the expected `ScorerRun` format.

```typescript
import { createTrajectoryTestRun } from '@mastra/evals/scorers/utils'

const run = createTrajectoryTestRun({
  steps: [
    { stepType: 'tool_call', name: 'search', toolArgs: { q: 'test' } },
    { stepType: 'tool_call', name: 'summarize' },
  ],
})

const result = await trajectoryScorer.run(run)
```

### `checkTrajectoryEfficiency`

Evaluates trajectory efficiency against step, token, and duration budgets. Also detects redundant calls (same tool with same arguments).

```typescript
import { checkTrajectoryEfficiency } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryEfficiency(trajectory, {
  maxSteps: 5,
  maxTotalTokens: 2000,
  maxTotalDurationMs: 5000,
  noRedundantCalls: true,
})
// result.score — 1.0 if within all budgets, lower with penalties
// result.redundantCalls — duplicate tool+args combos
// result.overStepBudget — true if maxSteps exceeded
// result.overTokenBudget — true if maxTotalTokens exceeded
// result.overDurationBudget — true if maxTotalDurationMs exceeded
```

**Returns:** `TrajectoryEfficiencyResult`

### `checkTrajectoryBlacklist`

Checks whether a trajectory contains forbidden tools or tool call sequences.

```typescript
import { checkTrajectoryBlacklist } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryBlacklist(trajectory, {
  blacklistedTools: ['deleteAll', 'admin-override'],
  blacklistedSequences: [['escalate', 'admin-override']],
})
// result.score — 1.0 if no violations, 0.0 if any found
// result.violatedTools — blacklisted tools that were called
// result.violatedSequences — blacklisted sequences that were detected
```

**Returns:** `TrajectoryBlacklistResult`

### `analyzeToolFailures`

Detects tool failure patterns including retries, fallbacks, and argument corrections.

```typescript
import { analyzeToolFailures } from '@mastra/evals/scorers/utils'

const result = analyzeToolFailures(trajectory, {
  maxRetriesPerTool: 2,
})
// result.score — 1.0 if no failure patterns, lower if patterns detected
// result.patterns — detected patterns (retry, fallback, arg_correction)
```

**Returns:** `ToolFailureAnalysisResult`

## Complete example

Here's a complete example showing how to use multiple utilities together:

```typescript
import { createScorer } from '@mastra/core/evals'
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
} from '@mastra/evals/scorers/utils'

const comprehensiveScorer = createScorer({
  id: 'comprehensive-analysis',
  name: 'Comprehensive Analysis',
  description: 'Analyzes all aspects of an agent response',
  type: 'agent',
})
  .preprocess(({ run }) => {
    // Extract all relevant data
    const userMessage = getUserMessageFromRunInput(run.input)
    const response = getAssistantMessageFromRunOutput(run.output)
    const reasoning = getReasoningFromRunOutput(run.output)
    const systemPrompt = getCombinedSystemPrompt(run.input)
    const { tools, toolCallInfos } = extractToolCalls(run.output)

    return {
      userMessage,
      response,
      reasoning,
      systemPrompt,
      toolsUsed: tools,
      toolCount: tools.length,
    }
  })
  .generateScore(({ results }) => {
    const { response, reasoning, toolCount } = results.preprocessStepResult || {}

    let score = 0
    if (response && response.length > 0) score += 0.4
    if (reasoning) score += 0.3
    if ((toolCount ?? 0) > 0) score += 0.3

    return score
  })
  .generateReason(({ results, score }) => {
    const { response, reasoning, toolCount } = results.preprocessStepResult || {}

    const parts: string[] = []
    if (response) parts.push('provided a response')
    if (reasoning) parts.push('included reasoning')
    if ((toolCount ?? 0) > 0) parts.push(`used ${toolCount} tool(s)`)

    return `Score: ${score}. The agent ${parts.join(', ')}.`
  })
```