# runEvals

The `runEvals` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.

## Usage example

```typescript
import { runEvals } from '@mastra/core/evals'
import { myAgent } from './agents/my-agent'
import { myScorer1, myScorer2 } from './scorers'

const result = await runEvals({
  target: myAgent,
  data: [
    { input: 'What is machine learning?' },
    { input: 'Explain neural networks' },
    { input: 'How does AI work?' },
  ],
  scorers: [myScorer1, myScorer2],
  targetOptions: { maxSteps: 5 },
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`)
    console.log(`Scores:`, scorerResults)
  },
})

console.log(`Average scores:`, result.scores)
console.log(`Processed ${result.summary.totalItems} items`)
```

## Parameters

**target** (`Agent | Workflow`): The agent or workflow to evaluate.

**data** (`RunEvalsDataItem[]`): Array of test cases with input data and optional ground truth.

**scorers** (`MastraScorer[] | AgentScorerConfig | WorkflowScorerConfig`): Scorers to use. A flat array applies all scorers to the raw output. For agents, an `AgentScorerConfig` object separates agent-level and trajectory scorers. For workflows, a `WorkflowScorerConfig` object specifies scorers for the workflow, individual steps, and trajectory.

**targetOptions** (`AgentExecutionOptions | WorkflowRunOptions`): Options forwarded to the target during execution. For agents: options passed to `agent.generate()` (e.g. `maxSteps`, `modelSettings`, `instructions`). For workflows: options passed to `run.start()` (e.g. `perStep`, `outputOptions`, `initialState`).

**concurrency** (`number`): Number of test cases to run concurrently. (Default: `1`)

**onItemComplete** (`function`): Callback function called after each test case completes. Receives item, target result, and scorer results.

## Data item structure

**input** (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data.

**groundTruth** (`any`): Expected or reference output for comparison during scoring.

**expectedTrajectory** (`TrajectoryExpectation`): Expected trajectory configuration for trajectory scoring. Includes expected steps, ordering, efficiency budgets, blacklists, and tool failure tolerance. Passed to trajectory scorers as `run.expectedTrajectory`. Overrides the static defaults in scorer constructors.

**requestContext** (`RequestContext`): Request context to pass to the target during execution.

**tracingContext** (`TracingContext`): Tracing context for observability and debugging.

**startOptions** (`WorkflowRunOptions`): Per-item workflow run options (e.g. initialState, perStep, outputOptions). Merged on top of targetOptions, so per-item values take precedence. Only applicable when the target is a workflow.
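Putting the fields above together, a fully populated data item might look like the sketch below. The field names match this reference; the concrete values (tool name, context keys) are illustrative only, not from a real project:

```typescript
// Hypothetical data item for an agent target. Every field except `input`
// is optional; values here are placeholders for illustration.
const dataItem = {
  input: 'What is the weather in Paris?',
  groundTruth: 'A response that mentions current conditions in Paris.',
  expectedTrajectory: {
    steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
  },
  requestContext: { userId: 'user-123' },
}

console.log(dataItem.expectedTrajectory.steps[0].name)
```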

## Agent scorer configuration

For agents, use `AgentScorerConfig` to separate agent-level scorers from trajectory scorers:

**agent** (`MastraScorer[]`): Scorers that receive the raw agent output (`MastraDBMessage[]`). Use for evaluating response quality, content, etc.

**trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory object. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested tool calls and model generations). Otherwise, it falls back to extracting tool calls from agent messages.

## Workflow scorer configuration

For workflows, use `WorkflowScorerConfig` to specify scorers at different levels:

**workflow** (`MastraScorer[]`): Scorers to evaluate the entire workflow output.

**steps** (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs.

**trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory from the workflow execution. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested agent runs and tool calls within workflow steps). Otherwise, it falls back to extracting step results from the workflow output.

## Returns

**scores** (`Record<string, any>`): Average scores across all test cases, organized by scorer name.

**summary** (`object`): Summary information about the experiment execution.

**summary.totalItems** (`number`): Total number of test cases processed.
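As a sketch of how you might consume the return value: the object below mirrors the shape documented above, with a hypothetical scorer name and score value standing in for real results.

```typescript
// Hypothetical result object matching the documented return shape.
const result = {
  scores: { 'my-scorer': { score: 0.85 } },
  summary: { totalItems: 3 },
}

// Iterate average scores, keyed by scorer name.
for (const [scorerName, value] of Object.entries(result.scores)) {
  console.log(`${scorerName}: ${value.score}`)
}
console.log(`Processed ${result.summary.totalItems} items`)
```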

## Examples

### Agent evaluation

```typescript
import { createScorer, runEvals } from '@mastra/core/evals'

const myScorer = createScorer({
  id: 'my-scorer',
  description: "Check if Agent's response contains ground truth",
  type: 'agent',
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || ''
  const expectedResponse = run.groundTruth
  return response.includes(expectedResponse) ? 1 : 0
})

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is AI?',
      groundTruth: 'AI is a field of computer science that creates intelligent machines.',
    },
    {
      input: 'How does machine learning work?',
      groundTruth: 'Machine learning uses algorithms to learn patterns from data.',
    },
  ],
  scorers: [myScorer],
  concurrency: 3,
})
```

### Agent trajectory evaluation

Use `AgentScorerConfig` to evaluate both the agent response and its tool-calling trajectory:

```typescript
import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory'

const trajectoryScorer = createTrajectoryAccuracyScorerCode()

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is the weather in London?',
      expectedTrajectory: {
        steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
      },
    },
  ],
  scorers: {
    // agent: [responseQualityScorer], // Optional: add agent-level scorers
    trajectory: [trajectoryScorer],
  },
})

// result.scores.agent — average agent-level scores
// result.scores.trajectory — average trajectory scores
```

### Agent with `targetOptions`

Pass execution options like `maxSteps` or `modelSettings` to customize agent behavior during evaluation:

```typescript
const result = await runEvals({
  target: chatAgent,
  data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }],
  scorers: [relevancyScorer],
  targetOptions: {
    maxSteps: 5,
    modelSettings: { temperature: 0 },
  },
})
```

### Workflow evaluation

```typescript
const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    { input: { query: 'Process this data', priority: 'high' } },
    { input: { query: 'Another task', priority: 'low' } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      'validation-step': [validationScorer],
      'processing-step': [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`)
    if (scorerResults.workflow) {
      console.log('Workflow scores:', scorerResults.workflow)
    }
    if (scorerResults.steps) {
      console.log('Step scores:', scorerResults.steps)
    }
  },
})
```

### Workflow trajectory evaluation

Add trajectory scoring to workflow evaluations to validate step execution order:

```typescript
const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'Process this data' },
      expectedTrajectory: {
        steps: [
          { stepType: 'workflow_step', name: 'validate' },
          { stepType: 'workflow_step', name: 'process' },
          { stepType: 'workflow_step', name: 'output' },
        ],
      },
    },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      validate: [validationScorer],
    },
    trajectory: [trajectoryScorer],
  },
})

// workflowResult.scores.trajectory — workflow trajectory scores
```

### Workflow with per-item `startOptions`

Use `startOptions` on individual data items to customize each workflow run. Per-item values take precedence over `targetOptions`:

```typescript
const result = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'hello' },
      startOptions: { initialState: { counter: 1 } },
    },
    {
      input: { query: 'world' },
      startOptions: { initialState: { counter: 2 } },
    },
  ],
  scorers: [outputQualityScorer],
  targetOptions: { perStep: true },
})
```

## Related

- [createScorer()](https://mastra.ai/reference/evals/create-scorer): Create custom scorers for experiments
- [MastraScorer](https://mastra.ai/reference/evals/mastra-scorer): Learn about scorer structure and methods
- [Trajectory Accuracy](https://mastra.ai/reference/evals/trajectory-accuracy): Built-in trajectory evaluation scorers
- [Scorer Utilities](https://mastra.ai/reference/evals/scorer-utils): Helper functions for extracting trajectory data
- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers): Guide to building evaluation logic
- [Scorers Overview](https://mastra.ai/docs/evals/overview): Understanding scorer concepts