# Running scorers in CI

Running scorers in your CI pipeline provides quantifiable metrics for measuring agent quality over time. The `runEvals` function processes multiple test cases through your agent or workflow and returns aggregate scores.

## Basic setup

You can use any testing framework that supports ES modules (ESM), such as [Vitest](https://vitest.dev/), [Jest](https://jestjs.io/), or [Mocha](https://mochajs.org/).
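As one possible starting point with Vitest, the sketch below shows a minimal `vitest.config.ts`. The `*.eval.test.ts` file pattern and the 60-second timeout are assumptions chosen for illustration, not values Mastra requires. A longer timeout is often useful because each test case may call a model.

```typescript
// vitest.config.ts (illustrative sketch; adjust the values to your project)
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // Assumed naming convention for eval tests; any pattern works.
    include: ['src/**/*.eval.test.ts'],
    // Assumed value: gives slow agent calls more room than Vitest's default timeout.
    testTimeout: 60_000,
  },
})
```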

## Creating test cases

Use `runEvals` to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an `input` and optional `groundTruth` for scorer validation.

```typescript
import { describe, it, expect } from 'vitest'
import { runEvals } from '@mastra/core/evals'
import { weatherAgent } from './weather-agent'
import { locationScorer } from '../scorers/location-scorer'

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' },
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    // Assert that every item scored 1 and that all three items were processed
    expect(result.scores['location-accuracy']).toBe(1)
    expect(result.summary.totalItems).toBe(3)
  })
})
```
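As the list of cases grows, it may help to keep them in a compact table and convert them into the `{ input, groundTruth }` shape before passing them to `runEvals`. The sketch below is one way to do that. `EvalItem`, `LocationTruth`, `LocationCase`, and `toEvalItems` are hypothetical names introduced here for illustration and are not exported by `@mastra/core`.

```typescript
// Hypothetical types for illustration; they are not part of @mastra/core.
interface LocationTruth {
  expectedLocation: string
  expectedCountry: string
}

interface EvalItem<T> {
  input: string
  groundTruth?: T
}

// Each row is [query, expected location, expected country code].
type LocationCase = [query: string, location: string, country: string]

// Convert compact rows into the { input, groundTruth } items used in the test above.
function toEvalItems(cases: LocationCase[]): EvalItem<LocationTruth>[] {
  return cases.map(([query, expectedLocation, expectedCountry]) => ({
    input: query,
    groundTruth: { expectedLocation, expectedCountry },
  }))
}

const data = toEvalItems([
  ['weather in Berlin', 'Berlin', 'DE'],
  ['weather in Berlin, Maryland', 'Berlin', 'US'],
  ['weather in Berlin, Russia', 'Berlin', 'RU'],
])

console.log(data.length) // 3
```

The resulting `data` array could then be passed as the `data` option in the test above, which keeps each new case to a single line.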

## Understanding results

The `runEvals` function returns an object with:

- `scores`: Average scores for each scorer across all test cases
- `summary.totalItems`: Total number of test cases processed

```typescript
{
  scores: {
    'location-accuracy': 1.0,  // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
```
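Because `scores` holds averages, a CI gate may prefer a minimum threshold per scorer over an exact match. The sketch below checks a `scores` object of the shape shown above against such minimums. `findThresholdFailures` and the threshold values are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helper, not part of @mastra/core: compares average scores
// against per-scorer minimums and reports the ones that fall short.
type Scores = Record<string, number>

interface ThresholdFailure {
  scorer: string
  actual: number | undefined
  minimum: number
}

function findThresholdFailures(scores: Scores, minimums: Scores): ThresholdFailure[] {
  const failures: ThresholdFailure[] = []
  for (const [scorer, minimum] of Object.entries(minimums)) {
    const actual = scores[scorer]
    // A missing score counts as a failure so a misnamed scorer cannot pass silently.
    if (actual === undefined || actual < minimum) {
      failures.push({ scorer, actual, minimum })
    }
  }
  return failures
}

// Example with the result shape shown above and assumed minimums.
const failures = findThresholdFailures(
  { 'location-accuracy': 1.0, 'another-scorer': 0.85 },
  { 'location-accuracy': 1.0, 'another-scorer': 0.9 },
)

console.log(failures.map((f) => f.scorer)) // [ 'another-scorer' ]
```

In a test, asserting that the returned array is empty gives a single check whose failure output names every scorer that fell below its minimum.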

## Multiple test scenarios

Create separate test cases for different evaluation scenarios:

```typescript
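import { describe, it, expect } from 'vitest'
import { createScorer, runEvals } from '@mastra/core/evals'
import { weatherAgent } from './weather-agent'

describe('Weather Agent Tests', () => {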
  const locationScorer = createScorer({
    /* ... */
  })

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: {
            /* ... */
          },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: {
            /* ... */
          },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berln',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Parris',
          groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })
})
```

## Next steps

- Learn about [creating custom scorers](https://mastra.ai/docs/evals/custom-scorers)
- Explore [built-in scorers](https://mastra.ai/docs/evals/built-in-scorers)
- Read the [runEvals API reference](https://mastra.ai/reference/evals/run-evals)