---
title: Voice
description: How voice providers bring speech-to-text and text-to-speech into Smithers workflows.
---

Your workflow orchestrates code reviews, generates reports, analyzes data -- all in text. But some tasks start with an audio recording or need to produce spoken output. Maybe you have a meeting recording to transcribe and analyze, or you want your pipeline to read results aloud. That is what voice providers are for.

## What Is a Voice Provider?

A voice provider wraps a speech service behind a simple interface. It can do one or more of three things:

1. **Speak** -- convert text to audio (text-to-speech / TTS)
2. **Listen** -- convert audio to text (speech-to-text / STT)
3. **Realtime** -- bidirectional audio streaming over a WebSocket (speech-to-speech)

You pick the provider, configure it once, and hand it to your tasks. Smithers handles the wiring.

```ts
import { createAiSdkVoice } from "smithers-orchestrator/voice";
import { openai } from "@ai-sdk/openai";

const voice = createAiSdkVoice({
  speechModel: openai.speech("tts-1"),
  transcriptionModel: openai.transcription("whisper-1"),
});
```

That single object now speaks and listens. The AI SDK handles the actual API calls; Smithers gives you the integration layer.

## The `<Voice>` Component

Wrap a subtree with `<Voice>` and every task inside inherits that voice provider:

```tsx
<Voice provider={voice} speaker="alloy">
  <Task id="transcribe" output={outputs.transcript} agent={myAgent}>
    Transcribe the uploaded audio file.
  </Task>
  <Task id="summarize" output={outputs.summary} agent={myAgent}>
    Summarize the transcript.
  </Task>
</Voice>
```

The `<Voice>` component does not execute anything itself. It annotates the tasks beneath it, the same way `<Worktree>` annotates tasks with a filesystem path or `<Parallel>` annotates them with concurrency limits.

Tasks inside a `<Voice>` scope receive `voice` and `voiceSpeaker` on their descriptors. The engine uses these to call `voice.listen()` when the task needs audio input or `voice.speak()` when it produces audio output.
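
One way to picture this annotation step is a tree walk that pushes the nearest enclosing voice settings down to each task. The sketch below is a simplified model, not Smithers' actual implementation: the `TreeNode` shape, the `annotate` function, and the rule that a nested scope replaces the inherited one are all assumptions. Only the `voice` and `voiceSpeaker` field names come from the description above.

```ts
// Hypothetical, simplified model of scope annotation; not Smithers' internals.
type VoiceProvider = { label: string };

type TreeNode =
  | { kind: "voice"; provider: VoiceProvider; speaker?: string; children: TreeNode[] }
  | { kind: "task"; id: string };

type TaskDescriptor = { id: string; voice?: VoiceProvider; voiceSpeaker?: string };

type VoiceScope = { voice?: VoiceProvider; voiceSpeaker?: string };

// Walk the tree, carrying the nearest enclosing voice scope down to each task.
function annotate(nodes: TreeNode[], scope: VoiceScope = {}): TaskDescriptor[] {
  const out: TaskDescriptor[] = [];
  for (const node of nodes) {
    if (node.kind === "voice") {
      // Assumption: a nested scope replaces the inherited one for its subtree.
      const inner: VoiceScope = { voice: node.provider, voiceSpeaker: node.speaker };
      for (const descriptor of annotate(node.children, inner)) out.push(descriptor);
    } else {
      // The scope node itself produces no descriptor; only tasks do.
      out.push({ id: node.id, ...scope });
    }
  }
  return out;
}

const descriptors = annotate([
  { kind: "task", id: "plain" },
  {
    kind: "voice",
    provider: { label: "openai" },
    speaker: "alloy",
    children: [
      { kind: "task", id: "transcribe" },
      { kind: "task", id: "summarize" },
    ],
  },
]);
```

In this model, `plain` ends up with no voice fields, while `transcribe` and `summarize` both carry the provider and the `"alloy"` speaker.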

## Batch vs Realtime

There are two fundamentally different modes. Batch is what most people need.

**Batch**: send a blob of text, get a blob of audio back (or vice versa). One request, one response. The AI SDK's `experimental_generateSpeech` and `experimental_transcribe` handle this. It works with OpenAI, ElevenLabs, Deepgram, and others -- any provider the AI SDK supports.

**Realtime**: open a persistent WebSocket, stream audio in both directions simultaneously. OpenAI's Realtime API does this. Latency is low, but the protocol is more complex. Smithers provides `createOpenAIRealtimeVoice()` for this case because the AI SDK does not cover it.

Most workflows should start with batch. Reach for realtime only when you need live conversation.

## Composite Voice

What if you want Deepgram for transcription but ElevenLabs for speech? Composite voice mixes providers:

```ts
import { createCompositeVoice, createAiSdkVoice } from "smithers-orchestrator/voice";
import { deepgram } from "@ai-sdk/deepgram";
import { elevenlabs } from "@ai-sdk/elevenlabs";

const listener = createAiSdkVoice({
  transcriptionModel: deepgram.transcription("nova-3"),
});
const speaker = createAiSdkVoice({
  speechModel: elevenlabs.speech("eleven_multilingual_v2"),
});

const voice = createCompositeVoice({
  input: listener,
  output: speaker,
});
```

When a task calls `voice.listen()`, it routes to Deepgram. When it calls `voice.speak()`, it routes to ElevenLabs. If you also set a `realtime` provider, it takes priority for both operations.
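
The routing rule can be sketched in a few lines. This is an illustration only: the `Provider` type uses plain strings where real providers handle audio, `createRoutingSketch` is a hypothetical name, and throwing when no provider is configured is an assumption rather than documented behavior.

```ts
// Hypothetical stand-ins; real providers return audio streams and transcripts.
type Provider = {
  speak?: (text: string) => string;
  listen?: (audio: string) => string;
};

type CompositeConfig = { input?: Provider; output?: Provider; realtime?: Provider };

// Routing rule: realtime wins if set, otherwise pick input or output by operation.
function createRoutingSketch(config: CompositeConfig) {
  return {
    speak(text: string): string {
      const target = config.realtime ?? config.output;
      // Assumption: a missing provider is an error.
      if (!target?.speak) throw new Error("No provider configured for speak()");
      return target.speak(text);
    },
    listen(audio: string): string {
      const target = config.realtime ?? config.input;
      if (!target?.listen) throw new Error("No provider configured for listen()");
      return target.listen(audio);
    },
  };
}

const deepgramStub: Provider = { listen: (audio) => `deepgram heard ${audio}` };
const elevenlabsStub: Provider = { speak: (text) => `elevenlabs says ${text}` };

const routed = createRoutingSketch({ input: deepgramStub, output: elevenlabsStub });
const heard = routed.listen("clip.wav"); // "deepgram heard clip.wav"
const spoken = routed.speak("hello");    // "elevenlabs says hello"
```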

## Effect Service Layer

For power users who build with Effect.ts directly, voice exposes an Effect service:

```ts
import { VoiceService, speak, listen } from "smithers-orchestrator/voice";
import { Effect } from "effect";

// `audioStream` and `myVoice` (any voice provider) are assumed to be defined elsewhere.

const program = Effect.gen(function* () {
  const transcript = yield* listen(audioStream);
  const audio = yield* speak(`Summary: ${transcript}`);
  return audio;
}).pipe(Effect.provideService(VoiceService, myVoice));
```

The `VoiceService` tag lets you inject a voice provider into any Effect pipeline. The `speak()` and `listen()` functions pull it from context automatically.

For scoped lifecycle management (automatic `connect()` and `close()`), use `createVoiceServiceLayer()`:

```ts
import { createVoiceServiceLayer, speak } from "smithers-orchestrator/voice";
import { Effect } from "effect";

// `realtimeVoice` is assumed to be a realtime provider created elsewhere.

const voiceLayer = createVoiceServiceLayer(realtimeVoice);

const program = Effect.gen(function* () {
  const audio = yield* speak("Hello from Effect");
  return audio;
}).pipe(Effect.provide(voiceLayer));
```

The layer calls `connect()` when the scope opens and `close()` when it closes.

## Listing Available Speakers

Every voice provider exposes `getSpeakers()`, which returns the list of voices that provider supports:

```ts
const speakers = await voice.getSpeakers();
// [{ voiceId: "alloy" }, { voiceId: "echo" }, ...]
```

For the OpenAI Realtime provider, this returns the eight built-in voices: `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, and `verse`.

For composite voice, `getSpeakers()` delegates to the realtime provider if one is set, otherwise to the output (TTS) provider. If neither is configured, it returns an empty array.
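
The delegation order for composite voice is short enough to write out. The following is a sketch of that rule with hypothetical names (`SpeakerSource`, `compositeGetSpeakers`), not the library's code:

```ts
type Speaker = { voiceId: string };
type SpeakerSource = { getSpeakers: () => Promise<Speaker[]> };

// Prefer the realtime provider, then the output (TTS) provider, else an empty list.
async function compositeGetSpeakers(config: {
  realtime?: SpeakerSource;
  output?: SpeakerSource;
}): Promise<Speaker[]> {
  const source = config.realtime ?? config.output;
  return source ? source.getSpeakers() : [];
}

// Stub providers for illustration only.
const ttsStub: SpeakerSource = {
  getSpeakers: async () => [{ voiceId: "alloy" }, { voiceId: "echo" }],
};
const realtimeStub: SpeakerSource = {
  getSpeakers: async () => [{ voiceId: "coral" }],
};

const speakersFromOutput = compositeGetSpeakers({ output: ttsStub });
const speakersFromRealtime = compositeGetSpeakers({ output: ttsStub, realtime: realtimeStub });
const speakersFromNothing = compositeGetSpeakers({});
```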

## Updating Voice Config at Runtime

You can update voice session parameters after initialization without reconnecting. Call `updateConfig` with any session-level settings the provider understands:

```ts
voice.updateConfig({
  voice: "shimmer",
  turn_detection: { type: "server_vad" },
});
```

For the OpenAI Realtime provider, `updateConfig` sends a `session.update` event over the existing WebSocket. Changes take effect for subsequent interactions in the same session. For composite voice, `updateConfig` delegates to the realtime provider.

## Manually Triggering a Realtime Response

In realtime (speech-to-speech) mode, OpenAI's server can detect speech automatically. But you can also trigger a response explicitly with `answer()`:

```ts
await voice.answer({
  modalities: ["audio"],
  instructions: "Summarize what was just said",
});
```

`answer()` sends a `response.create` event to the WebSocket. Any options you pass are forwarded as response properties. Call it when you want the model to respond immediately without waiting for voice activity detection.
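
To make the forwarding concrete, here is a sketch of the event such a call might put on the wire. It assumes the options are nested under a `response` key, as in OpenAI's `response.create` client event. The `AnswerOptions` type and `buildResponseCreate` helper are hypothetical names for illustration.

```ts
// Illustrative option shape; the real provider may accept more fields.
type AnswerOptions = { modalities?: string[]; instructions?: string };

// Build the client event that asks the model to respond immediately.
function buildResponseCreate(options: AnswerOptions = {}) {
  return { type: "response.create" as const, response: { ...options } };
}

const responseCreateJson = JSON.stringify(
  buildResponseCreate({
    modalities: ["audio"],
    instructions: "Summarize what was just said",
  }),
);
// {"type":"response.create","response":{"modalities":["audio"],"instructions":"Summarize what was just said"}}
```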

## Overriding the WebSocket URL

The default WebSocket endpoint for OpenAI Realtime is `wss://api.openai.com/v1/realtime`. Override it with the `url` config option:

```ts
const voice = createOpenAIRealtimeVoice({
  url: "wss://my-proxy.example.com/realtime",
  model: "gpt-4o-mini-realtime-preview-2024-12-17",
});
```

The model name is appended as a query parameter (`?model=...`), so the full connection URL becomes `wss://my-proxy.example.com/realtime?model=gpt-4o-mini-realtime-preview-2024-12-17`. Use this for proxies, local development stubs, or alternative endpoints.
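
The URL construction described above amounts to the following sketch. `buildRealtimeUrl` is a hypothetical helper, not an exported function, and URL-encoding the model name is an assumption.

```ts
const DEFAULT_REALTIME_URL = "wss://api.openai.com/v1/realtime";
const DEFAULT_REALTIME_MODEL = "gpt-4o-mini-realtime-preview-2024-12-17";

// Append the model as a query parameter to the base endpoint.
function buildRealtimeUrl(config: { url?: string; model?: string } = {}): string {
  const base = config.url ?? DEFAULT_REALTIME_URL;
  const model = config.model ?? DEFAULT_REALTIME_MODEL;
  return `${base}?model=${encodeURIComponent(model)}`;
}

const proxiedUrl = buildRealtimeUrl({ url: "wss://my-proxy.example.com/realtime" });
// "wss://my-proxy.example.com/realtime?model=gpt-4o-mini-realtime-preview-2024-12-17"
```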

## Configuring the Transcription Model

By default, the OpenAI Realtime provider transcribes incoming audio with `whisper-1`. Change the transcription model with the `transcriber` config option:

```ts
const voice = createOpenAIRealtimeVoice({
  transcriber: "gpt-4o-transcribe",
});
```

The transcriber is sent to the server in the initial `session.update` event, immediately after the connection opens. It sets the model for the `input_audio_transcription` session property, which controls how the realtime API transcribes user audio.
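
For orientation, the initial event might look roughly like the sketch below. The event shape follows OpenAI's `session.update` client event, and the defaults (`whisper-1`, `alloy`) are the ones stated in this page. The `buildInitialSessionUpdate` helper is hypothetical, and the real provider may send additional session fields.

```ts
// Hypothetical helper showing where the transcriber ends up in the session config.
function buildInitialSessionUpdate(config: { transcriber?: string; speaker?: string } = {}) {
  return {
    type: "session.update" as const,
    session: {
      input_audio_transcription: { model: config.transcriber ?? "whisper-1" },
      voice: config.speaker ?? "alloy",
    },
  };
}

const initialUpdate = buildInitialSessionUpdate({ transcriber: "gpt-4o-transcribe" });
// initialUpdate.session.input_audio_transcription.model === "gpt-4o-transcribe"
```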

## Audio Format Support

When calling `speak()`, you can request a specific audio format via the `format` option:

```ts
const audio = await voice.speak("Hello, world", { format: "opus" });
```

Supported formats:

| Format | Description |
| --- | --- |
| `mp3` | MPEG Layer 3 — widely compatible, lossy |
| `wav` | Waveform Audio — uncompressed, lossless |
| `pcm` | Raw PCM — no header, lowest overhead |
| `opus` | Opus codec — low latency, good for streaming |
| `flac` | Free Lossless Audio Codec |
| `aac` | Advanced Audio Coding — good compression |

Not every provider supports every format. If a provider does not support the requested format, it falls back to its own default. The `AudioFormat` type is exported from `smithers-orchestrator/voice` for type-safe usage.
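
The fallback rule can be sketched as a small pure function. The local `AudioFormat` type below mirrors the table and stands in for the exported one; `resolveFormat` and the per-provider `supported` list are hypothetical.

```ts
// Local stand-in for the exported AudioFormat type.
type AudioFormat = "mp3" | "wav" | "pcm" | "opus" | "flac" | "aac";

// Use the requested format if the provider supports it, else the provider default.
function resolveFormat(
  requested: AudioFormat | undefined,
  supported: readonly AudioFormat[],
  providerDefault: AudioFormat,
): AudioFormat {
  if (requested !== undefined && supported.indexOf(requested) !== -1) return requested;
  return providerDefault;
}

const fallbackFormat = resolveFormat("opus", ["mp3", "wav"], "mp3"); // "mp3": opus is unsupported here
const keptFormat = resolveFormat("wav", ["mp3", "wav"], "mp3");      // "wav"
```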

## Provider-Level Event Callbacks

Realtime voice providers emit events that you can subscribe to with `on()` and unsubscribe from with `off()`:

```ts
const handler = (data: unknown) => console.log(data);

voice.on("speaking", handler);   // audio output chunks
voice.on("writing", handler);    // text transcription chunks
voice.on("error", handler);      // provider errors
voice.on("speaker", handler);    // new audio output stream

voice.off("speaking", handler);  // remove a listener
```

| Event | Payload | When |
| --- | --- | --- |
| `speaking` | `{ audio, response_id }` | Each chunk of audio output from the model |
| `writing` | `{ text, role, response_id }` | Each chunk of text transcription |
| `error` | `{ message, code?, details? }` | A provider-level error occurred |
| `speaker` | `ReadableStream` | A new audio response stream was created |

These are provider-level events on the voice instance. They are separate from the Smithers event bus events (`VoiceStarted`, `VoiceFinished`, `VoiceError`), which track operation lifecycle at the workflow level.
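
One practical detail of `on()`/`off()` APIs in general is that removal usually works by function identity, so keep a reference to the handler you registered. The minimal emitter below illustrates that behavior; it is a generic sketch under that assumption, not the provider's implementation, and `VoiceEmitterSketch` is a hypothetical name.

```ts
type Handler = (data: unknown) => void;
type VoiceEventName = "speaking" | "writing" | "error" | "speaker";

class VoiceEmitterSketch {
  private handlers: Record<string, Handler[]> = {};

  on(event: VoiceEventName, handler: Handler): void {
    const list = this.handlers[event] ?? [];
    list.push(handler);
    this.handlers[event] = list;
  }

  // Removal is by identity: pass the same function that was given to on().
  off(event: VoiceEventName, handler: Handler): void {
    this.handlers[event] = (this.handlers[event] ?? []).filter((h) => h !== handler);
  }

  emit(event: VoiceEventName, data: unknown): void {
    for (const handler of this.handlers[event] ?? []) handler(data);
  }
}

const emitter = new VoiceEmitterSketch();
const received: unknown[] = [];
const record: Handler = (data) => {
  received.push(data);
};

emitter.on("writing", record);
emitter.emit("writing", { text: "hi", role: "assistant", response_id: "r1" });
emitter.off("writing", record);
emitter.emit("writing", { text: "ignored", role: "assistant", response_id: "r2" });
// received now holds only the first payload
```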

## Default Speaker Selection

If you don't specify a `speaker` prop on `<Voice>` or a `speaker` option in the provider config, the default depends on the provider:

- **OpenAI Realtime**: defaults to `"alloy"`
- **AI SDK Voice**: no Smithers-level default. If you don't pass a speaker via `SpeakOptions` or the provider config, the underlying model's default is used
- **Composite Voice**: delegates to whichever sub-provider handles the operation

You can override the speaker at three levels (highest priority first):

1. Per-call: `voice.speak("text", { speaker: "shimmer" })`
2. Per-component: `<Voice provider={voice} speaker="coral">`
3. Per-provider: `createOpenAIRealtimeVoice({ speaker: "echo" })`
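
The precedence order above can be expressed as a chain of fallbacks. `resolveSpeaker` is a hypothetical helper for illustration; a result of `undefined` corresponds to leaving the choice to the provider's or model's own default.

```ts
// Highest priority first: per-call, then per-component, then per-provider.
function resolveSpeaker(levels: {
  perCall?: string;
  perComponent?: string;
  perProvider?: string;
}): string | undefined {
  return levels.perCall ?? levels.perComponent ?? levels.perProvider;
}

const callWins = resolveSpeaker({ perCall: "shimmer", perComponent: "coral", perProvider: "echo" }); // "shimmer"
const componentWins = resolveSpeaker({ perComponent: "coral", perProvider: "echo" });                // "coral"
const providerWins = resolveSpeaker({ perProvider: "echo" });                                        // "echo"
```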

## OpenAI Realtime: API Key and Environment Fallback

The OpenAI Realtime provider resolves API keys in this order:

1. The `apiKey` config option passed to `createOpenAIRealtimeVoice()`
2. The `OPENAI_API_KEY` environment variable

```ts
// Explicit key
const explicitVoice = createOpenAIRealtimeVoice({ apiKey: "sk-..." });

// Or rely on the environment variable; no config needed.
// Uses process.env.OPENAI_API_KEY automatically.
const envVoice = createOpenAIRealtimeVoice();
```

If neither is set, `connect()` throws an error.
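
The resolution order can be sketched as follows. `resolveApiKey` and its error message are hypothetical; the environment is passed in explicitly here so the example stays self-contained, where real code would read `process.env`.

```ts
// Config option first, then the OPENAI_API_KEY environment variable, else an error.
function resolveApiKey(
  config: { apiKey?: string },
  env: Record<string, string | undefined>,
): string {
  const key = config.apiKey ?? env["OPENAI_API_KEY"];
  if (!key) throw new Error("No OpenAI API key: pass apiKey or set OPENAI_API_KEY");
  return key;
}

const fromConfig = resolveApiKey({ apiKey: "key-from-config" }, { OPENAI_API_KEY: "key-from-env" });
// "key-from-config"
const fromEnv = resolveApiKey({}, { OPENAI_API_KEY: "key-from-env" });
// "key-from-env"
```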

## OpenAI Realtime: Model Override

Override the realtime model with the `model` config option:

```ts
const voice = createOpenAIRealtimeVoice({
  model: "gpt-4o-realtime-preview",
});
```

The default is `gpt-4o-mini-realtime-preview-2024-12-17`. The model name is appended as a query parameter to the WebSocket URL.

## OpenAI Realtime: Session Management

The OpenAI Realtime provider manages WebSocket session lifecycle automatically:

1. **`connect()`** opens a WebSocket, waits for the `session.created` event, then sends an initial `session.update` to configure the transcription model and default voice.
2. While connected, any calls to `send()`, `speak()`, `listen()`, or `answer()` use the active session.
3. **`close()`** tears down the connection, cleans up speaker streams, and releases resources.

Messages sent before the session is ready are automatically queued and flushed once the connection opens. You don't need to wait for `session.created` yourself — `connect()` returns only after the session is fully initialized.

```ts
const voice = createOpenAIRealtimeVoice({ speaker: "coral" });

await voice.connect();    // waits for session.created + session.update
await voice.send(audio);  // uses the active session
voice.close();            // tears down cleanly
```

If you call `connect()` while already connected, it returns immediately. Concurrent calls to `connect()` are deduplicated — only one connection attempt runs at a time.
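
A common way to get this behavior is to cache the in-flight connection promise and buffer messages until the session is ready. The sketch below shows that pattern under stated assumptions: `SessionSketch` is a hypothetical class, a timer stands in for the WebSocket handshake, and messages are plain strings. It is not the provider's actual code.

```ts
class SessionSketch {
  private connected = false;
  private connecting: Promise<void> | undefined;
  private queue: string[] = [];
  readonly sent: string[] = [];
  openCount = 0;

  // Stand-in for opening a WebSocket and waiting for `session.created`.
  private async open(): Promise<void> {
    this.openCount += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
  }

  connect(): Promise<void> {
    if (this.connected) return Promise.resolve(); // already connected
    if (this.connecting) return this.connecting;  // share the in-flight attempt
    this.connecting = this.open().then(
      () => {
        this.connected = true;
        this.connecting = undefined;
        // Flush messages queued before the session was ready, in order.
        for (const message of this.queue) this.sent.push(message);
        this.queue = [];
      },
      (error) => {
        this.connecting = undefined; // allow a later retry
        throw error;
      },
    );
    return this.connecting;
  }

  send(message: string): void {
    if (this.connected) this.sent.push(message);
    else this.queue.push(message);
  }

  close(): void {
    this.connected = false;
    this.queue = [];
  }
}

async function demo(): Promise<{ sent: string[]; openCount: number }> {
  const session = new SessionSketch();
  session.send("early");                                     // queued: not connected yet
  await Promise.all([session.connect(), session.connect()]); // one connection attempt
  session.send("late");                                      // sent directly
  await session.connect();                                   // no-op: already connected
  return { sent: session.sent, openCount: session.openCount };
}
```

Running `demo()` opens the stand-in connection once and delivers `"early"` before `"late"`.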

## Events and Observability

Voice operations emit structured events:

- `VoiceStarted` -- a voice operation began (speak or listen)
- `VoiceFinished` -- it completed successfully
- `VoiceError` -- something went wrong

These flow through the same event bus as all other Smithers events. The `smithers.voice.operations_total` counter and `smithers.voice.duration_ms` histogram track volume and latency.