---
title: Data Model
description: How Smithers stores input payloads, task outputs, and internal workflow metadata.
---

Every database you have ever cursed at got that way for the same reason: someone mixed bookkeeping with business data. Audit timestamps crept into domain objects. Retry counters leaked into API responses. One day you open the schema and cannot tell what the system *does* from what the system *needs to run*.

Smithers refuses to let that happen.

It stores three kinds of data, and it keeps them apart on purpose:

1. the run input payload
2. task output rows
3. internal workflow metadata

Why does the separation matter? Because the question "what did the task produce?" and the question "how many attempts did it take?" have different audiences, different lifecycles, and no business sharing a table. You will thank this design the first time you query your outputs without wading through orchestration columns.

## Run Input

When you kick off a workflow, you hand it a payload. That payload is the *entire* context your workflow gets from the outside world:

```ts
const result = await runWorkflow(workflow, {
  input: { description: "Auth tokens expire silently" },
});
```

Think of it as the function argument for the whole run. Smithers persists it once, and every task in the workflow reads it through `ctx.input`. If the run crashes and resumes, the same input is still there -- unless you explicitly override it.

So what should go in `input`? Three things:

- user-supplied run context
- durable across resume
- available everywhere through `ctx.input`

Nothing more. If a value is produced *during* the run, it belongs in a task output, not in the input.

## Task Outputs

Here is where your domain data lives. Most Smithers workflows define output schemas up front with `createSmithers(...)`:

```tsx
const { Workflow, Task, smithers, outputs } = createSmithers({
  analysis: z.object({
    summary: z.string(),
    severity: z.enum(["low", "medium", "high"]),
  }),
});
```

Notice that the schema describes *your* data -- a summary string and a severity level. That is it. No run IDs, no iteration counters, no attempt numbers. You define the shape of the answer; Smithers handles everything else.

Behind the scenes, each schema key becomes a durable SQLite table. Smithers automatically:

- creates the SQLite table
- maps the schema key to a snake_case table name
- adds `runId`, `nodeId`, and `iteration` bookkeeping columns
- validates agent output before persisting it

Your prompt-facing schema stays clean:

```ts
z.object({
  summary: z.string(),
  severity: z.enum(["low", "medium", "high"]),
})
```

"Wait," you might be thinking, "if Smithers adds bookkeeping columns anyway, why can't I just add them myself?" You *can*. But then your LLM prompt includes fields it should never fill, your validation conflates domain rules with runtime plumbing, and your query results mix what the task *said* with how it got there. You do not need to add fields like:

- `runId`
- `nodeId`
- `iteration`
- `attempt`
- approval metadata

Smithers owns those. Let it.

## Identity of an Output Row

Here is a subtlety that trips people up. Two different tasks can write to the same output schema. The same task can write to it ten times inside a loop. So how does Smithers know which row is which?

The answer: output identity is not "table name only." Each row is keyed by:

- run id
- task id (`nodeId`)
- iteration when the task is inside a loop

That is why `ctx.output(...)`, `ctx.outputMaybe(...)`, and `ctx.latest(...)` all require both an output target and a `nodeId`. The table tells Smithers *where* to look. The node ID and iteration tell it *which row* you mean.

## Custom Drizzle Tables

Sometimes you already have a table, or you need a schema that Smithers cannot auto-generate. In that case, `<Task output={...}>` can point at a custom Drizzle table.

Fair warning: when you go this route, you take on responsibility that Smithers normally handles for you:

- creating and migrating the table
- including Smithers bookkeeping columns such as `runId` and `nodeId`
- including `iteration` in looped tasks
- optionally pairing the table with `outputSchema` for stricter validation

This is an escape hatch, not the default path. If `createSmithers(...)` can express your schema, use it.

## Internal Smithers Metadata

Open your database and you will see tables prefixed with `_smithers_`. Do not be alarmed. These are Smithers' own operational tables:

| Table | Purpose |
|-------|---------|
| `_smithers_runs` | One row per workflow run. Tracks status, heartbeat, VCS revision, and error. |
| `_smithers_nodes` | Current state of each task node within a run (pending, running, finished, failed). |
| `_smithers_attempts` | Every execution attempt for every node, including start/finish timestamps and error detail. |
| `_smithers_frames` | The rendered JSX tree at each commit boundary, stored as serialized XML. |
| `_smithers_approvals` | Approval requests and decisions for tasks gated by `<Approval>`. |
| `_smithers_human_requests` | Human-in-the-loop requests (form fills, confirmations) and their responses. |
| `_smithers_cache` | Cached task outputs keyed by workflow, node, schema signature, and agent signature. |
| `_smithers_sandboxes` | Sandbox session metadata for bubblewrap and container-based execution. |
| `_smithers_tool_calls` | Per-call log of every tool invocation: input, output, latency, and status. |
| `_smithers_events` | Sequential event journal for a run. Source of truth for all observable events. |
| `_smithers_ralph` | Loop (`<Loop>`) iteration counters and completion flags. |
| `_smithers_cron` | Cron schedule definitions, last-run and next-run timestamps. |
| `_smithers_scorers` | Scorer results for each task attempt: score, reason, and latency. |
| `_smithers_vectors` | RAG vector store: chunk text, embeddings (as BLOBs), and metadata. |
| `_smithers_signals` | Inbound signals received by waiting runs. |

This is the machinery that lets Smithers resume a crashed run, retry a failed task, or tell you exactly what happened at 3 a.m. It exists so your output tables never have to carry orchestration concerns.

## Table Schema Ensurance and Auto-Migration

Smithers calls `ensureSmithersTables()` at startup, which runs `CREATE TABLE IF NOT EXISTS` for every internal table. You never need to run migrations by hand for `_smithers_*` tables.

For your own output tables defined via `createSmithers(...)`, Smithers also auto-migrates columns. When the Drizzle schema defines a column that is missing from the SQLite table on disk, Smithers issues an `ALTER TABLE ... ADD COLUMN` statement to add it. Columns that exist in the database but are absent from the schema are left in place -- Smithers does not remove data.

This forward-only migration means you can add fields to an output schema and existing runs will continue to work. Removing a field or changing a column type requires a manual migration or a fresh database.

## Schema Signature Verification

Before persisting a cached task result, Smithers computes a schema signature for the output table. The signature is a SHA-256 hash of the table name and every column's name, type, nullability, and primary key flag, all sorted alphabetically:

```
sha256("tableName|colA:text:1:0|colB:integer:0:0|...")
```

This hash is stored as `schema_sig` in `_smithers_cache`. When a cached result is retrieved, Smithers checks that the current table's signature still matches. If the schema changed since caching, the cached entry is ignored and the task runs fresh. You never get silently stale cache hits after a schema migration.

## Transaction Model

`SmithersDb` uses a single-writer transaction model with a serial promise queue. Every write operation (including those outside an explicit transaction) acquires a turn in a `transactionTail` promise chain before proceeding. This serializes all writes even when multiple Effect fibers run concurrently.

Explicit transactions use `BEGIN IMMEDIATE` so SQLite acquires a write lock immediately, preventing lock contention with concurrent readers:

```ts
await adapter.withTransaction("my-write-group", effect);
```

Nested transactions from the same fiber are detected and rejected -- SQLite does not support true savepoints through this interface. The transaction depth counter and owner-thread tracking ensure the same fiber can perform multiple writes within a single transaction without re-acquiring the queue turn.

## Write Retry and Exponential Backoff

All write paths wrap the underlying operation with `withSqliteWriteRetryEffect`. When a write fails with `SQLITE_BUSY`, `SQLITE_IOERR`, "database is locked", or "disk i/o error", Smithers retries up to six times with exponential backoff:

- Base delay: 50 ms
- Maximum delay: 2,000 ms
- Jitter: ±25% of the computed delay
- Each retry increments the `smithers.db.retries` counter

After the maximum number of retries, the error propagates as a `DB_WRITE_FAILED` SmithersError. This makes Smithers resilient to transient WAL-mode lock contention without requiring any configuration.

## Frame Codec

Render frames in `_smithers_frames` are stored in one of three encodings:

| Encoding | When used | Description |
|----------|-----------|-------------|
| `full` | Frame 0 and any keyframe | Complete serialized XML of the render tree |
| `delta` | Frames between keyframes | JSON patch (set, insert, remove ops) relative to the previous frame |
| `keyframe` | Every 50th frame | Same as `full`; resets the delta chain |

The keyframe interval is 50 frames (`FRAME_KEYFRAME_INTERVAL = 50`). Reading an arbitrary frame requires loading the nearest preceding keyframe and applying all deltas up to the target frame number. An in-memory LRU cache (up to 512 entries) stores reconstructed frame XML so repeated reads of hot frames are free.

Delta encoding uses a structural diff algorithm that walks the XML JSON tree, emitting `set`, `insert`, and `remove` operations. It is node-ID-aware: when comparing adjacent objects in the tree, it uses the `id` prop of element nodes as a stable identity anchor, so reordered elements produce insert/remove pairs rather than spurious updates.

## Signal Persistence

Signals are external messages sent to a running workflow. When a signal arrives, Smithers writes it to `_smithers_signals` with an automatically allocated sequence number. You never pick the `seq` yourself -- Smithers computes `MAX(seq) + 1` inside a `BEGIN IMMEDIATE` transaction so two concurrent signals never collide.

Before inserting, the adapter checks whether an identical signal already exists (same `runId`, `signalName`, `correlationId`, `payloadJson`, `receivedAtMs`, and `receivedBy`). If a match is found, the existing `seq` is returned and no duplicate row is created. This deduplication prevents replay or retry from doubling signals.

### Signal Query Filters

Querying signals supports four filters, all optional:

| Filter | Column | Description |
|--------|--------|-------------|
| `signalName` | `signal_name` | Match a specific signal type |
| `correlationId` | `correlation_id` | Match a specific correlation key (supports `null`) |
| `receivedAfterMs` | `received_at_ms` | Only signals received at or after this timestamp |
| `limit` | -- | Max rows to return (default 200) |

Results are ordered by `seq ASC`, so you always see signals in arrival order.

## Event Persistence

The `_smithers_events` table is the durable event journal for each run. Every `SmithersEvent` emitted during execution is persisted here with a sequential `seq` number that serves as the total ordering.

### Auto-Sequence Allocation

Like signals, events get their `seq` via `SELECT COALESCE(MAX(seq), -1) + 1` inside a `BEGIN IMMEDIATE` transaction. This guarantees gap-free, monotonically increasing sequence numbers per run.

### Insert Deduplication

Before inserting, the adapter checks for an existing row matching the same `runId`, `timestampMs`, `type`, and `payloadJson`. If found, the existing `seq` is returned without creating a duplicate. This makes event insertion idempotent across retries.

### Event Queue and Flush

For performance, events can be enqueued asynchronously via `emitEventQueued`. The event is emitted to listeners and tracked immediately, but database and log-file persistence happens in a background promise chain (`persistTail`). Call `flush()` to await all queued persistence -- the engine does this at task boundaries and run completion to ensure nothing is lost.

### Sequence Start Override

The `EventBus` constructor accepts a `startSeq` option, which sets the initial sequence counter. This is used on resume to continue from where the previous run left off, preventing sequence number collisions with already-persisted events.

### Event History Queries

The adapter supports filtered history queries with these parameters:

| Filter | Description |
|--------|-------------|
| `afterSeq` | Return events with `seq > afterSeq` |
| `limit` | Max rows |
| `nodeId` | Filter by `$.nodeId` inside the payload JSON |
| `types` | Filter to specific event type strings |
| `sinceTimestampMs` | Events at or after this timestamp |

A separate `countEventHistory` method returns the count matching the same filters, useful for pagination.

## Human Request Persistence

Human requests (form fills, confirmations, free-text prompts) are stored in `_smithers_human_requests` with lifecycle states: `pending`, `answered`, `cancelled`, `expired`.

### Pending Inbox Query

`listPendingHumanRequests` returns all pending requests across all runs, joined with `_smithers_runs` and `_smithers_nodes` to include the `workflowName`, `runStatus`, and `nodeLabel`. Before returning, it automatically expires any requests whose `timeoutAtMs` has passed, transitioning them to `expired` status.

### Answer Persistence

`answerHumanRequest` sets the response JSON, timestamp, and optional `answeredBy` field, transitioning the request from `pending` to `answered`. Only pending requests can be answered -- the `WHERE status = 'pending'` clause prevents double-answering.

### Cancellation

`cancelHumanRequest` transitions a pending request to `cancelled`. Like answering, it only operates on requests in `pending` status.

## Cron Persistence

Cron schedules are stored in `_smithers_cron` and managed through the adapter:

| Operation | Method | Description |
|-----------|--------|-------------|
| Create/Update | `upsertCron` | Inserts or updates a cron schedule by `cronId` |
| List | `listCrons(enabledOnly?)` | Returns all cron entries, optionally filtering to `enabled = true` |
| Track execution | `updateCronRunTime` | Updates `lastRunAtMs`, `nextRunAtMs`, and optional `errorJson` |
| Delete | `deleteCron` | Removes a cron entry by ID |

The `enabled` flag allows disabling a schedule without deleting it. The `lastRunAtMs` and `nextRunAtMs` columns let the scheduler know when to fire next without recomputing from the cron pattern on every poll. If a scheduled run fails, the error is stored in `errorJson` on the cron row for diagnostics.

## Run Lifecycle Management

### Stale Run Claims

The supervisor detects stale runs by querying `_smithers_runs` for rows with `status = 'running'` whose `heartbeat_at_ms` is older than the stale threshold (default 30 seconds). To safely resume a stale run without races, the supervisor uses a compare-and-swap pattern:

1. **Claim**: `claimRunForResume` atomically sets `runtime_owner_id` and `heartbeat_at_ms` only if the current values match the expected stale state. The `WHERE` clause checks `runtime_owner_id`, `heartbeat_at_ms`, and the stale threshold in a single `UPDATE`, and returns whether the row was modified.

2. **Release**: If the supervisor decides not to resume after claiming, `releaseRunResumeClaim` restores the original `runtime_owner_id` and `heartbeat_at_ms`, but only if the claim is still held (the current `runtime_owner_id` matches the claimer).

This two-phase claim prevents two supervisor instances from resuming the same stale run simultaneously.

### Sandbox Tracking

Sandbox sessions (bubblewrap, Docker, or Codeplane) are tracked in `_smithers_sandboxes`. The adapter upserts sandbox rows keyed by `(runId, sandboxId)`, recording runtime type, configuration, status, shipping and completion timestamps, and bundle paths.

## Output Edge Cases

### Payload-Only Tables

When an output table's only non-bookkeeping column is `payload`, Smithers detects it and wraps the entire agent output into that single column instead of spreading fields across multiple columns. This is useful for unstructured or polymorphic outputs where a fixed column set does not make sense.

### Boolean Column Coercion

Bun's SQLite driver returns raw `0`/`1` integers for columns declared with `{ mode: "boolean" }` in Drizzle. When loading output snapshots, Smithers detects these columns by inspecting the Drizzle table metadata and coerces the integer values to proper JavaScript booleans. Without this, strict equality checks like `value === true` would fail.

### Schema Key Aliasing

When loading outputs via `loadOutputs`, each result set is stored under **both** the schema key (e.g., `"analysis"`) and the actual SQLite table name (e.g., `"analysis"` or a custom name). This dual indexing lets downstream code reference outputs by either name, which matters when schema keys and table names diverge (e.g., with custom Drizzle tables).

## Snapshot Persistence

Loading a complete workflow snapshot (`loadInput` + `loadOutputs`) reconstructs the full `ctx` state from SQLite. The input is loaded by filtering the input table for the current `runId`. Outputs are loaded by iterating every schema key, querying each table for rows matching the `runId`, applying boolean coercion, and indexing under both schema key and table name.

This snapshot is what powers resume: when a crashed run restarts, the snapshot populates `ctx` so the JSX tree renders with all completed outputs already in place.

## Transaction Internals

### Read Gating

Reads, not just writes, also acquire a turn in the `transactionTail` promise queue. This prevents reads from seeing intermediate state during a multi-statement transaction. If the current fiber already owns the active transaction, reads proceed immediately without acquiring a new turn.

### Commit Retry

The entire `withTransaction` call is wrapped in `withSqliteWriteRetryEffect`. If the `COMMIT` (or `BEGIN IMMEDIATE`) fails with `SQLITE_BUSY` or an I/O error, the retry mechanism rolls back and retries the full transaction from `BEGIN`, using the same exponential backoff as standalone writes.

## Why the Separation Matters

Ask two questions about any completed task:

Your workflow output answers: *what did this task produce?*

Smithers metadata answers:

- when did it run?
- how many attempts did it take?
- was it cached?
- did it wait for approval?
- which loop iteration produced it?

These are fundamentally different concerns. Mixing them is like storing a book's page count in the same field as its ISBN -- technically possible, obviously wrong. Keep them apart and both stay easy to reason about.

## Schema Changes

Changing a Zod output schema is not just a prompt tweak. It is a persistence change. The table on disk has to match the schema in code.

Typical examples:

- adding a field
- removing a field
- changing a field type
- tightening validation rules

In hot-reload mode, Smithers blocks these changes and requires a restart so output resolution stays deterministic. This is deliberate friction -- it forces you to think about the migration before the data gets inconsistent.

If you use custom Drizzle tables, you must manage those migrations yourself.

## Direct Queries

Smithers does not hide SQLite from you. The database is right there. Open it, poke around, write queries.

Use output tables when you care about business results.
Use `_smithers_*` tables when you care about execution history.

This is one of the advantages of keeping the layers separate: you can hand your output tables to an analyst who has never heard of Smithers, and the data makes sense on its own.

## Mental Model

When in doubt, apply this rule of thumb:

- `ctx.input` is run-scoped input
- output tables hold validated task results
- `_smithers_*` tables hold orchestration state

If a field only exists to help the runtime schedule or resume work, it belongs in Smithers metadata, not in your domain schema. If a field describes what the task actually produced, it belongs in an output table, not in `_smithers_*`. The line is clean. Keep it that way.

## Next Steps

- [Execution Model](/concepts/execution-model) -- See how these tables participate in render, scheduling, and resume.
- [Structured Output](/guides/structured-output) -- Validation and persistence details for task outputs.
- [Debugging](/guides/debugging) -- Query the internal tables directly when a run behaves unexpectedly.
