# @adia-ai/a2ui-corpus

Corpus and operational-learning artifacts for the gen-UI pipeline —
**chunks** (the canonical retrieval surface), feedback, gaps, eval
fixtures. Pure data plus the scripts that maintain it. No runtime.

> The pipeline reads this package; this package never reads the pipeline.
> See [`@adia-ai/a2ui-compose`](../compose) for engine code,
> [`@adia-ai/web-components`](../../web-components) for UI atoms,
> [`@adia-ai/a2ui-runtime`](../runtime) for the A2UI runtime (renderer,
> registry, streams, wiring),
> [`@adia-ai/a2ui-mcp`](../mcp) for the MCP server.

## Install

```bash
npm install @adia-ai/a2ui-corpus
```

Pure data — typically consumed transitively by [`@adia-ai/a2ui-compose`](../compose) and [`@adia-ai/a2ui-mcp`](../mcp), which list it as a runtime dependency. Direct installs are useful for offline eval tooling or building bespoke retrieval layers.

## Dependency direction

```
a2ui-compose   ──reads──▶  a2ui-corpus
a2ui-mcp       ──reads──▶  a2ui-compose, a2ui-corpus
web-components ──used-by──▶  apps/, playgrounds/, catalog/ (chunk sources)
```

No back-writes. No circular reads. Web-components ships UI atoms only;
corpus lives here; runtime lives in a2ui-compose. The chunk pipeline
harvests `data-chunk`-tagged regions from the apps that consume
web-components, so the read-direction stays one-way.

## Glossary

The corpus has converged on **one** retrievable concept: the **chunk**.

| Term | Source of truth | Granularity | What it carries | Engine that consumes it |
| --- | --- | --- | --- | --- |
| **chunk** | `chunks/<id>.json` (one file per chunk) + `_index.json` | A single labeled HTML region, harvested via `data-chunk` markers | `html`, `intent`, `domain`, `kind` (`block` / `page` / `panel`), `keywords`, `metadata` (when annotated), `template` (transpiled A2UI tree, when annotated) | `chunk-zettel` + `monolithic-pro` + `zettel` (composition-library wraps annotated chunks) |

**Annotated vs raw chunks**: every chunk has `source`/`page` provenance,
but only chunks carrying `data-chunk-{domain,description,keywords,kind}`
attributes on their source HTML become retrievable as compositions
(the harvester's transpile pass produces a `template` + lifts the
`metadata` block onto them). Raw chunks remain as substrate for
`nested-expand` reference resolution but don't compete in retrieval.

**Grounding rule (locked v0.4.6, enforced v0.4.7):** every retrievable
unit MUST trace to a real page under `site/pages/`, `apps/`,
`playgrounds/`, or `catalog/` via its `data-chunk-*` annotations. No
hand-authored ungrounded JSON lives in `corpus/`.

> **Historical note:** *Fragments* (atomic A2UI sub-trees with named
> slots, §37, 2026-05-12), *patterns* (hand-authored full-canvas A2UI
> templates, `patterns/` dir), and *compositions* (hand-authored
> multi-section A2UI surfaces, `compositions/` dir) were earlier corpus
> formats. Fragments retired in v0.4.4 (§37). Patterns + compositions
> retired in v0.4.7 (§72 — the v0.4.6 carryover of §65). Their
> retrievable equivalents were either (a) already represented in the
> chunks corpus via annotated `data-chunk-*` regions, or (b) deemed
> non-grounded under the locked rule and DELETE'd in §73 Mode-C triage
> (target for v0.4.7). See [`docs/journal/2026/05/2026-05-12.md`](../../../docs/journal/2026/05/2026-05-12.md)
> §§ 36-§42, §65, §72 for the multi-arc retirement narrative.

## Layout

```
a2ui/corpus/
├── chunks/                  retrievable + raw chunks (~190 entries)
│                              — one JSON per chunk; carries `source`,
│                              `metadata` (when annotated), `template`
│                              (transpiled A2UI tree, when annotated).
│                              Harvested by `scripts/build/harvest-chunks.mjs`
│                              from data-chunk markers across site/pages/*,
│                              apps/*, playgrounds/*, catalog/*.
│   └── _index.json            harvester output — name + by-kind tallies +
│                              normalized chunk list (shape consumed by
│                              chunk-loader + composition-library)
│
├── evals/                   held-out.jsonl + eval fixtures
├── feedback/                daily JSONL — user feedback events
├── gaps/                    gap registry — prompts with missing coverage
├── scripts/                 maintenance tooling (extract, ingest, feedback, ticket)
│   └── chunk-library.js       in-memory loader + keyword/semantic search over chunks/
│
├── catalog-a2ui_0_9.json       aggregated artifact — what a2ui-compose + a2ui-mcp read
├── catalog-a2ui_0_9_rules.txt  natural-language composition rules (per component)
├── common_types.json           shared A2UI type shapes
├── chunk-embeddings.json       pre-computed embeddings for chunk semantic search (0.0.4+)
├── functions.json              declarative wiring-engine function catalog
├── manifest.json               extraction metadata (what / when / counts)
├── pattern-specs.md            written specs for each pattern category (historical reference)
└── data-flow.md                how the signal sources feed the pipeline
```

## What's committed vs generated vs published

| Kind                       | Committed | Published to npm | Source of truth                        |
|----------------------------|:---------:|:----------------:|----------------------------------------|
| Chunks (`chunks/`)         | ✓         | ✓                | `scripts/build/harvest-chunks.mjs` (from `data-chunk` markers across site/pages, apps, playgrounds, catalog) |
| `catalog-a2ui_0_9.json`    | ✓         | ✓                | `npm run components` (assembled from yamls in `packages/web-components/`) |
| `chunk-embeddings.json`    | ✓         | **✗ since 0.2.1**| `scripts/build/embeddings-chunks.mjs`  |
| Feedback JSONL             | ✓         | ✓                | Written by `@adia-ai/a2ui-retrieval` at runtime |
| Gap registry               | ✓         | ✓                | Written by `@adia-ai/a2ui-retrieval` at runtime |

Extracted artifacts are committed for convenience (avoids a build step to
read the pipeline), but the scripts are authoritative — regenerate via
`npm run harvest:chunks` (chunks), `npm run components` (catalog), or
`npm run build:embeddings:chunks` (chunk embeddings) if anything drifts.

### Why embeddings ship via git, not npm

`chunk-embeddings.json` (~20 MB) is **committed in git** so the monorepo's
own pipeline runs without a network round-trip, but **excluded from the
published npm tarball** — every `npm i @adia-ai/a2ui-corpus` was pulling
~20 MB of pre-computed float arrays that consumers had no reliable way
to address (the `chunk-embedding-retriever` resolves them via a
relative-path that breaks under a `node_modules/@adia-ai/a2ui-corpus/`
install layout).

> The companion `pattern-embeddings.json` retired in v0.4.7 §72 along
> with the `patterns/` source directory. Its only consumer was
> `concept-mapper.js` (dead post-v0.4.6 §64 retirement of
> `pattern-library.js`). The pattern-embeddings build script
> (`scripts/build/embeddings.mjs`) and the matching
> `embedding-retriever.js` were retired in the same arc.

Consumers who want embedding-based retrieval either:

1. Regenerate locally — `npm run build:embeddings:chunks` produces the
   `chunk-embeddings.json` file in your `node_modules/@adia-ai/a2ui-corpus/`
   checkout. Requires API access to your embedding provider.
2. Use the keyword-only fallback — `chunk-library.searchChunks()` works
   without embeddings; the embedding-aware `searchChunksAsync` path
   falls through to keyword scoring when the index file is absent
   (`chunk-embedding-retriever.js` returns `null` gracefully).

### Embedding model pinning

The `provider` and `model` recorded in each `*-embeddings.json` header
are the **source of truth** at query time. The retrievers
(`chunk-embedding-retriever.js`, `embedding-retriever.js`) re-resolve
the same embedder from those header fields — they do **not** auto-pick
a different provider when the recorded one's API key is unset, because
cross-model cosine similarity is meaningless and same-provider/
different-model emits different-dim vectors that `cosine()`
short-circuits to 0 (silent retrieval failure).

Currently pinned defaults:

| Provider | Model | Dims | Env |
|---|---|---:|---|
| openai | `text-embedding-3-small` | 1536 | `OPENAI_API_KEY` |
| voyage | `voyage-3-lite` | 1024 | `VOYAGE_API_KEY` |

`detectProvider()` (in `packages/a2ui/retrieval/embedding/embedding-provider.js`)
prefers Voyage when both keys are present (denser vectors, lower cost).
The `build:embeddings*` scripts record the chosen provider/model into the
`.json` header, so subsequent reads always re-bind to the same model.

When upgrading to a new model (e.g. `text-embedding-3-small` →
`text-embedding-3-large`):

1. Update the default in `embedding-provider.js`.
2. Rebuild the chunk index (`npm run build:embeddings:chunks`).
3. Verify with `npm run check:embeddings-fresh` that both index headers
   record the new model.
4. Re-run `npm run eval:diff -- --engine zettel` to confirm the new
   model doesn't regress retrieval quality (different models score
   queries differently — thresholds in `chunk-synthesizer.js` may
   need a re-look, though the absolute keyword score floor remains
   independent).

Don't mix models. If one index records `voyage-3-lite` and the other
records `text-embedding-3-small`, the retrievers will load both fine
but the rankings will be incomparable across the two corpora.

## Scripts

All run from repo root via npm:

```bash
npm run harvest:chunks       # full re-harvest of chunks/ from data-chunk markers
npm run components           # regenerate v0.9 sidecars + assemble catalog-a2ui_0_9.json
npm run components -- --verify   # fail if catalog/sidecars are stale vs yamls

npm run feedback:report      # human-readable feedback digest
npm run feedback:promote     # promote high-confidence feedback → new training data

npm run ticket               # open ticket tracker
npm run ticket:list          # list open tickets
npm run ticket:create        # create a ticket against corpus/pipeline
```

Script inventory (`scripts/`):

| Script                    | Purpose                                              |
|---------------------------|------------------------------------------------------|
| `chunk-library.js`        | In-memory loader + keyword/semantic search over `chunks/` |
| `feedback-report.js`      | Aggregates feedback JSONL into a readable digest     |
| `feedback-promote.js`     | Moves high-confidence feedback into training data    |
| `ticket.mjs`              | Corpus/pipeline issue tracker                        |

> **Retired in v0.4.7 §72** (with the `patterns/` + `compositions/` dirs):
> `extract.js`, `ingest.js`, `run-pipeline.mjs`, `build-pattern-index.mjs`.
> These fed the pattern-library retrieval surface (`pattern-library.js`,
> retired v0.4.6 §64) and the exemplar→chunks pipeline (retired v0.4.4
> §36). Pattern embeddings (`scripts/build/embeddings.mjs`) and the
> grounded-corpus triage audit (`scripts/audit/grounded-corpus-triage.mjs`)
> also retired in the same arc — the corpus is now one-format
> (chunks-only) and harvester-driven.

Repo-side build scripts (not in tarball; run from the workspace root):

| Script                              | Purpose                                                  |
|-------------------------------------|----------------------------------------------------------|
| `npm run harvest:chunks`            | Walks site/pages/*, apps/*, playgrounds/*, catalog/*, harvests every `[data-chunk]` element, writes `chunks/<name>.json` + `_index.json` |
| `npm run build:embeddings:chunks`   | Generates `chunk-embeddings.json` (~190 chunks × 1536d)  |

## Exports

```javascript
// Catalog — the aggregated read-target for engines.
// Carries per-component aliases under `components[name].x-adiaui.synonyms.tags`.
import catalog from '@adia-ai/a2ui-corpus';

// Chunk corpus (since 0.0.3 / 0.0.4)
import chunkIndex from '@adia-ai/a2ui-corpus/chunks';   // _index.json (metadata only)
import { searchChunks, searchChunksAsync, getChunk }
  from '@adia-ai/a2ui-corpus/chunk-library';            // in-memory query API
```

## Authoring order — demo page → `data-chunk` marker → training

When adding coverage for a new intent:

1. **Live demo page** — author the HTML in
   `apps/<name>/app/<demo>/<demo>.contents.html` (or under
   `playgrounds/` / `catalog/`) using pure primitive composition. See
   repo-root `AGENTS.md`.
2. **Tag the reusable region** — add `data-chunk="<slug>"` +
   `data-chunk-kind="<kind>"` (block / page / panel / field) on the
   element. The harvester extracts the bounding HTML on the next
   build. See `docs/specs/genui-chunk-marker.md` for the marker
   convention.
3. **Harvest + ingest** — `npm run harvest:chunks` writes
   `chunks/<slug>.json` and refreshes `_index.json`. `npm run pipeline`
   does the full extract → ingest → catalog refresh.
4. **Verify** — `npm run eval:diff -- --engine zettel` should still
   hold coverage ≥ 83%, avgScore ≥ 88 (per the regression floors in
   `AGENTS.md`).

See `data-flow.md` for the full pipeline (chunks → feedback).

## Regression floors

The pipeline must hold these thresholds — tracked in the held-out
benchmark:

- **Fragment reuse ratio ≥ 29.9%** — 167 refs / 559 composition nodes
- **Zettel**: coverage 100%, avgScore ≥ 88, MRR ≥ 0.94
- **Monolithic**: coverage 100%, avgScore ≥ 95
- **Dogfood**: 20/20 intents at avg ≥ 95

## What this package does NOT contain

- Pipeline runtime — `gen-ui/`
- UI custom elements — `web-components/`
- MCP transport — `gen-ui-mcp/`
- Site / playground UI — `/site/`

If a file here is `.js` / `.mjs`, it's a maintenance script, not runtime.
Runtime readers go through `gen-ui/retrieval/*`.

## License

MIT
