# ADR 0003 — Streaming-First Architecture via Lazy Lexer

| Field | Value |
|---|---|
| **Status** | Accepted |
| **Date** | 2026-03-04 (v1.1.0) |
| **Deciders** | Core team |
| **Supersedes** | v1.0 (regex-based parser) |

---

## Context

The v1.0 parser combined regex-based tokenization with recursive descent. It had four problems:

1. It loaded the entire document into memory before producing any output
2. Regex-based tokenization produced incorrect positions (line/col) and was fragile on edge cases (CDATA sections, processing instructions, nested quotes)
3. The recursive descent caused stack overflow on deeply nested documents (> ~500 levels)
4. It was impossible to add a SAX API without rewriting the core

We needed a new foundation that could support:
- A DOM-building API (full document in memory)
- A SAX/event streaming API (O(depth) memory, no DOM)
- Future streaming validation (O(1) memory, state-machine driven)
- Precise line/col tracking for error messages
- Security limits enforced at the token level (before DOM allocation)

---

## Decision

**Use a lazy-generator token stream from a single-pass state-machine lexer as the foundation for all parsing APIs.**

The architecture has three layers:

### Layer 1 — `XmlLexer` (the single source of truth)

A single-pass state machine that reads the input character-by-character and yields `Token` objects via a JavaScript generator:

```ts
class XmlLexer {
  *tokenizeStream(): Generator<Token, void, unknown> {
    // state machine — no regex on the hot path
    // yields: OPEN_TAG, ATTR_NAME, ATTR_VALUE, TAG_END, TAG_SELF_CLOSE,
    //         CLOSE_TAG, TEXT, CDATA, COMMENT, PI, DOCTYPE, XML_DECL, EOF
  }
}
```

Each token carries `{ type, value, line, col }`. Line/col tracking is exact because the lexer counts them character-by-character.
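As a minimal sketch of character-by-character position counting, the snippet below walks an input string and reports where a substring first starts. `Position`, `PositionTracker`, and `positionOf` are hypothetical names used for illustration only; 1-based numbering and treating only `\n` as a line break are assumptions, not documented behavior of `XmlLexer`.

```ts
// Illustrative only: `PositionTracker` and `positionOf` are hypothetical names.
interface Position {
  line: number; // 1-based (assumption)
  col: number;  // 1-based, counted in UTF-16 code units (assumption)
}

class PositionTracker {
  private line = 1;
  private col = 1;

  // Position of the next unread character.
  current(): Position {
    return { line: this.line, col: this.col };
  }

  // Advance past one character; only "\n" starts a new line in this sketch.
  advance(ch: string): void {
    if (ch === "\n") {
      this.line += 1;
      this.col = 1;
    } else {
      this.col += 1;
    }
  }
}

// Walk the input one character at a time and report where `needle` first starts.
function positionOf(input: string, needle: string): Position | undefined {
  const tracker = new PositionTracker();
  for (let i = 0; i < input.length; i++) {
    if (input.startsWith(needle, i)) {
      return tracker.current();
    }
    tracker.advance(input.charAt(i));
  }
  return undefined;
}
```

Because the position is updated on every character consumed, a token can record `current()` at the moment its first character is read, with no separate offset-to-line lookup afterwards.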

Security limits (`maxDepth`, `maxAttributes`, `maxTextLength`, `maxNodeCount`) are checked in the lexer, before any downstream allocation.
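The sketch below illustrates the idea of rejecting input at the token level. For brevity it expresses the checks as a generator stage wrapped around a token iterable, whereas this ADR places them inside the lexer itself. `enforceLimits`, `Limits`, and `limitError` are hypothetical names, and counting only elements toward `maxNodeCount` is an assumption.

```ts
// Illustrative only: the real checks live inside the lexer's state machine.
type TokenType =
  | "OPEN_TAG" | "ATTR_NAME" | "ATTR_VALUE" | "TAG_END"
  | "TAG_SELF_CLOSE" | "CLOSE_TAG" | "TEXT" | "EOF";

interface Token {
  type: TokenType;
  value: string;
  line: number;
  col: number;
}

interface Limits {
  maxDepth: number;
  maxAttributes: number;
  maxTextLength: number;
  maxNodeCount: number;
}

function limitError(limit: keyof Limits, tok: Token): Error {
  return new Error(`${limit} exceeded at ${tok.line}:${tok.col}`);
}

// Re-yields tokens unchanged, throwing before the offending token reaches any consumer.
function* enforceLimits(
  tokens: Iterable<Token>,
  limits: Limits,
): Generator<Token, void, unknown> {
  let depth = 0;
  let attrs = 0;
  let nodes = 0; // assumption: only elements count toward maxNodeCount
  for (const tok of tokens) {
    switch (tok.type) {
      case "OPEN_TAG":
        depth += 1;
        nodes += 1;
        attrs = 0; // attribute count restarts for each element
        if (depth > limits.maxDepth) throw limitError("maxDepth", tok);
        if (nodes > limits.maxNodeCount) throw limitError("maxNodeCount", tok);
        break;
      case "ATTR_NAME":
        attrs += 1;
        if (attrs > limits.maxAttributes) throw limitError("maxAttributes", tok);
        break;
      case "TEXT":
        if (tok.value.length > limits.maxTextLength) throw limitError("maxTextLength", tok);
        break;
      case "TAG_SELF_CLOSE":
      case "CLOSE_TAG":
        depth -= 1;
        break;
    }
    yield tok;
  }
}
```

Since the check runs before `yield`, a DOM builder or SAX consumer downstream never sees the token that violates a limit and so never allocates for it.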

### Layer 2 — DOM builder or SAX layer (consumers of the token stream)

**`XmlParser`** (DOM): Consumes the entire token stream via a stack-based loop, building an `XmlDocument` tree. O(nodes) memory.

**`SaxParser`** (SAX): Consumes the token stream in a lazy generator, yielding `SaxEvent` objects. O(depth) memory — only the current nesting stack is held.

Both share the exact same lexer. No code duplication.
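To make the shared-stream idea concrete, here is a simplified sketch of two consumers reading the same kind of token iterable. It is not the actual `XmlParser` or `SaxParser` code: `buildDom`, `saxEvents`, `demoTokens`, `ElementNode`, and the event shapes are hypothetical, and tokens are reduced to three types with no attributes or positions.

```ts
// Illustrative only: a reduced token model and two hypothetical consumers.
interface Token {
  type: "OPEN_TAG" | "CLOSE_TAG" | "TEXT";
  value: string;
}

interface ElementNode {
  name: string;
  children: Array<ElementNode | string>;
}

type SaxEvent =
  | { kind: "startElement"; name: string; depth: number }
  | { kind: "endElement"; name: string; depth: number }
  | { kind: "text"; value: string };

// Stand-in for the lexer: the tokens of <a><b>hi</b></a>.
function* demoTokens(): Generator<Token, void, unknown> {
  yield { type: "OPEN_TAG", value: "a" };
  yield { type: "OPEN_TAG", value: "b" };
  yield { type: "TEXT", value: "hi" };
  yield { type: "CLOSE_TAG", value: "b" };
  yield { type: "CLOSE_TAG", value: "a" };
}

// DOM path: an explicit stack instead of recursion; keeps every node.
function buildDom(tokens: Iterable<Token>): ElementNode {
  const root: ElementNode = { name: "#document", children: [] };
  const stack: ElementNode[] = [root];
  for (const tok of tokens) {
    const parent = stack[stack.length - 1];
    if (parent === undefined) throw new Error("element stack is empty");
    if (tok.type === "OPEN_TAG") {
      const el: ElementNode = { name: tok.value, children: [] };
      parent.children.push(el);
      stack.push(el);
    } else if (tok.type === "CLOSE_TAG") {
      if (stack.length === 1 || parent.name !== tok.value) {
        throw new Error(`unexpected </${tok.value}>`);
      }
      stack.pop();
    } else {
      parent.children.push(tok.value);
    }
  }
  return root;
}

// SAX path: keeps only the names of currently open elements.
function* saxEvents(tokens: Iterable<Token>): Generator<SaxEvent, void, unknown> {
  const open: string[] = [];
  for (const tok of tokens) {
    if (tok.type === "OPEN_TAG") {
      open.push(tok.value);
      yield { kind: "startElement", name: tok.value, depth: open.length };
    } else if (tok.type === "CLOSE_TAG") {
      const name = open.pop();
      if (name !== tok.value) throw new Error(`unexpected </${tok.value}>`);
      yield { kind: "endElement", name: tok.value, depth: open.length + 1 };
    } else {
      yield { kind: "text", value: tok.value };
    }
  }
}
```

The only state `saxEvents` retains is the `open` array, which grows with nesting depth rather than document size, while `buildDom` retains every node it creates.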

### Layer 3 — Higher-level consumers

`SaxInstrumentation`, `XmlStreamParser`, future streaming validators — all consume either the SAX layer or the DOM layer.

### Why a generator?

JavaScript generators are lazy — tokens are only produced when consumed. This means:
- The SAX path never allocates a DOM node
- The lexer can be stopped early (e.g. on a parse error) without wasted work
- Memory usage is proportional to what the consumer needs
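The laziness described above can be observed directly. In this sketch a counter records how many items the generator has actually produced; `makeLazySource` and `firstMatching` are hypothetical helpers, and plain strings stand in for `Token` objects.

```ts
// Illustrative only: strings stand in for tokens.
function makeLazySource(items: readonly string[]) {
  let produced = 0;
  function* tokens(): Generator<string, void, unknown> {
    for (const item of items) {
      produced += 1; // runs only when the consumer asks for the next item
      yield item;
    }
  }
  return { tokens, produced: () => produced };
}

// Stops pulling as soon as the wanted item is seen.
function firstMatching(tokens: Iterable<string>, wanted: string): string | undefined {
  for (const t of tokens) {
    if (t === wanted) return t; // leaving the loop also closes the generator
  }
  return undefined;
}

const source = makeLazySource(["a", "b", "c", "d"]);
const stream = source.tokens();   // nothing has been produced yet
const hit = firstMatching(stream, "b");
const producedSoFar = source.produced(); // 2: "c" and "d" were never produced
```

Exiting a `for...of` loop early calls the generator's `return()` method, so the producer stops at its current `yield` and any `finally` blocks inside it still run.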

---

## Consequences

### Positive

- **Single tokenizer** — one implementation, all APIs benefit from the same fixes
- **Exact positions** — every token has correct line/col; source-mapped errors are trivial
- **O(depth) SAX memory** — the SAX path never allocates more than the current nesting depth worth of state
- **Security limits at layer 1** — limits are enforced during tokenization, before any DOM node is allocated, so input cannot be crafted to trip a limit only after the tree has been built
- **Streaming-ready** — v2.0 streaming validation will plug directly into the SAX layer
- **Testability** — `XmlLexer` is independently testable; `XmlParser` and `SaxParser` are independently testable

### Negative

- **Generator overhead** — `function*` generators add a small per-token cost compared with a direct loop or callback. Benchmarks put it at roughly 5% versus a direct loop, which we consider negligible.
- **No early exit for DOM** — DOM building must consume the token stream to completion; the DOM path stops early only on a parse error or a security-limit violation.

---

## Alternatives Considered

| Alternative | Rejected Because |
|---|---|
| Callback-based lexer (SAX-style from the start) | Callbacks cannot be lazily consumed; no pull-parser API |
| Regex-based tokenizer | Fragile; wrong positions; quadratic on certain inputs |
| Recursive-descent parser | Stack overflow on deep nesting; hard to add SAX path |
| Separate lexers for DOM and SAX | Code duplication; divergent behavior bugs |
| Pull-parser only (no DOM) | DOM required for XPath, validation, serialization |