# ADR 0002 — DFA-Style Content Model Validation

| Field | Value |
|---|---|
| **Status** | Accepted (lite implementation); Full DFA planned for v2.0 |
| **Date** | 2026-04-14 (v1.4.0 — SchemaCompilerLite) |
| **Deciders** | Core team |
| **Supersedes** | — |

---

## Context

XSD content models (the `xs:sequence`, `xs:choice`, `xs:all`, `xs:group` tree with occurrence constraints) define what sequences of child elements are valid for a given complex type. There are several approaches to evaluating a content model against a document element's children:

### Approach 1: Recursive backtracking

Walk the compositor tree recursively, trying each branch. On failure, backtrack and try the next alternative.

- **Pro:** Simple to implement; handles ambiguous grammars
- **Con:** Exponential worst-case (O(2^n) for deeply nested choices); unbounded stack depth for deeply nested schemas

### Approach 2: NFA (Non-deterministic Finite Automaton)

Convert the content model to an NFA (Thompson construction) and evaluate with ε-closure.

- **Pro:** Polynomial time; correct for all unambiguous grammars
- **Con:** NFA construction adds ~3× code complexity; stateful ε-closure tracking needed per-element

### Approach 3: DFA (Deterministic Finite Automaton)

Convert the NFA to a DFA (subset construction). Each state has exactly one transition per element name.

- **Pro:** O(n) per element validation; no backtracking; streaming-safe
- **Con:** Exponential state blowup in the worst case (rarely reached in practice); upfront compilation cost

### Approach 4: Single-pass compositor walk (current — v1.4–v1.6)

Walk the compositor tree once, left-to-right, tracking position. On mismatch, emit an error and optionally continue (recover mode). No backtracking.

- **Pro:** Simple; O(n) per element; easy to implement recovery mode
- **Con:** Cannot handle ambiguous grammars perfectly; not streaming-safe without a state snapshot

---

## Decision

### v1.4–v1.6: Single-pass compositor walk with recovery mode

The current `ValidationEngine` uses a single-pass left-to-right walk of the compositor tree. `SchemaCompilerLite` resolves type references and flattens extension chains before validation begins, so the compositor tree is fully normalized.

This approach handles ~78% of real-world XSD content models correctly. The remaining ~22% (highly ambiguous grammars, `xs:all` with complex occurrence constraints, deeply nested choice trees) may produce false positives or imprecise error messages.

`ValidationOptions.recover: true` allows the engine to continue past structural errors, collecting as many issues as possible. This is required for IDE integrations (G15).

### v2.0: Full DFA compilation

In v2.0, `SchemaCompilerLite` will be replaced by a full DFA compiler:

1. Convert each content model to an NFA (Thompson construction from the compositor tree)
2. Determinize the NFA to a DFA (subset construction with state caching)
3. Minimize the DFA (Hopcroft's algorithm)
4. Cache the DFA per complex type in `CompiledSchema`

The DFA enables:
- O(n) per element validation (no backtracking, no stack)
- True streaming validation (G1) — each SAX event drives a state transition
- Precise error messages (which state transition failed, what was expected)
- Formally correct handling of all XSD content models

---

## Consequences

### Current (v1.4–v1.6)

- Fast and sufficient for ~75% of real-world XSD schemas
- Recovery mode works for IDE integrations
- Some ambiguous content models produce imprecise errors (acceptable for now)
- Not streaming-safe (DOM required)

### v2.0 DFA

- Full XSD 1.0 content model correctness
- O(n) streaming validation without DOM
- More complex codebase (~600 additional lines for DFA construction)
- Compilation cost paid once per schema (cached in `CompiledSchema`)

---

## Alternatives Considered

| Alternative | Status |
|---|---|
| Full backtracking recursion | Rejected — exponential worst-case risk |
| NFA evaluation | Considered for v2.0; DFA chosen for streaming compatibility |
| Third-party schema validator (Xerces bindings) | Rejected — violates zero-dependency rule |
| Continue with compositor walk in v2.0 | Rejected — insufficient for streaming SAX validation (G1) |

