# ATR Language Detection Algorithm v1.0

> **STATUS: PROPOSED v1.0 — NOT YET RATIFIED.** This specification describes
> a target algorithm for community comment. The current TypeScript production
> engine continues to use its existing per-rule language detection. See
> `STANDARDIZATION-STATUS.md` for full status.

**Status:** Draft for AEP-001 ratification — NOT RATIFIED
**Date:** 2026-05-25
**License:** CC BY 4.0
**Required by (on ratification):** Any rule that declares `condition.language` (i.e., a per-language regex condition)

---

## Why this spec exists

ATR rules support per-language conditions:

```yaml
detection:
  conditions:
    - field: user_input
      operator: regex
      language: en
      value: "ignore (?:all )?previous instructions"
    - field: user_input
      operator: regex
      language: zh-Hant
      value: "(?:忽略|無視)(?:前面所有|之前所有|所有先前)的?指示"
```

If different conformant engines disagree on which language a given
input belongs to, the **same input fires different rules in different
engines**. The rule corpus becomes non-portable. This is the
detection-standard equivalent of a heisenbug.

This document specifies a **deterministic algorithm** that all
conformant engines MUST implement. Any conformant ATR engine running
this algorithm on the same input must return the same language code.

---

## Algorithm specification

### Input

- `text`: a Unicode string of arbitrary length, encoded UTF-8.

### Output

- A two-letter ISO 639-1 language code from the supported set, OR the
  three-letter ISO 639-3 code `und` (undetermined).

### Supported languages (v1.0)

| Code | Language | Unicode blocks (primary) |
|---|---|---|
| `en` | English | Basic Latin |
| `zh-Hant` | Traditional Chinese | CJK Unified Ideographs (script-tagged Traditional via Unihan kZVariant inversion when available; defaults to Traditional for Taiwan / Hong Kong corpora) |
| `zh-Hans` | Simplified Chinese | CJK Unified Ideographs (script-tagged Simplified) |
| `ja` | Japanese | Hiragana + Katakana + CJK Unified Ideographs |
| `es` | Spanish | Latin Extended-A + Latin-1 Supplement subset |
| `ar` | Arabic | Arabic + Arabic Supplement |

Additional languages may be added via AEP. Engines that do not
implement a language MUST report `und` for inputs in that language,
NOT fall back to a default.

### Algorithm (deterministic, single-pass)

```text
function detectLanguage(text: string) -> string {
  if length(text) == 0:
    return "und"

  // Phase 1: Unicode block frequency
  blockCounts = empty histogram
  totalCodepoints = 0
  for codepoint in iterateUnicodeCodepoints(text):
    if isWhitespace(codepoint) or isPunctuation(codepoint):
      continue
    blockCounts[unicodeBlockOf(codepoint)] += 1
    totalCodepoints += 1

  if totalCodepoints == 0:
    return "und"

  // Phase 2: dominant-block heuristic
  THRESHOLD_DOMINANT = 0.60
  dominantBlock, dominantCount = argmax(blockCounts)
  if dominantCount / totalCodepoints < THRESHOLD_DOMINANT:
    return classifyMixedScript(blockCounts, totalCodepoints)

  // Phase 3: block-to-language mapping
  switch dominantBlock:
    case BASIC_LATIN:
      // English is the default Latin script. Spanish detected only
      // if Latin-1 Supplement subset (¿ ¡ ñ á é í ó ú) makes up
      // ≥1.5% of codepoints.
      if (count(BASIC_LATIN) + count(LATIN_1_SUPPLEMENT)) / totalCodepoints >= 0.85:
        if hasSpanishMarkers(text) >= 0.015 * totalCodepoints:
          return "es"
        return "en"
      return classifyMixedScript(blockCounts, totalCodepoints)

    case CJK_UNIFIED_IDEOGRAPHS, CJK_UNIFIED_IDEOGRAPHS_EXT_A, ...:
      // Disambiguate Chinese variants and Japanese
      kanaCount = count(HIRAGANA) + count(KATAKANA)
      if kanaCount >= 0.10 * totalCodepoints:
        return "ja"
      // Distinguish Hans vs Hant via Unihan kSimplifiedVariant /
      // kTraditionalVariant lookups on sampled CJK codepoints.
      // Tie-breaker: default to zh-Hant.
      return distinguishHansHant(text)

    case HIRAGANA, KATAKANA:
      return "ja"

    case ARABIC, ARABIC_SUPPLEMENT:
      return "ar"

    default:
      return "und"
}

function classifyMixedScript(blockCounts, totalCodepoints) -> string {
  // Mixed-script inputs (common when English technical terms are
  // embedded in CJK or Arabic text):
  //   1. If any single non-Latin script block ≥ 40% → return that script's language
  //   2. Else → return the language whose block has highest count,
  //      breaking ties by ISO 639-1 alphabetical order (ar, en, es, ja, zh-Hans, zh-Hant)
  // The alphabetical tie-break is the deterministic fallback.
  ...
}
```

### Specific normative requirements for conformant implementations

1. **Whitespace and punctuation are excluded from the frequency count.** Only "content codepoints" enter the histogram.
2. **The 0.60 dominance threshold is normative.** Engines MUST NOT alter it without an AEP-level change.
3. **Hans/Hant distinction is based on Unihan property data**, not on heuristic character set membership. Engines MUST use the Unicode Consortium's Unihan database for kSimplifiedVariant / kTraditionalVariant lookups.
4. **Japanese detection is anchored on kana presence ≥ 10%**, not just on CJK ideograph presence. This prevents mis-classifying Chinese-only text as Japanese.
5. **Spanish vs English is anchored on Spanish-specific markers** (`¿`, `¡`, `ñ`, accented vowels). Engines MUST require ≥ 1.5% of codepoints to be Spanish markers before classifying as `es`.
6. **Tie-breaking is deterministic** via alphabetical ISO 639-1 ordering. No randomness, no implementation-defined behavior.
7. **Unknown blocks default to `und`.** No fuzzy fallback. Rules tagged for unsupported languages do not fire on inputs the engine cannot classify.

### Edge cases (normative)

| Input | Required output |
|---|---|
| Empty string | `und` |
| All whitespace | `und` |
| Single English word | `en` |
| Single Spanish word with ñ | `es` |
| Single Japanese kana character | `ja` |
| Single CJK ideograph (no kana, no Unihan disambiguation possible) | `zh-Hant` (tie-break default) |
| Mixed 60% English + 40% Chinese | `en` (60% dominance reached) |
| Mixed 50% English + 50% Chinese | `en` (alphabetical tie-break: `en` < `zh-Hans`) |
| Pure punctuation | `und` |
| Emoji-only | `und` (emoji are not content codepoints for language classification) |

### Verification

A conformant engine MUST pass the language-detection test corpus at
`spec/conformance/language-detection/`. The corpus contains
≥ 200 fixture inputs with expected outputs. Disagreement on any fixture
is a spec violation.

### Reasoning (non-normative)

This algorithm is designed for **detection-rule dispatch**, not
high-accuracy NLP. Two design choices follow:

1. **Speed over recall**: ATR engines must classify in < 1 ms p99
   for typical inputs to meet the < 100 ms total runtime budget per
   rule. Block-frequency analysis is O(n) over codepoints and meets
   this bound easily. NLP-grade detectors (FastText, langdetect)
   require model loading and stochastic inference; both violate the
   determinism requirement.

2. **Determinism over accuracy on edge cases**: Two engines must
   agree, even if both are slightly wrong on edge cases. A 90% accurate
   deterministic algorithm is more useful than a 95% accurate
   probabilistic one because the spec's portability promise depends on
   bit-for-bit agreement.

The algorithm is intentionally narrow: 6 languages, single-pass,
explicit thresholds. AEPs may add languages or refine thresholds, but
the v1.0 algorithm above is the conformance baseline.

### Test vectors

Engines testing for conformance must reproduce these outputs exactly.
Full fixture set in `spec/conformance/language-detection/v1.0.json`.

| # | Input (UTF-8) | Expected output |
|---|---|---|
| 1 | `""` | `und` |
| 2 | `"   "` | `und` |
| 3 | `"hello world"` | `en` |
| 4 | `"Por favor, ¿podría ayudarme?"` | `es` |
| 5 | `"こんにちは、世界"` | `ja` |
| 6 | `"忽略所有先前指示"` | `zh-Hant` (tie-break) |
| 7 | `"忽略所有先前的指示"` | `zh-Hant` (tie-break; "的" is shared simplified/traditional) |
| 8 | `"忽略所有以前指示"` | `zh-Hans` (Unihan kSimplifiedVariant evidence) |
| 9 | `"تجاهل جميع التعليمات السابقة"` | `ar` |
| 10 | `"@mistralai/mistralai 中的 prompt injection"` | `zh-Hant` (Chinese > 40% non-Latin) |
| 11 | `"call ATR-2026-00525"` | `en` |
| 12 | `"  "` + `​` (ZWS) | `und` |
| 13 | `"😀😎🚀"` (emoji only) | `und` |

### References

- ISO 639-1 / ISO 639-3 language code registry: https://iso639-3.sil.org/
- Unicode Block names: https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt
- Unihan Database: https://www.unicode.org/charts/unihan.html
- Spanish markers: derived from the Real Academia Española orthography guide
- Why deterministic over probabilistic for spec dispatch: discussed in `STANDARD-THREAT-MODEL.md` Attacker class 1 (rule poisoner) which exploits any non-determinism in engine behaviour