/**
 * Shared regex emission-normalization (Milestone C, Slice 1).
 *
 * KERN transpiles a regex literal `/pattern/flags` to BOTH a TypeScript
 * `RegExp` and a Python `re` pattern, under a guaranteed-parity contract. The
 * shorthand classes `\d \w \s` and the input-anchors `$ ^` do NOT mean the same
 * thing across the two engines out of the box:
 *
 *  - Python str `\d`/`\w` are Unicode-aware (match Arabic-Indic digits, accented
 *    letters); JS `\d`/`\w` (without `/u`) are ASCII. JS `\s` is broader than
 *    ASCII whitespace (matches NBSP).
 *  - Python `$` (no `re.MULTILINE`) matches before a trailing `\n`; JS `$`
 *    (no `/m`) matches end-of-input only.
 *  - Python `\b` is Unicode-aware unless `re.ASCII` is set; JS `\b` (no `/u`)
 *    is ASCII.
 *
 * Slice 1 makes the certified core byte-identical by *construction* rather than
 * by luck:
 *
 *  1. {@link normalizeRegexClasses} rewrites `\d \w \s` to explicit ASCII
 *     classes. This is applied to the emitted pattern on BOTH targets so the
 *     class transform is provably the same string transform on each side.
 *  2. {@link lowerRegexAnchorsPython} lowers `$`→`\Z` / `^`→`\A` on the
 *     non-`/m` path. PYTHON-ONLY: JS `$`/`^` without `/m` already mean
 *     input-end/start (the parity target), so the TS emitter never calls this.
 *  3. `re.ASCII` injection happens in the Python flag emitter (not here) so
 *     Python `\b` and the ASCII classes behave like JS.
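 *
 * Illustration (a hedged sketch, not part of KERN): the JS half of the
 * divergences listed above can be observed with plain `RegExp`. The helper
 * name `jsBaseline` is hypothetical, no KERN function is called, and the
 * Python half is not exercised here.
 *
```typescript
// JS-side behavior only. Each field records what a stock JS RegExp does.
function jsBaseline(): Record<string, boolean> {
  return {
    // U+0663 ARABIC-INDIC DIGIT THREE: JS \d is ASCII-only.
    digitMatchesArabicIndic: /\d/.test('\u0663'),
    // U+00E9 (e with acute): JS \w without /u is ASCII-only.
    wordMatchesAccented: /\w/.test('\u00E9'),
    // U+00A0 NBSP: JS \s is broader than ASCII whitespace.
    spaceMatchesNbsp: /\s/.test('\u00A0'),
    // JS $ without /m matches end-of-input only, not before a trailing \n.
    dollarBeforeTrailingNewline: /a$/.test('a\n'),
    // The explicit ASCII class that \s is rewritten to rejects NBSP.
    asciiSpaceClassMatchesNbsp: /[ \t\n\r\f\v]/.test('\u00A0'),
  };
}
```
 *
 * Only `spaceMatchesNbsp` is true, which is why the `\s` rewrite is the one
 * class normalization that changes matching on the TS side.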
 *
 * CLASS-/ESCAPE-AWARE NORMALIZERS (hardening over the original Slice-1 crude
 * `replaceAll`): both {@link normalizeRegexClasses} and {@link lowerRegexAnchorsPython}
 * now walk the pattern with `classDepth` + escape bookkeeping (the shared
 * {@link scanCharClass} helper) instead of blind string replacement, so:
 *  - `\d`/`\w`/`\s` INSIDE a `[...]` set expand to the BARE body (`[\d_]`→`[0-9_]`),
 *    not the INVALID nested class `[[0-9]_]` the old `replaceAll` produced;
 *  - a LITERAL `\\d` (escaped backslash + `d`) and an escaped `\$`/`\^` are left
 *    VERBATIM (only an ACTIVE, unescaped shorthand/anchor is transformed).
 * Parity is still by *construction*: the SAME function runs on both targets, so a
 * given input yields a byte-identical residual pattern on each side. Both emitters
 * feed the same (modulo the Python `\/`→`/` un-escape, which touches only `/`)
 * input, so the shorthand expansion is identical across TS and Python.
 */
/**
 * Rewrite the shorthand classes `\d \w \s` to explicit ASCII character classes.
 * Applied to the emitted pattern on BOTH targets so the transform is identical.
 *
 * On TS, `\d`/`\w` normalization is a match no-op (JS shorthand is already
 * ASCII) but is emitted anyway so both emitters read from ONE normalizer; the
 * `\s` normalization on TS is load-bearing (it narrows JS `\s` to drop Unicode
 * whitespace such as NBSP).
 *
 * CLASS- AND ESCAPE-AWARE (single forward pass). The expansion FORM depends on
 * whether the shorthand sits inside a `[...]` character class:
 *  - OUT of a class (classDepth 0): `\d`→`[0-9]`, `\w`→`[A-Za-z0-9_]`,
 *    `\s`→`[ \t\n\r\f\v]` (a fresh bracketed class).
 *  - INSIDE a class (classDepth > 0): `\d`→`0-9`, `\w`→`A-Za-z0-9_`,
 *    `\s`→` \t\n\r\f\v` — the BARE body, NO brackets. The old blind `replaceAll`
 *    turned `/[\d_]/` into the INVALID nested class `[[0-9]_]`; the bare form
 *    keeps it the valid `[0-9_]`.
 * A `\d`/`\w`/`\s` is only expanded when its `\` is an ACTIVE shorthand backslash
 * (an UNescaped `\`).
 * A `\\d` (escaped backslash, then a LITERAL `d`) leaves the
 * `d` untouched — the old `replaceAll('\\d', …)` wrongly rewrote it. The matching
 * `]` of each class is found via the literal-`]`-first-aware {@link scanCharClass}.
 * An unterminated `[` is UNREACHABLE for a parsed `regexLit` — the regex scanner
 * (`consumeRegex` in parser-expression.ts) only ends a literal on a `/` seen at
 * `!inClass`, so every parsed pattern has balanced `[...]` (an unbalanced `[`
 * throws "Unclosed regex literal" at parse time). The `closeIdx === -1` branch is
 * therefore belt-and-suspenders: the `[` is emitted as-is and scanning continues
 * at depth 0, so a LATER shorthand still expands bracketed (NOT a verbatim
 * pass-through of the rest) — fine, because such input can never reach here.
 *
 * ORDERING INVARIANT (codified): this pass runs FIRST in the regex pipeline —
 * before {@link expandRegexIFold} and {@link lowerRegexAnchorsPython} on both
 * targets (TS: codegen-expression.ts; Python: codegen-body-python.ts). It is the
 * ONLY pass that EMITS `[`/`]` from shorthand expansion; every downstream pass
 * re-scans char classes with {@link scanCharClass}, so the brackets it introduces
 * are honored exactly. Do NOT reorder the pipeline and do NOT add a second
 * `[`/`]`-emitting normalizer.
 */
export declare function normalizeRegexClasses(pattern: string): string;
/**
 * PYTHON-ONLY anchor lowering. On the non-`/m` path, rewrite `$`→`\Z` and
 * `^`→`\A` so Python anchors match JS's already-correct input-end/start
 * semantics (Python `$`/`^` without `re.MULTILINE` differ at a trailing
 * newline). On the `/m` path, keep `$`/`^` verbatim — `re.MULTILINE` (added by
 * the flag emitter) makes them line-based, identical to JS `/m`.
 *
 * CLASS- AND ESCAPE-AWARE (single forward pass): only a `^`/`$` that is a TRUE
 * anchor is lowered — one that is at `classDepth === 0` AND is NOT escaped (not
 * immediately preceded by an unescaped backslash).
 * A `^`/`$` that is INSIDE a
 * `[...]` character class, or escaped (`\^`/`\$`), is a literal/negation marker —
 * NOT an anchor — and is left VERBATIM. The old crude `replaceAll` rewrote those
 * too, emitting Python that CRASHED at compile (`/[^a]/`→`[\Aa]` and
 * `/[a$]/`→`[a\Z]` both raise `re.error: bad escape`) or silently corrupted an
 * escaped literal (`/a\^b/`→`a\Ab`). This pass uses the same escape/`classDepth`
 * bookkeeping (and the literal-`]`-first-aware {@link scanCharClass}) as
 * {@link expandRegexIFold}, so a class's open/close is honored exactly (`[]]`,
 * `[^]]`, `[]$]` do not close early and an in-class `^`/`$` stays verbatim).
 *
 * Parity: the TS emitter never calls this (JS `$`/`^` without `/m` already mean
 * input-end/start, the parity target), so the TS side keeps `^`/`$` verbatim.
 * Runs AFTER {@link normalizeRegexClasses} + {@link expandRegexIFold}, mirroring
 * the prior call order.
 */
export declare function lowerRegexAnchorsPython(pattern: string, flags: string): string;
import type { ValueIR } from '../value-ir.js';
/**
 * Discriminates WHY {@link expandRegexIFold} fail-closed, so the shared message
 * builder emits the right (target-symmetric) diagnostic:
 *  - `'setB'` : a length-changing / declined fold (ß, ligatures,
 *      titlecase) — no single-codepoint partner, can't be a class.
 *  - `'backref'` : `/i` + a backreference + a non-ASCII Set(A) letter. JS `/i`
 *      folds the backreference too, but the emitted explicit-class
 *      expansion under `re.ASCII` does NOT fold the `\N` backref's
 *      non-ASCII referent — a SILENT cross-engine divergence.
 *  - `'complexClass'` : a Set(A) letter appears inside a `[...]` class whose body
 *      is COMPLEX — it contains a backslash escape OR a `-` in a
 *      range position. Expanding a member in place would corrupt
 *      a range bound (`[a-é]` → `[a-Éé]` drops U+00CA..U+00E8) or
 *      mis-handle an escape chain (`[\\-é]` is a REAL range `\`..`é`,
 *      not an escaped hyphen) — both SILENT divergences.
 *      We refuse
 *      the whole class rather than guess, so the fragile per-`-`
 *      escape-adjacency heuristic (which mis-read those chains)
 *      is gone. (SIMPLE classes — no `\`, no range `-` — still expand.)
 */
export type RegexIFoldFailReason = 'setB' | 'backref' | 'complexClass';
/** Result of {@link expandRegexIFold}: an expanded pattern, or a fail-close. */
export type RegexIFoldResult = {
    pattern: string;
} | {
    failClose: true;
    char: string;
    reason: RegexIFoldFailReason;
};
/**
 * Build the (target-agnostic) compile-error message for a `/i` fail-close. Both
 * emitters throw this identical text (selected by {@link RegexIFoldFailReason}) so
 * the refusal is observably symmetric across TS and Python.
 *
 * `reason` defaults to `'setB'` for backward compatibility with the original
 * single-`char` signature.
 */
export declare function regexIFoldFailMessage(char: string, reason?: RegexIFoldFailReason): string;
/**
 * Expand non-ASCII Set(A) letters under `/i` into explicit fold-class characters,
 * or fail-close on a Set(B) letter. No-op when `flags` does not include `i`.
 *
 * Scans by CODE POINT (so surrogate pairs are handled as one unit; this also
 * leaves a clean seam for a later astral-fail-close slice to add a `cp > 0xFFFF`
 * branch — NOT added here). Tracks `[...]` class depth so a Set(A) letter ALREADY
 * inside a class is expanded to its BARE members (no brackets) — `/[xé]/i` →
 * `[xÉé]`, not the invalid nested `[x[Éé]]` (§2.4).
 *
 * ASCII characters are left untouched: ASCII-letter `/i` folding is handled by the
 * kept flag (`/i` on TS, `re.IGNORECASE | re.ASCII` on Python, Slice 1). Non-ASCII
 * characters that are neither Set(A) nor Set(B) are emitted verbatim — those are
 * letters that do not fold to any other single codepoint under non-`/u` /i (they
 * match only themselves on both engines), so keeping them raw is parity-safe.
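 *
 * Illustration (a hedged sketch, not KERN's tables): the Set(A)/Set(B) split
 * described above can be approximated with the built-in string case mappings.
 * `foldClassMembers` is a hypothetical helper. It treats a letter whose upper
 * or lower mapping is longer than one code point as "no single-codepoint
 * partner", and it does not reproduce every case KERN declines (titlecase
 * letters, for instance), nor the rule that ASCII letters are left untouched.
 *
```typescript
// Returns the BARE class members for a letter, or null when a case mapping
// changes length (the analogue of a fail-close in this sketch).
function foldClassMembers(ch: string): string | null {
  const variants = new Set<string>([ch]);
  for (const mapped of [ch.toUpperCase(), ch.toLowerCase()]) {
    // Count code points, not UTF-16 units.
    if (Array.from(mapped).length !== 1) return null;
    variants.add(mapped);
  }
  // Sort so the output is deterministic.
  return Array.from(variants).sort().join('');
}

// Splicing the bare members into an existing class keeps it a single class.
function expandInsideClass(prefix: string, ch: string): string | null {
  const members = foldClassMembers(ch);
  return members === null ? null : '[' + prefix + members + ']';
}
```
 *
 * Under these assumptions `expandInsideClass('x', 'é')` yields `[xÉé]`, while
 * `foldClassMembers('ß')` returns null because `'ß'.toUpperCase()` is `'SS'`.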
 *
 * Two further fail-closes guard SILENT cross-engine divergences the class
 * expansion alone cannot make portable (verified empirically, node v22.22.0 /
 * python3 3.12.7):
 *
 *  - BACKREF (`/(é)\1/i`): JS `/i` case-folds the backreference's referent too, so
 *    `/(é)\1/i` matches `"Éé"`; but the emitted `([Éé])\1` under `re.ASCII`
 *    suppresses the non-ASCII fold of the `\1` referent → MISS on Python. Detected
 *    LEXICALLY and CONSERVATIVELY: a backreference token (`\1`–`\9`, or a named
 *    `\k<name>`) seen AT `classDepth === 0` (a `\1`/`\k<` INSIDE a `[...]` class is
 *    NOT a backreference in either engine — it is a literal/octal — so it never sets
 *    the flag), combined with ANY non-ASCII Set(A) letter present, fail-closes.
 *    Over-rejecting a backref that happens to target an ASCII-only group is
 *    intentional and parity-safe; precise group→backref analysis is out of scope.
 *  - COMPLEX CLASS (`/[a-é]/i`, `/[\\-é]/i`): a Set(A) letter inside a `[...]` class
 *    expands ONLY IF its enclosing class is SIMPLE — no backslash escape and no `-`
 *    in a range position. Otherwise the whole class fail-closes (`'complexClass'`).
 *    This replaces the old per-`-` escape-adjacency heuristic (which mis-read escape
 *    chains: it treated `[\\-é]`'s real `\`..`é` range as an escaped hyphen and
 *    expanded `é`, silently corrupting the range). Classifying the WHOLE class ONCE
 *    ({@link scanCharClass} + {@link isComplexClassBody}) removes that edge class
 *    entirely. SIMPLE members (`/[xé]/i`→`[xÉé]`, `/[-é]/i`, `/[é-]/i`) still expand.
 *
 * Both new fail-closes throw the SAME message on TS and Python (the emitters share
 * this function), so the refusal is observably symmetric.
 */
export declare function expandRegexIFold(pattern: string, flags: string): RegexIFoldResult;
export declare const REGEX_TEST_G_FAILCLOSE = "Python target does not lower RegExp.test with the 'g' flag: JS mutates lastIndex across calls while re.search is stateless. Use .matchAll (global) for stateful iteration.";
export declare const REGEX_EXEC_FAILCLOSE = "Python target does not lower RegExp.exec: it relies on JS\u2019s stateful lastIndex, which has no portable re analog. Use .matchAll (global) for iteration.";
export declare const REGEX_MATCHALL_NO_G_FAILCLOSE = "matchAll requires the 'g' flag (a non-global matchAll throws TypeError in JS).";
export declare const REGEX_REPLACEALL_NO_G_FAILCLOSE = "replaceAll requires the 'g' flag (a non-global replaceAll throws TypeError in JS).";
export declare const REGEX_SPLIT_ZEROWIDTH_FAILCLOSE = "Python target does not lower String.split with a zero-width-capable pattern: JS drops empty edge segments while re.split keeps them. Use a pattern that cannot match the empty string.";
export declare const REGEX_SPLIT_LIMIT_FAILCLOSE = "Python target does not lower String.split with a limit argument: JS truncates the result while Python maxsplit keeps the unsplit remainder.";
export declare const REGEX_NONLITERAL_FAILCLOSE = "Portable regex methods (.match/.matchAll/.replace/.replaceAll/.split/.test/.exec) require a DIRECT regex literal (`/\u2026/`) in the regex position; a variable bound to a regex is not portable across targets \u2014 inline the literal at the call site.";
export declare const REGEX_HOST_REGEXP_FAILCLOSE = "Host 'RegExp' is not portable across targets and is fail-closed: construction (`new RegExp(p)` / `RegExp(p, f)`) takes a STRING pattern, so KERN's certified literal escape/class pipeline never runs (`new RegExp(\"\\\\d\")` already collapsed to `\\d` at the string layer, diverging from a `/\\d/` literal), and the runtime SyntaxError/flag model differs across JS and Python. Legacy statics (`RegExp.$1`, `RegExp.prototype`), value-position uses, and `.source`/`.flags` on a literal (which launders the pattern back to a string) have no portable analog either. Use a DIRECT regex literal (`/\u2026/`) and the portable methods (.test/.exec/.match/.matchAll/.replace/.replaceAll/.split).";
/** The property allowlist for a REGEX LITERAL (`/…/`) member READ. The portable
 * match-set METHODS (.test/.exec/.match/.matchAll/.replace/.replaceAll/.split)
 * are routed by the CALL path (Slices 3/4) and never reach a bare property read.
 * A bare property READ on a literal — `/x/.source`, `/x/.flags`, `/x/.global`,
 * `RegExp`-prototype Symbol accessors — launders the pattern/flags back into a
 * STRING (or exposes a host-only accessor), which is exactly the non-portable
 * surface this slice closes. The allowlist is EMPTY: every bare property read on
 * a regex literal is fail-closed. (Kept as a named predicate so a future portable
 * read — if one is ever certified cross-target — has one obvious seam to widen.) */
export declare function isPortableRegexLiteralProperty(_property: string): boolean;
/** SHARED, target-agnostic classifier for a property/element access whose
 * receiver is a REGEX LITERAL (`/x/.<prop>`, `/x/["<prop>"]`, optionally the
 * callee of a call `/x/.<method>(…)`). This is the SINGLE source of truth for the
 * "is this regex-literal access portable, and if not, which message?" decision,
 * consulted by BOTH the value-emit/IR-validate paths' intent AND the
 * block-bodied-arrow TS-AST walk (`collectClosureBlockRegexHostViolations`),
 * so the two legs agree BY CONSTRUCTION instead of by parallel heuristics.
 *
 * It MIRRORS `lowerRegexCallTS`' regex-LITERAL-RECEIVER branches exactly. NOTE
 * `lowerRegexCallTS` only lowers a DOTTED method call (`callee.kind ===
 * 'member'`), so ONLY the dotted form is ever portable — a BRACKET-form call
 * (`/x/["test"](s)`) is NOT lowered and falls through to the index fail-close,
 * exactly like a bare bracket read.
 * Hence the `isDottedCallee` parameter (the
 * access is a `/x/.<prop>` PROPERTY access AND the callee of a call):
 *  - `isDottedCallee` + `.test` → portable, EXCEPT a `/g` literal throws
 *    `REGEX_TEST_G_FAILCLOSE` (JS mutates lastIndex; Python `re.search` is
 *    stateless).
 *  - `isDottedCallee` + `.exec` → `REGEX_EXEC_FAILCLOSE` (stateful lastIndex).
 *  - EVERYTHING else — any other property, OR `.test`/`.exec` NOT a dotted
 *    callee (a bare read `/x/.test`/`/x/["test"]`, or a BRACKET call
 *    `/x/["test"](s)`), OR any non-portable read (`/x/.source`, `/x/["source"]`),
 *    OR a receiver-call to a non-portable method (`/x/.match(…)`,
 *    `/x/.compile(…)`) — launders the pattern/flags back to a host-only surface
 *    and fails-close with the shared `REGEX_HOST_REGEXP_FAILCLOSE`.
 *
 * Returns `null` when the access is PORTABLE (emit verbatim), or the exact
 * fail-close MESSAGE otherwise. `property` is null for a COMPUTED element index
 * (`/x/[k]`) — unknowable, so it fails-close. */
export declare function classifyRegexLiteralAccessFailClose(property: string | null, isDottedCallee: boolean, flags: string): string | null;
/** Recursively peel the transparent IR wrappers (`typeAssert`, `nonNull`) off a
 * receiver until the underlying node, so a wrapped regex literal is seen exactly
 * like a bare one. Fixpoint loop (not a single unwrap) so stacked wrappers like
 * `((/x/ as any))!` → `nonNull(typeAssert(regexLit))` collapse to `regexLit`. A
 * non-wrapped node is returned unchanged. */
export declare function unwrapTransparentReceiverIR(node: ValueIR): ValueIR;
/** The regex-literal a receiver resolves to AFTER peeling transparent wrappers
 * (`typeAssert`/`nonNull`), or `null` when the unwrapped receiver is not a regex
 * literal. This is the SINGLE predicate every ValueIR leg uses to decide "is this
 * receiver a (possibly-wrapped) regex literal?", so the wrapped and bare forms
 * are screened identically on the TS-emit, IR-validate, and Python-emit paths.
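 *
 * Illustration (a hedged sketch under stated assumptions): the fixpoint peel
 * and the receiver predicate described above, written against a hypothetical
 * miniature IR. `MiniIR`, its `expr` field, `unwrapMini` and
 * `regexLiteralReceiverMini` are invented for this example; the real `ValueIR`
 * shape lives in '../value-ir.js' and is not shown in this file.
 *
```typescript
// Hypothetical miniature IR with two transparent wrappers.
type MiniIR =
  | { kind: 'regexLit'; pattern: string; flags: string }
  | { kind: 'ident'; name: string }
  | { kind: 'typeAssert'; expr: MiniIR }
  | { kind: 'nonNull'; expr: MiniIR };

// Fixpoint loop: keep peeling until the node is no longer a wrapper, so
// stacked wrappers collapse in one call.
function unwrapMini(node: MiniIR): MiniIR {
  let current = node;
  while (current.kind === 'typeAssert' || current.kind === 'nonNull') {
    current = current.expr;
  }
  return current;
}

// The single "is this receiver a (possibly wrapped) regex literal?" check.
function regexLiteralReceiverMini(node: MiniIR): Extract<MiniIR, { kind: 'regexLit' }> | null {
  const inner = unwrapMini(node);
  return inner.kind === 'regexLit' ? inner : null;
}
```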
 */
export declare function regexLiteralReceiverIR(node: ValueIR): Extract<ValueIR, { kind: 'regexLit' }> | null;
/** A `member` access whose receiver is a REGEX LITERAL is a bare property READ
 * (`/x/.source`, `(/x/ as any).source`) — NEVER a dotted callee here (the
 * callee-of-a-call case is the `call` node, routed by
 * {@link classifyRegexLiteralValueIRCallCalleeFailClose}). Returns the classifier
 * verdict; always non-null today (the empty portable-read allowlist), but routed
 * through the classifier so a future portable read widens in ONE place. */
export declare function classifyRegexLiteralMemberReadFailClose(member: Extract<ValueIR, { kind: 'member' }>): string | null;
/** An `index` access whose receiver is a REGEX LITERAL is a bracket property READ
 * (`/x/["source"]`, `(/x/!)["source"]`, `/x/[k]`). A STRING-literal index yields
 * its value so it classifies like the dotted-read form; a COMPUTED / non-string
 * index is `property = null` (unknowable) → fail-close. Bracket reads are NEVER a
 * portable dotted callee (`lowerRegexCall*` lowers `callee.kind === 'member'`
 * only), so a BRACKET call `/x/["test"](s)` also fails-close here exactly like a
 * bare read. */
export declare function classifyRegexLiteralIndexReadFailClose(index: Extract<ValueIR, { kind: 'index' }>): string | null;
/** The CALLEE of a `call` whose callee is a DOTTED member access on a REGEX
 * LITERAL (`/x/.test(s)`, `(/x/g as any).test(s)`, `/x/.exec(s)`,
 * `/x/.compile(y)`). This is the seam that fixes the IR-validate over-rejection
 * of the common `/x/.test(s)`: the classifier blesses a non-`/g` `.test` callee
 * (returns `null` = PORTABLE), and gives the PRECISE `.exec`/`/g`-`.test` message
 * for those — matching the TS/Python emit legs and the closure walk, instead of
 * the blanket member-read fail-close the IR-validate `call` case used to hit by
 * re-validating the callee.
 *
 * Returns `null` (PORTABLE) ONLY for a blessed dotted method callee; returns the
 * fail-close message for a non-portable dotted method (`/x/.compile`, `.match`,
 * …).
 * Returns `undefined` when this call's callee is NOT a dotted regex-literal
 * member access (so the caller falls through to its normal callee handling) —
 * this is distinct from `null` (a PORTABLE regex-literal call). A BRACKET-form
 * call `/x/["test"](s)` has an `index` callee (not `member`), so it returns
 * `undefined` here and is owned by the `index` read fail-close above. The DOTTED
 * receiver is unwrapped, so `(/x/ as any).test(s)` classifies like `/x/.test(s)`. */
export declare function classifyRegexLiteralValueIRCallCalleeFailClose(call: Extract<ValueIR, { kind: 'call' }>): string | null | undefined;
export declare const REGEX_ASTRAL_FAILCLOSE_PREFIX = "Regex with a non-BMP (astral) construct";
/** Build the (target-agnostic) compile-error message for a Slice-5 astral
 * fail-close. Both emitters throw this identical text — selected only by the
 * offending astral codepoint (named via {@link codePointHex}) — so the refusal
 * is observably symmetric across TS and Python. */
export declare function regexAstralFailMessage(char: string): string;
/**
 * Slice-5 astral scanner. Walk the regex PATTERN SOURCE by CODE POINT (reusing the
 * SAME class-aware, escape-aware codepoint loop as {@link expandRegexIFold} and the
 * literal-`]`-first-aware {@link scanCharClass}) and return the FIRST offending
 * non-BMP codepoint as `{ char }`, or `null` if the pattern is fully BMP.
 *
 * CODEPOINT-AWARE, NOT UNIT-BLIND: the loop iterates `Array.from(pattern)`, which
 * splits the source by Unicode codepoint — a raw astral char (a surrogate PAIR in
 * the UTF-16 source) becomes ONE array element whose `codePointAt(0)` is its FULL
 * codepoint (`/a😀b/` → the `😀` element decodes to U+1F600 >= 0x10000, fired by
 * rule 1 BECAUSE the codepoint is astral, not incidentally), and the index advances
 * past the whole pair in one step.
 * A naive `String.match(/\\uD[89AB].../)` over the
 * raw text would FALSE-POSITIVE on a literal backslash-u-D800 (`\\uD800` is `\` `u`
 * `D` `8` `0` `0`, NOT a lone surrogate) — we avoid that by being escape-aware:
 * a `\u`/`\u{}` astral is detected via the escape branch, never via raw text match.
 *
 * The FIVE rules form a COMPLETE partition of "astral in the pattern source"
 * (validated by a 6-engine tribunal: no over-reach, no missing construct):
 *  1. Raw astral codepoint literal: any source codepoint >= 0x10000, anywhere
 *     INCLUDING inside `[...]` (class-awareness does NOT suppress the scan; it is
 *     only carried for diagnostic context — every position is checked).
 *  2. `\u{HHHHH}` escape whose decoded value >= 0x10000.
 *  3. Astral character-class RANGE `[x-y]` where EITHER endpoint decodes to
 *     >= 0x10000 — subsumed by rules 1+2, which fire on the offending endpoint
 *     regardless of class/range position. (This does NOT subsume `[\uD800-\uDFFF]`:
 *     pure surrogates are < 0x10000 and are caught by rule 5, below.)
 *  4. Surrogate-PAIR escape: `\uD800-\uDBFF` IMMEDIATELY followed by
 *     `\uDC00-\uDFFF` recombines to an astral codepoint (>= 0x10000). Context: in a
 *     SEQUENCE it is an astral pair; inside a class or when SPLIT the two are lone
 *     surrogates caught by rule 5 — every branch fails-close (safety is
 *     unconditional; the pair-recombination is only for DIAGNOSTIC accuracy, so the
 *     named codepoint is the recombined astral char, not a bare surrogate).
 *  5. Lone surrogate escape `\uD800-\uDFFF` not forming a pair — non-portable
 *     (a lone surrogate is rejected/treated differently across engines).
 *
 * Runs on the RAW pattern BEFORE class-/fold-/anchor-normalization (like
 * {@link isZeroWidthCapableRegex}) on BOTH the TS and Python paths, so the same
 * decision and the same `{ char }` are produced from the same input on each side.
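 *
 * Illustration (a hedged sketch, deliberately incomplete): a code-point,
 * escape-aware walk covering rules 1 and 2 only. `firstAstralSketch` is a
 * hypothetical name; the surrogate-escape rules (4 and 5) and the class
 * bookkeeping of the real scanner are omitted.
 *
```typescript
function firstAstralSketch(pattern: string): { char: string } | null {
  const cps = Array.from(pattern); // one element per code point
  for (let i = 0; i < cps.length; i++) {
    const ch = cps[i];
    if (ch === '\\') {
      // Rule 2: a braced \u escape whose decoded value is astral.
      if (cps[i + 1] === 'u' && cps[i + 2] === '{') {
        const close = cps.indexOf('}', i + 3);
        if (close !== -1) {
          const value = parseInt(cps.slice(i + 3, close).join(''), 16);
          if (value >= 0x10000 && value <= 0x10ffff) {
            return { char: String.fromCodePoint(value) };
          }
          i = close; // skip the whole escape
          continue;
        }
      }
      i += 1; // any other escape: skip the escaped character as a unit
      continue;
    }
    // Rule 1: a raw astral code point anywhere in the source.
    if ((ch.codePointAt(0) ?? 0) >= 0x10000) return { char: ch };
  }
  return null;
}
```
 *
 * Because an escaped backslash is skipped as a unit, a source such as `\\uD800`
 * (backslash, backslash, then the letters) is not misread as an escape here.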
 */
export declare function scanRegexAstral(pattern: string): {
    char: string;
} | null;
/** If `call` is a regex-method shape whose regex position is a bare IDENT,
 * return that ident's name; otherwise null. Pure structural peek — no binding
 * table, no resolution. Shared by both targets so the fail-close decision is
 * made from the SAME shape analysis on each side. */
export declare function regexMethodRegexArgIdent(call: Extract<ValueIR, { kind: 'call' }>): string | null;
export declare function isZeroWidthCapableRegex(pattern: string): boolean;
export declare const REGEX_REPLACE_NONLITERAL_REPL_FAILCLOSE = "Portable .replace/.replaceAll with a regex literal requires a STRING-LITERAL replacement (the JS `$`-surface can only be lowered to the Python re.sub syntax when known at compile time); a computed/variable replacement is not portable across targets \u2014 inline a string literal at the call site.";
export declare const REGEX_REPLACE_BEFORE_AFTER_FAILCLOSE = "Python re.sub has no analog for the `$\\`` (text before match) / `$'` (text after match) replacement tokens; KERN fail-closes them on BOTH targets.";
export declare const REGEX_REPLACE_OOR_REF_FAILCLOSE = "Out-of-range numbered group reference in a .replace/.replaceAll replacement string: JS would emit the literal text while Python re.sub raises re.error \u2014 KERN fail-closes this likely-typo on BOTH targets. (A literal `$0` is allowed; groups start at 1.)";
export declare const REGEX_REPLACE_BAD_NAME_FAILCLOSE = "Reference to an unknown or Python-illegal named group in a .replace/.replaceAll replacement string. KERN fail-closes on BOTH targets (the named group must exist in the pattern and be a legal Python identifier `[A-Za-z_]\\w*`).";
/**
 * FIX 2 — a named group in the PATTERN (`(?<name>…)`) whose NAME is OUTSIDE KERN's
 * certified-portable ASCII identifier subset `[A-Za-z_][A-Za-z0-9_]*`.
 * JS admits
 * Unicode ID-start chars in group names and Python `re` accepts a different
 * Unicode-identifier set (CPython uses `str.isidentifier`), so a non-ASCII name
 * like `(?<café>…)` is a SILENT cross-target divergence risk — and the legacy
 * Python lowering emitted the JS form `(?<name>…)` verbatim, which Python `re`
 * REJECTS at compile (`unknown extension ?; }
/**
 * Count positional capture groups + collect named-group names over the KERN/JS
 * pattern surface (the `(?<name>)` form, BEFORE the R6 `(?P<name>)` rewrite).
 * Skips `(?:`, lookarounds, escapes, and char classes. Mirrors the oracle's
 * `capture_meta` so the lowering site resolves refs identically.
 *
 * MUST be called on the UN-LOWERED JS pattern (pre-{@link lowerRegexNamedGroupsPython}):
 * it recognizes ONLY the JS opener `(?<name>`, NOT the already-lowered Python form
 * `(?P<name>)`. Calling it after the lowering would silently count ZERO named
 * groups. (The TS/Python emitters both pass the raw `node.pattern`, which is correct.)
 *
 * FIX 1: named-group RECOGNITION matches ALL JS-valid names (Unicode included),
 * so `(?<café>x)(b)` is COUNTED as 2 groups (and `$2` resolves to `(b)`) instead
 * of mis-counting the Unicode-named group as zero. Name PORTABILITY is enforced
 * separately by {@link validateRegexNamedGroupsPortable}.
 *
 * CLASS-BOUNDARY UNIFICATION (Slice-4 re-review blocker): char classes are scanned
 * by the CANONICAL {@link scanCharClass} (literal-`]`-first-aware, code-point array),
 * the SAME scanner {@link validateRegexNamedGroupsPortable} and
 * {@link lowerRegexNamedGroupsPython} use. The previous inline scan closed at the
 * FIRST `]`, which disagreed with the rewriter on a literal-`]`-first class
 * (`/[](?<name>)]/`, `/[^]](?<name>)/`): the COUNTER read `(?<name>)` as a real group while
 * the REWRITER kept it INSIDE the class, so count/validate/rewrite operated on
 * different class structures — a silent parity divergence. All three now agree
 * byte-for-byte on where every class ends.
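 *
 * Illustration (a hedged sketch): one way a literal-`]`-first-aware class scan
 * can be written. `findClassEndSketch` is a hypothetical stand-in; the real
 * {@link scanCharClass} signature is not shown in this file. It returns the
 * index of the closing `]` for the class opened at `cps[open]`, or -1 when the
 * class is unterminated.
 *
```typescript
function findClassEndSketch(cps: string[], open: number): number {
  let i = open + 1;
  if (cps[i] === '^') i++; // optional negation marker
  if (cps[i] === ']') i++; // a leading `]` is a literal member, not the close
  for (; i < cps.length; i++) {
    if (cps[i] === '\\') {
      i++; // skip the escaped character as a unit
      continue;
    }
    if (cps[i] === ']') return i;
  }
  return -1;
}

// Convenience wrapper for a class that starts at the first code point.
function classEndOf(pattern: string): number {
  return findClassEndSketch(Array.from(pattern), 0);
}
```
 *
 * With this rule `[](?<name>)]` closes at its final `]`, so the `(?<name>)`
 * inside it is class content for every pass that shares the scanner.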
 */
export declare function regexCaptureMeta(pattern: string): RegexCaptureMeta;
/**
 * FIX 2 — fail-close any named group in the PATTERN whose NAME is OUTSIDE KERN's
 * portable ASCII identifier subset `[A-Za-z_][A-Za-z0-9_]*` (Unicode like `café`,
 * an empty name `(?<>…)`, or a `$`-prefixed name `(?<$x>…)`). Shared by BOTH
 * targets so the refusal is observably symmetric: it is called at the TS regex-
 * literal emit chokepoints AND in the Python `pyRegexPattern` lowering, so EVERY
 * regex method (match/matchAll/split/test/replace/…) — not just `.replace` — and
 * a bare regex literal all refuse a non-portable name identically.
 *
 * CLASS- AND ESCAPE-AWARE (single forward pass, sharing the CANONICAL
 * {@link scanCharClass} with {@link regexCaptureMeta} and
 * {@link lowerRegexNamedGroupsPython}): a `(?<` that is INSIDE a `[...]` char
 * class, or whose `(` is escaped (`\(?<`), is a literal — NOT a group opener — and
 * is skipped. The class scan is literal-`]`-first-aware (`[]…]` / `[^]…]` does NOT
 * close at the leading `]`), so the validator agrees byte-for-byte with the counter
 * and the rewriter on where every class ends — a previous inline close-at-first-`]`
 * scan disagreed on a literal-`]`-first class (`/[](?<name>)]/`), validating a group the
 * rewriter treated as in-class (or vice versa), a silent parity divergence.
 * Lookbehind `(?<=` / `(? Python `re` syntax, so a `$<name>`
 * repl ref (and any in-pattern backreference) resolves on the Python side:
 * `(?<name>...)` -> `(?P<name>...)` ; `\k<name>` -> `(?P=name)`.
 * Python rejects the JS `(?<name>)` / `\k<name>` forms outright, so this rewrite
 * is load-bearing for ANY named-group pattern on the Python target — it had no
 * prior lowering (the Slice-3 `.match` path never exercised a named PATTERN on
 * Python). PYTHON-ONLY: the TS target keeps the JS form verbatim.
 *
 * FIX 3 — CLASS- AND ESCAPE-AWARE (single forward pass, NOT a blind global
 * `String.replace`).
 * A literal `\k<name>` that appears INSIDE a `[...]` char class
 * (`/[\k<name>]/`) or whose backslash is itself escaped (`\\k<name>` = a literal `\` +
 * `k<name>`) is NOT a backreference and must NOT be rewritten — the old blind
 * `replace(/\\k<…>/g, …)` rewrote those too, corrupting the pattern. We track
 * `[...]` class depth (literal-`]`-first-aware, via the same {@link scanCharClass}
 * the other normalizers use) and the escape state, and rewrite ONLY a TRUE
 * `(?<name>` group opener at classDepth 0 and a TRUE `\k<name>` backref at
 * classDepth 0 whose backslash is unescaped. Names are restricted to the portable
 * ASCII subset (any non-portable name has already been refused upstream by
 * {@link validateRegexNamedGroupsPortable}, so a non-matching `(?<…>`/`\k<…>` here
 * is a non-backref / in-class literal and is left verbatim).
 */
export declare function lowerRegexNamedGroupsPython(pattern: string): string;
/**
 * Translate a JS `$`-surface replacement STRING to the Python `re.sub` repl VALUE.
 * `meta` is the capture metadata of the (un-lowered KERN/JS) pattern. Throws one
 * of the `REGEX_REPLACE_*` fail-close messages on a non-portable token.
 */
export declare function translateReplStringToPython(repl: string, meta: RegexCaptureMeta): string;
/**
 * TS-side validator: the JS `$`-surface is already native, so the TS target emits
 * the repl string VERBATIM — but it must reject the SAME non-portable tokens the
 * Python translator rejects, so both targets fail-close symmetrically (the
 * ts-python-parity lockstep). Runs the identical scan and discards the output.
 */
export declare function validateReplStringForTS(repl: string, meta: RegexCaptureMeta): void;
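/**
 * Illustration (a hedged sketch, not KERN's implementation): how a JS
 * `$`-surface replacement string might be mapped to a Python `re.sub` repl
 * value. `translateReplSketch`, its parameters, and the choice of the
 * `\g<...>` output form are assumptions made for this example; the real
 * `translateReplStringToPython` takes a `RegexCaptureMeta` and throws the
 * `REGEX_REPLACE_*` messages, neither of which is reproduced here.
 *
```typescript
function translateReplSketch(repl: string, groupCount: number, names: string[]): string {
  let out = '';
  for (let i = 0; i < repl.length; i++) {
    const ch = repl[i];
    if (ch === '\\') { out += '\\\\'; continue; } // keep a backslash literal for re.sub
    if (ch !== '$') { out += ch; continue; }
    const rest = repl.slice(i + 1);
    if (rest.startsWith('$')) { out += '$'; i += 1; continue; }      // `$$` is one literal dollar
    if (rest.startsWith('&')) { out += '\\g<0>'; i += 1; continue; } // whole match
    if (rest.startsWith('`') || rest.startsWith("'")) {
      throw new Error('before/after-match replacement token is not portable');
    }
    const named = /^<([A-Za-z_][A-Za-z0-9_]*)>/.exec(rest);
    if (named) {
      if (!names.includes(named[1])) throw new Error('unknown named group');
      out += '\\g<' + named[1] + '>';
      i += named[0].length;
      continue;
    }
    const numbered = /^[0-9]{1,2}/.exec(rest);
    if (numbered) {
      let digits = numbered[0];
      // JS reads `$nn` as group nn only when that many groups exist,
      // otherwise as `$n` followed by a literal digit.
      if (digits.length === 2 && Number(digits) > groupCount) digits = digits.slice(0, 1);
      const n = Number(digits);
      if (n === 0) { out += '$'; continue; } // `$0` stays literal text
      if (n > groupCount) throw new Error('out-of-range group reference');
      out += '\\g<' + n + '>';
      i += digits.length;
      continue;
    }
    out += '$'; // a lone `$` is literal
  }
  return out;
}
```
 *
 * A TS-side validator in the spirit of `validateReplStringForTS` could call the
 * same function and discard the result, so both targets reject the same tokens.
 */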