Surface Normalization
How multilingual handles SOV and RTL word order through declarative surface normalization rules.
Surface normalization allows SOV (Subject-Object-Verb) and RTL (Right-to-Left) languages to use natural word order without forking the parser grammar.
The Core Tension
multilingual uses:
- One shared parser grammar
- Concept-level keyword mapping
- Localized surface forms for those concepts
This optimizes for implementation consistency and cross-language semantic equivalence. However, a shared positional grammar favors SVO word order (English, French, etc.) over languages where natural phrasing differs: SOV languages such as Japanese and Hindi, and RTL languages such as Arabic.
Without normalization (unnatural Japanese):
毎 i 中 範囲(6):  # works but unnatural
    表示(i)
With normalization (natural Japanese):
範囲(6) 内の 各 i に対して:  # natural SOV form, also accepted
    表示(i)
Both are accepted. Surface normalization rewrites the natural form to the canonical one before parsing.
Mechanism
Lexer output (concept tokens)
│
▼ SurfaceNormalizer reads token stream
┌──────────────────────────────────────────────────┐
│ 1. Match token-level surface pattern │
│ 2. Capture named slots (target, iterable, ...) │
│ 3. Rewrite to canonical concept order │
└──────────────────────────────────────────────────┘
│ normalized token stream (canonical order)
▼
Parser (unchanged grammar)
Key properties:
- Surface normalization runs after lexing, on concept tokens — not raw text
- It does not re-lex; it operates on the lexer’s output
- The parser grammar is unchanged — normalization produces tokens the parser already understands
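The match-and-capture steps (1 and 2 in the diagram) can be sketched in a few lines of Python. This is a minimal illustration, not the project's SurfaceNormalizer: the Token class, its kind names, and match_pattern are hypothetical, and the expr capture is simplified to "everything up to the next fixed element".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    # Hypothetical token shape; the real lexer's token type may differ.
    kind: str   # e.g. "identifier", "literal", "delimiter", "number"
    value: str

def match_pattern(tokens, pattern):
    """Return {slot: [tokens]} if the whole token list matches, else None."""
    slots = {}
    pos = 0
    for idx, elem in enumerate(pattern):
        if elem["kind"] == "expr":
            # Simplification: capture up to the next fixed (valued) element.
            nxt = pattern[idx + 1] if idx + 1 < len(pattern) else None
            end = pos
            while end < len(tokens) and not (
                nxt and tokens[end].value == nxt.get("value")
            ):
                end += 1
            if end == pos:
                return None  # an expression slot must not be empty
            slots[elem["slot"]] = tokens[pos:end]
            pos = end
        elif elem["kind"] == "identifier":
            if pos >= len(tokens) or tokens[pos].kind != "identifier":
                return None
            slots[elem["slot"]] = [tokens[pos]]
            pos += 1
        else:
            # literal / delimiter: required, not captured.
            # (keyword elements would compare concepts; omitted in this sketch.)
            if pos >= len(tokens) or tokens[pos].value != elem.get("value"):
                return None
            pos += 1
    return slots if pos == len(tokens) else None

# Token stream for: 範囲(6) 内の 各 i に対して:
tokens = [
    Token("identifier", "範囲"), Token("delimiter", "("),
    Token("number", "6"), Token("delimiter", ")"),
    Token("literal", "内の"), Token("literal", "各"),
    Token("identifier", "i"), Token("literal", "に対して"),
    Token("delimiter", ":"),
]
pattern = [
    {"kind": "expr", "slot": "iterable"},
    {"kind": "literal", "value": "内の"},
    {"kind": "literal", "value": "各"},
    {"kind": "identifier", "slot": "target"},
    {"kind": "literal", "value": "に対して"},
    {"kind": "delimiter", "value": ":"},
]
slots = match_pattern(tokens, pattern)
```

Because the function works on tokens rather than raw text, it never needs to re-lex: a failed match simply returns None and the original stream can be passed to the parser unchanged.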
Configuration
File: multilingualprogramming/resources/usm/surface_patterns.json
Two top-level sections:
- templates: reusable canonical rewrites
- patterns: language-scoped matching rules
Pattern Structure
{
  "name": "rule_name",
  "language": "ja",
  "normalize_template": "template_name",  // OR normalize_to
  "pattern": [
    { "kind": "expr", "slot": "iterable" },
    { "kind": "literal", "value": "内の" },
    { "kind": "literal", "value": "各" },
    { "kind": "identifier", "slot": "target" },
    { "kind": "literal", "value": "に対して" },
    { "kind": "delimiter", "value": ":" }
  ]
}
Template Structure
{
  "templates": {
    "for_iterable_first": [
      { "kind": "keyword", "concept": "LOOP_FOR" },
      { "kind": "identifier_slot", "slot": "target" },
      { "kind": "keyword", "concept": "IN" },
      { "kind": "expr_slot", "slot": "iterable" },
      { "kind": "delimiter", "value": ":" }
    ]
  }
}
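To show how the two sections fit together, here is a rough sketch of loading a config of this shape and resolving each rule's output, either from its inline normalize_to or from the named template. The load_rules function and the rule name ja_for_iterable_first are hypothetical, and the JSON is inlined rather than read from surface_patterns.json so the example is self-contained.

```python
import json
from collections import defaultdict

# Inline stand-in for surface_patterns.json (same two top-level sections).
CONFIG_TEXT = """
{
  "templates": {
    "for_iterable_first": [
      {"kind": "keyword", "concept": "LOOP_FOR"},
      {"kind": "identifier_slot", "slot": "target"},
      {"kind": "keyword", "concept": "IN"},
      {"kind": "expr_slot", "slot": "iterable"},
      {"kind": "delimiter", "value": ":"}
    ]
  },
  "patterns": [
    {
      "name": "ja_for_iterable_first",
      "language": "ja",
      "normalize_template": "for_iterable_first",
      "pattern": [
        {"kind": "expr", "slot": "iterable"},
        {"kind": "literal", "value": "内の"},
        {"kind": "literal", "value": "各"},
        {"kind": "identifier", "slot": "target"},
        {"kind": "literal", "value": "に対して"},
        {"kind": "delimiter", "value": ":"}
      ]
    }
  ]
}
"""

def load_rules(text):
    """Group rules by language as (name, pattern, resolved_output) tuples."""
    config = json.loads(text)
    templates = config.get("templates", {})
    by_language = defaultdict(list)
    for rule in config.get("patterns", []):
        # Inline output wins if present; otherwise look up the template.
        output = rule.get("normalize_to") or templates[rule["normalize_template"]]
        by_language[rule["language"]].append((rule["name"], rule["pattern"], output))
    return dict(by_language)

rules = load_rules(CONFIG_TEXT)
```

Grouping by language up front means a normalizer only has to try the rules for the language being parsed.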
Pattern Token Kinds
| Kind | Matches | Captures |
|---|---|---|
| expr | Any expression span | Into named slot |
| identifier | Single identifier token | Into named slot |
| keyword | Specific concept token | (required, not captured) |
| delimiter | Delimiter token (:, etc.) | (required, not captured) |
| literal | Literal token value (particles) | (required, not captured) |
Output Token Kinds
| Kind | Emits |
|---|---|
| keyword | Concept keyword token for the target language |
| delimiter | Delimiter token |
| identifier_slot | Captured identifier slot value |
| expr_slot | Captured expression slot tokens |
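As an illustration of how the four output kinds could be expanded, the sketch below turns a template plus captured slots into a canonical token sequence. It is an assumption-laden simplification: tokens are plain strings, emit is a hypothetical helper, and the KEYWORDS table is hand-written from the canonical Japanese loop 毎 i 中 範囲(6): rather than taken from the project's keyword registry.

```python
# Hypothetical concept -> surface keyword table (Japanese only, two entries).
KEYWORDS = {"ja": {"LOOP_FOR": "毎", "IN": "中"}}

def emit(template, slots, language):
    """Expand a template into a flat list of token strings."""
    out = []
    for elem in template:
        kind = elem["kind"]
        if kind == "keyword":
            # Concept keyword rendered for the target language.
            out.append(KEYWORDS[language][elem["concept"]])
        elif kind == "delimiter":
            out.append(elem["value"])
        elif kind in ("identifier_slot", "expr_slot"):
            # Captured tokens are copied through unchanged.
            out.extend(slots[elem["slot"]])
        else:
            raise ValueError(f"unknown output kind: {kind!r}")
    return out

template = [
    {"kind": "keyword", "concept": "LOOP_FOR"},
    {"kind": "identifier_slot", "slot": "target"},
    {"kind": "keyword", "concept": "IN"},
    {"kind": "expr_slot", "slot": "iterable"},
    {"kind": "delimiter", "value": ":"},
]
captured = {"target": ["i"], "iterable": ["範囲", "(", "6", ")"]}
canonical = emit(template, captured, "ja")
```

Here canonical holds the tokens of 毎 i 中 範囲(6): in order, which is the form the unchanged parser grammar already accepts.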
Current Rules
Japanese (ja) — Iterable-First For Loop
Natural form:
範囲(6) 内の 各 i に対して:
    表示(i)
Rule: Matches [iterable] 内の 各 [target] に対して :
Output: 毎 [target] 中 [iterable] : (canonical Japanese for loop)
Parsed as: ForStmt(target="i", iterable=Call("range",[6]))
Arabic (ar) — Iterable-First For Loop
Natural form:
لكل i في مدى(4):
    اطبع(i)
Arabic's natural form already matches the canonical order (FOR target IN iterable:), so little or no reordering is needed.
Spanish (es) and Portuguese (pt) — Optional Iterable-First
Natural form (Spanish):
para cada i en rango(4):
    imprimir(i)
Canonical form (also accepted):
para i en rango(4):
    imprimir(i)
Hindi (hi), Bengali (bn), Tamil (ta) — SOV Forms
These languages have SOV word order. Surface patterns handle:
- Iterable-first for loops
- Condition-first if statements
# Hindi natural for loop
परास(4) में के_लिए i:  # iterable-first (SOV)
    छापो(i)
# Canonical (also accepted)
के_लिए i में परास(4):
    छापो(i)
Authoring New Rules
Workflow
- Write 2–3 real source examples from native speakers
- Tokenize with lexer tests to confirm particles tokenize as expected
- Add the narrowest possible pattern for those forms
- Rewrite via template or inline output to canonical concept order
- Add parser + executor tests before adding more variants
- Repeat with additional rules if needed
Example: Adding a New Rule
{
  "patterns": [
    {
      "name": "xx_for_natural",
      "language": "xx",
      "normalize_to": [
        { "kind": "keyword", "concept": "LOOP_FOR" },
        { "kind": "identifier_slot", "slot": "target" },
        { "kind": "keyword", "concept": "IN" },
        { "kind": "expr_slot", "slot": "iterable" },
        { "kind": "delimiter", "value": ":" }
      ],
      "pattern": [
        { "kind": "expr", "slot": "iterable" },
        { "kind": "literal", "value": "particle_a" },
        { "kind": "identifier", "slot": "target" },
        { "kind": "literal", "value": "particle_b" },
        { "kind": "delimiter", "value": ":" }
      ]
    }
  ]
}
Validation Rules (enforced at load time)
- Rule must have exactly one of normalize_to or normalize_template
- All slots referenced in output must be captured in pattern
- language must be a supported language code
- No overlapping patterns for the same language and statement type
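The first three checks can be sketched as a small load-time validator. This is only an approximation of what the loader might do: validate_rule and its parameters are hypothetical names, raising ValueError for an unknown template is an assumption, and the overlap check is left out because it depends on how statement types are identified.

```python
def validate_rule(rule, templates, supported_languages):
    """Raise ValueError if the rule breaks one of the load-time checks."""
    name = rule.get("name", "<unnamed>")

    # Exactly one of normalize_to / normalize_template.
    has_inline = "normalize_to" in rule
    has_template = "normalize_template" in rule
    if has_inline == has_template:
        raise ValueError(
            f"{name}: need exactly one of normalize_to or normalize_template"
        )

    # language must be a supported code.
    if rule["language"] not in supported_languages:
        raise ValueError(f"{name}: unsupported language {rule['language']!r}")

    # Resolve the output sequence (assumed: unknown template is also an error).
    if has_template:
        if rule["normalize_template"] not in templates:
            raise ValueError(f"{name}: unknown template")
        output = templates[rule["normalize_template"]]
    else:
        output = rule["normalize_to"]

    # Every slot used in the output must be captured by the pattern.
    captured = {e["slot"] for e in rule["pattern"] if "slot" in e}
    used = {e["slot"] for e in output if "slot" in e}
    missing = used - captured
    if missing:
        raise ValueError(f"{name}: output uses uncaptured slots {sorted(missing)}")

templates = {
    "for_iterable_first": [
        {"kind": "keyword", "concept": "LOOP_FOR"},
        {"kind": "identifier_slot", "slot": "target"},
        {"kind": "keyword", "concept": "IN"},
        {"kind": "expr_slot", "slot": "iterable"},
        {"kind": "delimiter", "value": ":"},
    ]
}
good_rule = {
    "name": "xx_for_natural",
    "language": "xx",
    "normalize_template": "for_iterable_first",
    "pattern": [
        {"kind": "expr", "slot": "iterable"},
        {"kind": "literal", "value": "particle_a"},
        {"kind": "identifier", "slot": "target"},
        {"kind": "literal", "value": "particle_b"},
        {"kind": "delimiter", "value": ":"},
    ],
}
validate_rule(good_rule, templates, {"xx"})  # passes silently
```

Failing fast at load time keeps a misspelled slot or template name from surfacing later as a confusing parse error.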
Common Mistakes
| Mistake | Effect |
|---|---|
| normalize_to + normalize_template in same rule | ValueError at load |
| Output slot not in pattern | ValueError at load |
| Overly broad expr that matches multiple lines | Incorrect normalization |
| Missing delimiter in pattern | Failed match |
Debugging Surface Normalization
If a surface form doesn’t parse:
- Confirm lexer tokenization first:

    lexer = Lexer(language="ja")
    tokens = lexer.tokenize("範囲(4) 内の 各 i に対して:")
    for tok in tokens:
        print(f"{tok.type:20} {tok.value!r}")

- Add a parser unit test for just the failing statement
- Check slot names match exactly between pattern and output (target vs iterator)
- Verify template name exists and is spelled exactly
- Ensure normalized sequence is compatible with parser grammar
Limitations
- Surface normalization applies to statement-level constructs only (for, while, if, with)
- Expression-level word order is not normalized (follows canonical order)
- Rules must be manually authored — no automatic inference from language data
- Full morphological variation (inflected particles) is not yet supported