Adding a New Language
Data-driven guide to adding a new language frontend to multilingual.
multilingual is designed so new programming languages can be added primarily by updating data files (JSON), not parser or codegen logic.
Language onboarding follows a controlled-language policy: add deterministic, testable surface forms only. See CNL Scope.
Overview
To add language xx, you need to:
- Add keyword mappings in keywords.json
- Add parser/semantic error messages in error_messages.json
- Add REPL localization in commands.json
- (Optional) Add operator description localization in operators.json
- (Optional) Add built-in aliases in builtins_aliases.json
- (Optional) Add surface syntax patterns in surface_patterns.json
- Add tests
- Update documentation
Step 1: Add Keyword Mappings
File: multilingualprogramming/resources/usm/keywords.json
- Add the new language code to the "languages" array
- For every concept in every category, add a translation for the new language
{
  "languages": ["en", "fr", "es", ..., "xx"],
  "keywords": {
    "control_flow": {
      "COND_IF": {
        "en": "if",
        "fr": "si",
        "xx": "your_if_keyword"
      },
      "COND_ELIF": {
        "en": "elif",
        "fr": "sinonsi",
        "xx": "your_elif_keyword"
      }
    }
  }
}
Requirements:
- All 51 concepts must have a translation (completeness validation enforced at load time)
- Prefer unique tokens per language to avoid ambiguity
- Keep tokens identifier-safe (letters/underscores, no spaces)
- Avoid keywords that conflict with builtin names in your language
Why this is enough:
- KeywordRegistry loads this file dynamically
- Lexer detects keywords through KeywordRegistry
- Parser consumes concept tokens, so syntax support follows automatically
- RuntimeBuiltins maps builtins from concept IDs, so execution picks up the new language automatically
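The requirements above can be checked before the file is ever loaded. The following is a minimal sketch, not part of multilingual itself: check_keyword_pack is a hypothetical helper, and it assumes only the keywords.json shape shown in this step. It uses Python's str.isidentifier() as an approximation of "identifier-safe", and reports duplicate tokens even though uniqueness is a preference rather than a hard rule.

```python
# Hypothetical pre-flight check for a keywords.json-shaped dict.
# Not part of multilingual's API; the data below is illustrative only.

def check_keyword_pack(data: dict, lang: str) -> list[str]:
    """Return human-readable problems found for `lang`."""
    problems = []
    if lang not in data.get("languages", []):
        problems.append(f"'{lang}' is missing from the languages array")

    seen = {}  # token -> first concept that used it
    for category, concepts in data.get("keywords", {}).items():
        for concept, translations in concepts.items():
            token = translations.get(lang)
            if token is None:
                problems.append(f"{category}.{concept}: no '{lang}' translation")
                continue
            # Approximation of "letters/underscores, no spaces".
            if not token.isidentifier():
                problems.append(f"{category}.{concept}: '{token}' is not identifier-safe")
            if token in seen:
                problems.append(f"{category}.{concept}: '{token}' already used by {seen[token]}")
            seen.setdefault(token, concept)
    return problems


sample = {
    "languages": ["en", "fr", "xx"],
    "keywords": {
        "control_flow": {
            "COND_IF": {"en": "if", "fr": "si", "xx": "your_if_keyword"},
            "COND_ELIF": {"en": "elif", "fr": "sinonsi"},  # "xx" forgotten
        }
    },
}

print(check_keyword_pack(sample, "xx"))
```

An empty list means the pack passed these particular checks; the completeness validation that runs at load time remains the authoritative one.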
Step 2: Add Error Messages
File: multilingualprogramming/resources/parser/error_messages.json
For each message key under messages, add the new language translation. Use the same placeholders as existing translations.
{
  "messages": {
    "unexpected_token": {
      "en": "Unexpected token '{token}' at line {line}",
      "fr": "Jeton inattendu '{token}' à la ligne {line}",
      "xx": "Your translation with {token} and {line} placeholders"
    },
    "undefined_variable": {
      "en": "Variable '{name}' is not defined",
      "fr": "La variable '{name}' n'est pas définie",
      "xx": "Your translation with {name} placeholder"
    }
  }
}
Why: ErrorMessageRegistry.format() reads this file dynamically and the parser/semantic analyzer use it for diagnostics.
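A dropped or misspelled placeholder is easy to miss in review. The sketch below compares the placeholders in a new translation against the English text. It assumes the messages use str.format-style fields, as the examples above suggest; placeholders and check_placeholders are hypothetical helper names, not functions provided by multilingual.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the named fields (e.g. 'token' for '{token}') used in a format string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_placeholders(messages: dict, lang: str, reference: str = "en") -> dict:
    """Map each mismatching message key to (expected, found) placeholder sets."""
    mismatches = {}
    for key, translations in messages.items():
        expected = placeholders(translations[reference])
        # A missing translation is treated as having no placeholders.
        found = placeholders(translations.get(lang, ""))
        if expected != found:
            mismatches[key] = (expected, found)
    return mismatches


# Illustrative data in the shape of the "messages" object.
messages = {
    "unexpected_token": {
        "en": "Unexpected token '{token}' at line {line}",
        "xx": "Your translation with {token} and {line} placeholders",
    },
    "undefined_variable": {
        "en": "Variable '{name}' is not defined",
        "xx": "Your translation without the placeholder",
    },
}

print(check_placeholders(messages, "xx"))
```

Here only undefined_variable is reported, because its xx text omits {name}.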
Step 3: Add REPL Localization
File: multilingualprogramming/resources/repl/commands.json
Update:
- help_titles for the new language
- Message keys (keywords_title, symbols_title, unsupported_language)
- commands.<name>.aliases: command words in the language
- commands.<name>.descriptions: help text
{
  "help_titles": {
    "en": "Available commands:",
    "fr": "Commandes disponibles:",
    "xx": "Your language: Available commands"
  },
  "commands": {
    "language": {
      "aliases": {
        "en": [":language", ":lang"],
        "fr": [":langue"],
        "xx": [":your_alias"]
      },
      "descriptions": {
        "en": "Switch language",
        "fr": "Changer de langue",
        "xx": "Your description"
      }
    }
  }
}
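When many commands are translated at once, it is possible to give two commands the same alias. The sketch below looks for that within one language. It is an illustration only: find_alias_collisions is a hypothetical helper, the "help" command in the sample data is invented for the example, and whether the REPL itself rejects such collisions is not stated in this guide.

```python
def find_alias_collisions(commands: dict, lang: str) -> dict:
    """Return {alias: [command names]} for aliases claimed by more than one command."""
    owners = {}
    for name, spec in commands.items():
        for alias in spec.get("aliases", {}).get(lang, []):
            owners.setdefault(alias, []).append(name)
    return {alias: names for alias, names in owners.items() if len(names) > 1}


# Illustrative data in the shape of the "commands" object.
commands = {
    "language": {"aliases": {"en": [":language", ":lang"], "xx": [":your_alias"]}},
    "help": {"aliases": {"en": [":help"], "xx": [":your_alias"]}},  # hypothetical command
}

print(find_alias_collisions(commands, "xx"))
```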
Step 4: Operator Localization (Optional)
File: multilingualprogramming/resources/usm/operators.json
Add the new language under description for each operator:
{
  "operators": {
    "+": {
      "description": {
        "en": "Addition",
        "fr": "Addition",
        "xx": "Your language term for addition"
      }
    }
  }
}
Used by the REPL :ops command. Falls back to English if not provided.
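The English fallback described above amounts to a dictionary lookup with a default. The following sketch shows the idea on data shaped like operators.json; describe_operator is a hypothetical name and this is not the REPL's actual code. The "-" entry is sample data added for the example.

```python
def describe_operator(operators: dict, symbol: str, lang: str, default: str = "en") -> str:
    """Return the localized description, or the default-language one if missing."""
    description = operators[symbol]["description"]
    return description.get(lang, description[default])


# Illustrative data in the shape of the "operators" object.
operators = {
    "+": {"description": {"en": "Addition", "fr": "Addition"}},
    "-": {"description": {"en": "Subtraction", "fr": "Soustraction"}},
}

print(describe_operator(operators, "-", "fr"))  # localized text
print(describe_operator(operators, "-", "xx"))  # falls back to English
```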
Step 5: Built-in Aliases (Optional)
File: multilingualprogramming/resources/usm/builtins_aliases.json
Add localized aliases for selected universal builtins:
{
  "builtins": {
    "print": {
      "en": "print",
      "fr": "afficher",
      "xx": "your_print_word"
    },
    "range": {
      "en": "range",
      "fr": "intervalle",
      "xx": "your_range_word"
    },
    "len": {
      "en": "len",
      "fr": "longueur",
      "xx": "your_length_word"
    }
  }
}
41 builtin concepts have localization support. The universal English name always remains available — aliases are additive.
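"Additive" means the localized name is bound alongside the universal one, never instead of it. The sketch below illustrates this on data shaped like the "builtins" object. It assumes each universal name corresponds to the Python builtin of the same name; build_builtin_namespace is a hypothetical helper and not how RuntimeBuiltins is necessarily implemented.

```python
import builtins as py_builtins


def build_builtin_namespace(aliases: dict, lang: str) -> dict:
    """Bind each universal name and, when present, its localized alias to the same object."""
    namespace = {}
    for universal_name, translations in aliases.items():
        target = getattr(py_builtins, universal_name)
        namespace[universal_name] = target  # the universal name always stays available
        localized = translations.get(lang)
        if localized:
            namespace[localized] = target   # the alias is added on top
    return namespace


# Illustrative data in the shape of the "builtins" object.
aliases = {
    "print": {"en": "print", "fr": "afficher"},
    "range": {"en": "range", "fr": "intervalle"},
    "len": {"en": "len", "fr": "longueur"},
}

namespace = build_builtin_namespace(aliases, "fr")
print(sorted(namespace))
```

Both namespace["len"] and namespace["longueur"] refer to the same function, so existing programs that use the universal names keep working.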
Step 6: Surface Syntax Patterns (Optional)
File: multilingualprogramming/resources/usm/surface_patterns.json
Use this when keyword translation alone is insufficient for natural phrasing. Typical use cases:
- Iterable-first for headers (SOV languages like Japanese, Hindi)
- Language-specific particles around loop/condition clauses
- RTL languages (Arabic)
{
  "templates": {
    "for_iterable_first": [
      { "kind": "keyword", "concept": "LOOP_FOR" },
      { "kind": "identifier_slot", "slot": "target" },
      { "kind": "keyword", "concept": "IN" },
      { "kind": "expr_slot", "slot": "iterable" },
      { "kind": "delimiter", "value": ":" }
    ]
  },
  "patterns": [
    {
      "name": "xx_for_iterable_first",
      "language": "xx",
      "normalize_template": "for_iterable_first",
      "pattern": [
        { "kind": "expr", "slot": "iterable" },
        { "kind": "literal", "value": "particle_a" },
        { "kind": "identifier", "slot": "target" },
        { "kind": "literal", "value": "particle_b" },
        { "kind": "delimiter", "value": ":" }
      ]
    }
  ]
}
Pattern token kinds:
- expr: capture expression span into a slot
- identifier: capture single identifier token
- keyword: require a specific concept token
- delimiter: require a delimiter token (:, etc.)
- literal: require a literal token (particles like 内の, ضمن)
Output kinds (normalize_to or template):
- keyword: emit a concept keyword token
- delimiter: emit a delimiter token
- identifier_slot: emit captured identifier
- expr_slot: emit captured expression
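To make the capture-then-emit idea concrete, here is a minimal sketch that matches a token list against a pattern and rewrites it through a template. It is not the project's SurfaceNormalizer: match_pattern and emit_template are hypothetical functions, tokens are plain strings, the matcher does not backtrack, the keyword pattern kind is left out, and emitted keywords are represented by their concept IDs.

```python
def match_pattern(tokens: list, pattern: list):
    """Return captured slots if `tokens` match `pattern`, else None."""
    slots, pos = {}, 0
    for index, element in enumerate(pattern):
        kind = element["kind"]
        if kind in ("literal", "delimiter"):
            if pos >= len(tokens) or tokens[pos] != element["value"]:
                return None
            pos += 1
        elif kind == "identifier":
            if pos >= len(tokens) or not tokens[pos].isidentifier():
                return None
            slots[element["slot"]] = [tokens[pos]]
            pos += 1
        elif kind == "expr":
            # The expression runs until the next fixed token in the pattern.
            following = pattern[index + 1] if index + 1 < len(pattern) else {}
            stop = following.get("value")
            end = pos
            while end < len(tokens) and tokens[end] != stop:
                end += 1
            if end == pos:
                return None  # empty expression
            slots[element["slot"]] = tokens[pos:end]
            pos = end
        else:
            return None  # e.g. "keyword" is omitted from this sketch
    return slots if pos == len(tokens) else None


def emit_template(template: list, slots: dict) -> list:
    """Produce the normalized token list from a template and captured slots."""
    out = []
    for element in template:
        kind = element["kind"]
        if kind == "keyword":
            out.append(element["concept"])  # concept ID stands in for the keyword token
        elif kind == "delimiter":
            out.append(element["value"])
        elif kind in ("identifier_slot", "expr_slot"):
            out.extend(slots[element["slot"]])
    return out


template = [
    {"kind": "keyword", "concept": "LOOP_FOR"},
    {"kind": "identifier_slot", "slot": "target"},
    {"kind": "keyword", "concept": "IN"},
    {"kind": "expr_slot", "slot": "iterable"},
    {"kind": "delimiter", "value": ":"},
]
pattern = [
    {"kind": "expr", "slot": "iterable"},
    {"kind": "literal", "value": "particle_a"},
    {"kind": "identifier", "slot": "target"},
    {"kind": "literal", "value": "particle_b"},
    {"kind": "delimiter", "value": ":"},
]

tokens = ["range", "(", "3", ")", "particle_a", "i", "particle_b", ":"]
slots = match_pattern(tokens, pattern)
print(emit_template(template, slots))
```

The iterable-first input is rewritten into the canonical order LOOP_FOR, target, IN, iterable, delimiter, which is the form the parser already understands.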
Step 7: Add Tests
Minimum recommended tests:
1. Keyword Registry Tests
# tests/keyword_registry_test.py
# Import KeywordRegistry from the multilingualprogramming package before running.

def test_xx_in_supported_languages():
    registry = KeywordRegistry()
    assert "xx" in registry.get_supported_languages()


def test_xx_cond_if():
    registry = KeywordRegistry()
    assert registry.get_keyword("COND_IF", "xx") == "your_if_keyword"
2. End-to-End Executor Test
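# tests/executor_test.py
# Import ProgramExecutor from the multilingualprogramming package before running.

def test_xx_basic_program():
    executor = ProgramExecutor()
    # The program text stays at column 0 so its own indentation is unaffected.
    source = """
your_let_keyword x = 5
your_let_keyword y = 10
your_print_word(x + y)
"""
    output = executor.execute(source, language="xx")
    assert "15" in output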
3. Error Messages Test
# tests/error_messages_test.py
# Import ErrorMessageRegistry from the multilingualprogramming package before running.

def test_xx_has_all_error_messages():
    registry = ErrorMessageRegistry()
    for key in registry.get_all_message_keys():
        assert "xx" in registry.get_languages_for_message(key)
4. Runtime Builtins Test
# tests/runtime_builtins_test.py
# Import RuntimeBuiltins from the multilingualprogramming package before running.

def test_xx_print_alias():
    builtins = RuntimeBuiltins.for_language("xx")
    assert "your_print_word" in builtins
    assert builtins["your_print_word"] is print
5. Surface Normalizer Test (if adding patterns)
# tests/surface_normalizer_test.py
# Import SurfaceNormalizer from the multilingualprogramming package before running.

def test_xx_surface_config_valid():
    # Ensure config stays schema-valid
    normalizer = SurfaceNormalizer()
    assert normalizer.validate_config()


def test_xx_for_loop_normalized():
    # Test that your surface form normalizes correctly
    ...
Step 8: Update Documentation
At minimum:
- README.md: add to supported languages list
- docs/reference.md: add to supported languages table
- Link this onboarding guide where relevant
Validation Commands
# Full test suite
python -m pytest -q

# Focused validation for new language
python -m pytest -q \
  tests/keyword_registry_test.py \
  tests/error_messages_test.py \
  tests/executor_test.py \
  tests/repl_test.py

# Surface normalizer (if patterns added)
python -m pytest -q \
  tests/surface_normalizer_test.py \
  tests/parser_test.py \
  tests/executor_test.py

# Language smoke tests
python -m multilingualprogramming smoke --lang xx
python -m multilingualprogramming smoke --all

# Lint
python -m pylint $(git ls-files '*.py')
Common Mistakes
- Missing concepts: All 51 concepts must have translations. Validation fails at load time for incomplete packs.
- Ambiguous keywords: Avoid keywords that are shared with existing languages if they have different meanings.
- Reserved words: Don’t use keywords that conflict with Python built-in names that are not localized.
- Surface pattern errors: Defining both normalize_to and normalize_template in one rule raises ValueError.
- Missing slot references: Referencing a slot in the output that was never captured in the pattern raises ValueError.
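The two surface-pattern mistakes can be caught with a small structural check before the configuration is loaded. The sketch below is a hypothetical validate_rule, not the validator multilingual ships. It assumes the rule and template shapes shown in Step 6, and rejecting a rule that defines neither output key is an assumption of this sketch, not documented behaviour.

```python
def validate_rule(rule: dict, templates: dict) -> None:
    """Raise ValueError for the surface-pattern mistakes listed above."""
    has_inline = "normalize_to" in rule
    has_template = "normalize_template" in rule
    if has_inline and has_template:
        raise ValueError(f"{rule['name']}: normalize_to and normalize_template are both set")
    if not (has_inline or has_template):
        # Assumption of this sketch: a rule needs some output definition.
        raise ValueError(f"{rule['name']}: no output defined")

    output = rule["normalize_to"] if has_inline else templates[rule["normalize_template"]]
    captured = {element["slot"] for element in rule["pattern"] if "slot" in element}
    for element in output:
        slot = element.get("slot")
        if slot is not None and slot not in captured:
            raise ValueError(f"{rule['name']}: output uses uncaptured slot '{slot}'")


templates = {
    "for_iterable_first": [
        {"kind": "keyword", "concept": "LOOP_FOR"},
        {"kind": "identifier_slot", "slot": "target"},
        {"kind": "keyword", "concept": "IN"},
        {"kind": "expr_slot", "slot": "iterable"},
        {"kind": "delimiter", "value": ":"},
    ]
}
broken_rule = {
    "name": "xx_for_iterable_first",
    "language": "xx",
    "normalize_template": "for_iterable_first",
    "pattern": [
        {"kind": "expr", "slot": "iterable"},
        {"kind": "delimiter", "value": ":"},  # "target" is never captured
    ],
}

try:
    validate_rule(broken_rule, templates)
except ValueError as error:
    print(error)
```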
Checklist Template
Use docs/templates/language_pack_checklist.md when opening a PR for a new language pack:
## Language Pack: [Language Name] ([code])
- [ ] keywords.json updated (all 51 concepts)
- [ ] error_messages.json updated (all message keys)
- [ ] commands.json updated (REPL commands)
- [ ] operators.json updated (optional)
- [ ] builtins_aliases.json updated (optional)
- [ ] surface_patterns.json updated (optional, with tests)
- [ ] Keyword registry tests passing
- [ ] Executor end-to-end test passing
- [ ] Error messages test passing
- [ ] Runtime builtins test passing
- [ ] Smoke test passing: `multilingual smoke --lang xx`
- [ ] All 858 existing tests still passing
- [ ] README.md updated
- [ ] docs/reference.md updated