"The Script Wall"

A large reasoning model knows a mathematical proof in English. Ask it to reproduce the proof in French: it succeeds — the languages are different but the script is shared (Latin). Ask it to reproduce it in Arabic: it struggles more. Ask it in Chinese: it struggles most. The knowledge was there. The script blocked its transfer.

Script barriers are more obstructive than language family barriers (arXiv:2603.17070). Languages within the same family but written in different scripts show larger knowledge transfer failures than languages from different families written in the same script. Hindi and Urdu are linguistically nearly identical — they share grammar, most vocabulary, and mutual intelligibility in speech. But Hindi uses Devanagari and Urdu uses Perso-Arabic script. The model’s parametric knowledge, stored through training on text in both scripts, does not transfer cleanly between them.
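The Hindi/Urdu split can be made concrete at the byte level, which is where byte-level BPE tokenizers (the GPT-2 family and its descendants) begin. A minimal sketch, using the word for "water" as an illustrative pair of my own choosing; UTF-8 bytes stand in for the token stream:

```python
# Same word ("water"), same spoken language pair, different scripts:
hindi = "पानी"   # Devanagari
urdu = "پانی"    # Perso-Arabic

# Byte-level tokenizers start from the UTF-8 byte sequence.
h_bytes = hindi.encode("utf-8")
u_bytes = urdu.encode("utf-8")

print(h_bytes.hex(" "))  # e0 a4 aa e0 a4 be e0 a4 a8 e0 a5 80
print(u_bytes.hex(" "))  # d9 be d8 a7 d9 86 db 8c

# Different lengths, no shared prefix: from the tokenizer's point of
# view these are two unrelated sequences, despite identical meaning.
assert len(h_bytes) != len(u_bytes)
assert h_bytes[0] != u_bytes[0]
```

Because Devanagari codepoints encode to three UTF-8 bytes and Perso-Arabic codepoints to two, even the sequence lengths diverge before any merge rules apply, so the learned subword vocabulary for each script is built from disjoint raw material.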

The mechanism is tokenization. Different scripts produce different token sequences for semantically identical content. The model’s internal representations are grounded in token patterns, not in meaning. A mathematical concept learned through Latin-script tokens activates one set of internal pathways. The “same” concept in Cyrillic or Devanagari tokens activates a different, partially overlapping set. The overlap determines the transfer — and script differences reduce the overlap more than linguistic differences do.
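The overlap claim can be sketched with a toy proxy. Character bigrams are not real tokenizer merges, and the word choices below are mine, but the pattern they show is the one described above: a same-script translation shares some surface units with the original, while a cross-script translation shares none.

```python
def bigrams(s: str) -> set:
    """Character bigrams as a crude proxy for shared subword units."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: set, b: set) -> float:
    """Set overlap: 1.0 = identical, 0.0 = disjoint."""
    return len(a & b) / len(a | b)

# "Proof" across scripts (illustrative words, not a benchmark):
en = "proof"            # English, Latin script
fr = "preuve"           # French: different family branch, same script
ru = "доказательство"   # Russian: Cyrillic script

print(jaccard(bigrams(en), bigrams(fr)))  # 0.125 — some shared units
print(jaccard(bigrams(en), bigrams(ru)))  # 0.0 — disjoint scripts
```

Under this proxy the cross-family, same-script pair retains partial overlap, while the cross-script pair drops to zero: the surface channel through which learned associations could carry over is simply absent.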

This is not a failure of the reasoning capability. The model can reason in each script independently. The failure is in the bridge between scripts — the ability to recognize that the same concept, encoded differently, requires the same reasoning steps. The model’s knowledge is not stored once and retrieved in multiple forms. It is stored per-script, with imperfect cross-script connections.

The wall is not between languages. It is between alphabets. The letter is the barrier, not the word.

