"The Bound Token"
The Bound Token
Medical records encode events as pairs: a code identifying what happened (diagnosis, procedure, medication) and attributes describing the details (severity, route, dosage). The standard tokenization strategy separates these — the code gets one token, each attribute gets another, and the model must learn to associate them during training.
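A minimal sketch of the split strategy, using hypothetical code and attribute names (the vocabulary and event format here are illustrative, not taken from the paper):

```python
# Hypothetical clinical events: (code, attribute) pairs.
# Under split tokenization, each element of the pair gets its own token,
# and the model must learn at training time that adjacent tokens belong together.

def split_tokenize(events, vocab):
    """Emit two tokens per event: one for the code, one for the attribute."""
    tokens = []
    for code, attribute in events:
        tokens.append(vocab[code])       # token for what happened
        tokens.append(vocab[attribute])  # token for the detail
    return tokens

# Illustrative vocabulary (DX = diagnosis, RX = medication).
vocab = {"DX:J18.9": 0, "SEV:moderate": 1, "RX:amoxicillin": 2, "ROUTE:oral": 3}
events = [("DX:J18.9", "SEV:moderate"), ("RX:amoxicillin", "ROUTE:oral")]
print(split_tokenize(events, vocab))  # → [0, 1, 2, 3]
```

Note that nothing in the token sequence itself marks which attribute modifies which code; that association exists only statistically, as something the model must infer.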
arXiv:2603.15644 finds that binding code-attribute pairs into single tokens — joint event encoding — outperforms the split representation across 73 of 74 clinical prediction tasks while requiring 39.5% fewer pretraining operations. The bound representation is both better and cheaper.
The mechanism: the association between code and attribute is not something the model needs to learn. It is a known structural fact about the data — diagnosis codes always carry severity attributes, medications always carry route and dosage attributes. By encoding this known structure into the tokenization itself, the model is freed from rediscovering what the data format already specifies. The pretraining compute that would have been spent learning code-attribute associations can instead be spent learning the clinical patterns that actually require learning.
This inverts the default assumption in representation learning, which treats tokenization as a preprocessing step that should be as general as possible. Generality — splitting everything into atomic tokens — is supposed to let the model discover structure on its own. But when the structure is already known, generality forces the model to waste capacity rediscovering it.
The principle: encode what you know into the representation; let the model learn what you don’t know. Known structure should be in the tokenization, not in the training data. The boundary between representation and learning is a design choice, and moving it to encode more known structure produces better models with less computation.