"The Orthogonal Context"
The Orthogonal Context
A single audio frame in a self-supervised speech model encodes three phones simultaneously: the previous phone, the current phone, and the next phone. This has been suspected from attention patterns and probing experiments. What Choi et al. show is the geometry: the three contributions occupy orthogonal subspaces within the representation vector.
Not approximately orthogonal. Not nearly orthogonal. Orthogonal enough that you can project onto each subspace independently and recover each phone’s identity.
The through-claim: the model solved the context-encoding problem by discovering a spatial decomposition. Rather than blending past, present, and future into an entangled representation (which would require a decoder to untangle), it allocated perpendicular dimensions to each temporal position. Context is superposed without interference because the directions are chosen to be non-interfering.
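The non-interference property can be sketched in a few lines of numpy. Everything here is hypothetical (the dimensions, the bases, the phone codes are invented for illustration, not taken from the paper); the point is only that superposing signals in mutually orthogonal subspaces lets each be read back exactly by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 768, 64  # hypothetical: model dimension, per-subspace dimension

# Carve three mutually orthogonal subspaces out of one orthonormal basis.
Q, _ = np.linalg.qr(rng.standard_normal((d, 3 * k)))
B_prev, B_curr, B_next = Q[:, :k], Q[:, k:2 * k], Q[:, 2 * k:]

# Hypothetical phone codes, one per temporal position.
c_prev, c_curr, c_next = rng.standard_normal((3, k))

# Superpose: the frame vector is the direct sum of the three contributions.
frame = B_prev @ c_prev + B_curr @ c_curr + B_next @ c_next

# Projection onto each subspace recovers each code with no cross-talk,
# because B_i.T @ B_j is the identity when i == j and zero otherwise.
assert np.allclose(B_prev.T @ frame, c_prev)
assert np.allclose(B_curr.T @ frame, c_curr)
assert np.allclose(B_next.T @ frame, c_next)
```

If the three bases were entangled rather than orthogonal, each projection would pick up leakage from the other two positions, and a learned decoder would be needed to separate them.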
This is compositional — the representation is the direct sum of its parts — which is unexpected from a model trained with no explicit instruction to decompose phonetic context. The training objective (masked prediction) doesn’t mention orthogonality. The model arrived at it because it’s the most efficient way to store three overlapping signals in a fixed-dimensional space without losing any of them.
Phonetic boundaries are implicit in this structure. Where the subspace content changes, a boundary lies. The model doesn’t need a boundary detector — the geometry encodes boundaries as discontinuities in the subspace decomposition.
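The boundary-as-discontinuity idea can be made concrete with the same toy setup. Again, this is a sketch under invented assumptions (a synthetic frame sequence whose current-phone code switches at a known point), not the paper's procedure: project every frame onto the current-phone subspace and flag frames where that projection jumps.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 768, 64

# Hypothetical current-phone subspace, as in the earlier construction.
Q, _ = np.linalg.qr(rng.standard_normal((d, 3 * k)))
B_curr = Q[:, k:2 * k]

# Synthetic sequence: phone A for 5 frames, then phone B for 5 frames.
code_a, code_b = rng.standard_normal((2, k))
frames = np.stack([B_curr @ (code_a if t < 5 else code_b) for t in range(10)])

# Project each frame onto the current-phone subspace...
proj = frames @ B_curr
# ...and read boundaries off as jumps between adjacent projections.
jumps = np.linalg.norm(np.diff(proj, axis=0), axis=1)
boundaries = np.flatnonzero(jumps > 1e-6) + 1
print(boundaries)  # → [5], the frame where the current phone changes
```

No trained detector is involved: the boundary falls out of where the subspace content stops being constant.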