"The Redundant Dimension"
The Redundant Dimension
A transformer with hundreds of millions of parameters trains on a dataset. The optimization landscape is a surface in a space with as many dimensions as there are parameters — hundreds of millions of axes along which the model can move.
arXiv:2603.15678 tracks the spectral edge dynamics of training trajectories and finds that, despite the vast parameter count, training evolution is effectively confined to between 10 and 100 coherent directions. Reducing the trajectory to this tiny subspace preserves the spectral gap to within 5.7%. The training does not explore a vast space — it travels along a low-dimensional corridor.
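The measurement idea can be illustrated with a toy experiment (a sketch, not the paper's method): build a synthetic "trajectory" that lives in a small subspace of a large parameter space, then use PCA to recover how many directions actually carry the motion. All names and the 99%-variance threshold here are illustrative choices.

```python
import numpy as np

# Hypothetical sketch: estimate the effective dimensionality of a training
# trajectory via PCA. The trajectory is synthetic — a random walk confined
# to a low-dimensional subspace of a large ambient space, standing in for
# the recorded checkpoints of a real training run.
rng = np.random.default_rng(0)

n_params = 10_000   # ambient parameter count (stand-in for hundreds of millions)
n_steps = 500       # number of recorded checkpoints
k_true = 20         # dimensionality of the corridor the walk lives in

# Orthonormal basis for a random k_true-dimensional subspace.
basis, _ = np.linalg.qr(rng.standard_normal((n_params, k_true)))

# Trajectory: cumulative steps inside that subspace, plus tiny ambient noise.
steps = rng.standard_normal((n_steps, k_true)).cumsum(axis=0)
trajectory = steps @ basis.T + 1e-3 * rng.standard_normal((n_steps, n_params))

# PCA on the centered trajectory: how many components explain the motion?
centered = trajectory - trajectory.mean(axis=0)
_, svals, _ = np.linalg.svd(centered, full_matrices=False)
var_explained = svals**2 / (svals**2).sum()

# Components needed to capture 99% of the trajectory's variance.
effective_dim = int(np.searchsorted(np.cumsum(var_explained), 0.99) + 1)
print(f"ambient dims: {n_params}, effective dims: {effective_dim}")
```

Despite 10,000 ambient dimensions, the recovered effective dimensionality comes out at or below the 20 planted directions — the same separation between model size and training dimensionality that the paper reports.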
The millions of parameters provide redundancy, not capacity. The model needs them to represent the final solution (a function of millions of inputs mapped through millions of weights), but the process of arriving at that solution requires exploring only a handful of independent directions. The optimization is low-dimensional even though the parameterization is high-dimensional.
This separates two quantities that are usually conflated: the dimensionality of the model (how many parameters it has) and the dimensionality of training (how many independent directions the optimization actually uses). The model is large. The training is small. The redundancy — all the directions the optimization does not use — is not wasted; it is the structural slack that allows the few active directions to be effective. Without the ambient high-dimensional space, the low-dimensional corridor might not exist.
The spectral edge — the boundary between the few large eigenvalues that govern training and the bulk of small eigenvalues that contribute only noise — is a sharp, measurable feature. It does not drift. It establishes itself early and persists. The dimensionality of learning is not a function of training time; it is a property of the task-architecture pair, fixed early and maintained throughout.