"The Velocity Token"

The Velocity Token

Attention is a gradient flow. Each token moves toward other tokens it finds relevant, pulled by the gradient of an interaction energy. The dynamics are first-order: tokens have positions but no momentum. They creep downhill.
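As a concrete sketch (not the paper's code), the first-order picture can be written as an explicit Euler step on a pairwise interaction energy. The Gaussian similarity kernel, `beta`, and step size `dt` below are illustrative choices standing in for query-key attention:

```python
import numpy as np

def attention_force(X, beta=1.0):
    """Softmax-weighted pull of each token toward the tokens it finds relevant.

    This is (minus) the gradient of a pairwise interaction energy built from a
    Gaussian similarity kernel -- an illustrative stand-in for QK attention.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-beta * sq_dists)
    W /= W.sum(axis=1, keepdims=True)   # row-wise softmax over neighbors
    return W @ X - X                    # pull toward the attention-weighted mean

def first_order_step(X, dt=0.1):
    """Overdamped dynamics: position follows the force directly, no momentum."""
    return X + dt * attention_force(X)

# Tokens creep downhill: the point cloud slowly contracts toward consensus.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
for _ in range(50):
    X = first_order_step(X)
```

Because each step is an averaging map (a convex combination of the identity and a positive row-stochastic matrix), the spread of the tokens shrinks monotonically; that monotone creep is what "first-order" buys you, and all it buys you.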

Stein, Li, and Steidl give tokens velocity. In their SympFormer architecture, each token carries two variables: a spatial feature vector (where it is in representation space) and a velocity vector (how fast and in what direction it is moving). The resulting dynamics are Hamiltonian — second-order, momentum-carrying, symplectically structured.
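A minimal sketch of the position-velocity update, reusing the same Gaussian-kernel stand-in for the interaction force. The kick-drift (leapfrog-style) splitting is a standard symplectic integrator; the friction term is my addition so the demo settles rather than oscillating forever, and with `friction=0` the step is purely Hamiltonian:

```python
import numpy as np

def attention_force(X, beta=1.0):
    """Illustrative attention force: softmax-weighted pull toward neighbors."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-beta * sq_dists)
    W /= W.sum(axis=1, keepdims=True)
    return W @ X - X

def leapfrog_step(X, V, dt=0.1, friction=0.1):
    """Second-order token dynamics: kick the velocity, then drift the position.

    With friction=0 this is a plain symplectic (kick-drift) integrator of
    Hamiltonian dynamics; the friction is added here only so tokens settle.
    """
    V = (1.0 - friction) * V + dt * attention_force(X)  # kick
    X = X + dt * V                                      # drift
    return X, V

# Each token now carries state (X[i], V[i]): where it is, and where it is going.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
V = np.zeros_like(X)
for _ in range(100):
    X, V = leapfrog_step(X, V)
```

The only structural change from the first-order sketch is the extra state variable `V`: the force updates the velocity instead of the position, and the position integrates the velocity.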

This is Nesterov acceleration applied not to parameter optimization but to the attention mechanism itself. Classical attention is gradient descent on the interaction energy between tokens. SympFormer is Nesterov’s accelerated method on the same energy, but formulated on the space of probability densities equipped with Wasserstein-2 geometry. The acceleration is exact: the Hamiltonian structure preserves elliptically contoured distributions and converges provably faster than first-order attention, without additional oracle calls.
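The claimed speedup mirrors the textbook behavior of Nesterov's method. As a hedged toy stand-in for the interaction energy (a plain ill-conditioned quadratic, not the Wasserstein-2 construction in the paper), compare gradient descent with Nesterov acceleration over the same number of steps:

```python
import numpy as np

# Toy energy E(x) = 0.5 * x @ H @ x, standing in for the token interaction
# energy; condition number kappa = 100.
H = np.diag([1.0, 100.0])
grad = lambda x: H @ x
lr = 1.0 / 100.0                                         # standard 1/L step size
kappa = 100.0
momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # Nesterov coefficient

x_gd = np.array([1.0, 1.0])    # first-order: creep downhill
x_nes = np.array([1.0, 1.0])   # second-order: carry momentum
x_prev = x_nes.copy()

for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)
    y = x_nes + momentum * (x_nes - x_prev)   # look ahead along the velocity
    x_prev = x_nes
    x_nes = y - lr * grad(y)

# After the same 100 steps, the momentum iterate is orders of magnitude
# closer to the minimizer than the plain gradient iterate.
```

Along the shallow direction of the quadratic, gradient descent contracts by a factor of 0.99 per step while the Nesterov iterate contracts by roughly 0.9 per step, which is exactly the first-order-versus-accelerated gap the paragraph above describes.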

The practical consequence: tokens with momentum overshoot, correct, and settle faster than tokens that merely creep. A token moving toward a cluster of relevant neighbors carries its velocity past the first neighbor, arriving at the cluster center in fewer steps. The momentum is not a heuristic — it is the time discretization of inertial dynamics on the density manifold.

The structural lesson is about the physics of computation. Attention has been interpreted as a particle system (interacting tokens), a gradient flow (energy minimization), and a kernel method (similarity weighting). Each interpretation suggests different improvements. The particle interpretation suggests SympFormer: if tokens are particles, give them mass. First-order particles are overdamped; second-order particles have inertia. The improvement is not architectural — it is physical. The same interaction energy, the same tokens, the same computational cost. The only change is that the tokens remember where they were going.

