The Memorized Robot

Vision-Language-Action models control robots by mapping visual input and language instructions to motor actions. They work — but what are they doing internally? Sparse autoencoders, trained on hidden-layer activations, decompose the model’s representations into interpretable features.
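To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder's forward pass over a batch of hidden-layer activations. The dimensions, weights, and names are hypothetical illustrations, not the paper's actual setup; a real SAE would be trained so that the reconstruction is accurate while the feature vector stays sparse.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 256  # hypothetical: hidden size, SAE dictionary size

# Randomly initialized weights stand in for a trained SAE.
W_enc = rng.normal(0, 0.05, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.05, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(h):
    # Encode: ReLU keeps only a sparse set of nonnegative feature activations.
    f = np.maximum(0.0, h @ W_enc + b_enc)
    # Decode: reconstruct the hidden state as a sum of feature directions.
    h_hat = f @ W_dec + b_dec
    return f, h_hat

h = rng.normal(size=(8, d_model))  # a batch of hidden-layer activations
f, h_hat = sae_forward(h)
# Training minimizes ||h - h_hat||^2 plus an L1 penalty on f to enforce sparsity.
```

Each row of `W_dec` is one candidate "feature": an interpretable direction in the model's activation space whose activation pattern across inputs is what gets inspected.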

The finding is sobering. The large majority of extracted features correspond to memorized sequences from specific training demonstrations. The model isn’t reasoning about how to grasp a cup — it’s replaying the time it saw someone grasp a cup.

But a minority of features do represent something more general: interpretable motion primitives and semantic properties that transfer across tasks and scenes. These features are causally steerable — activating them changes the robot’s behavior in predictable ways. The model has both memory and generalization, but memory dominates.
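Causal steering of this kind can be sketched as adding a feature's decoder direction back into the hidden activation at inference time. Again the weights and dimensions below are placeholders standing in for a trained SAE; the point is only the mechanism, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 64, 256  # hypothetical dims, as before

# Decoder of an (assumed already-trained) SAE: row i is the
# activation-space direction for feature i.
W_dec = rng.normal(0, 0.05, (d_sae, d_model))

def steer(h, feature_idx, alpha):
    """Clamp a feature 'on' by adding its decoder direction, scaled by alpha."""
    return h + alpha * W_dec[feature_idx]

h = rng.normal(size=(d_model,))            # one hidden-layer activation
h_steered = steer(h, feature_idx=17, alpha=5.0)
```

If feature 17 were, say, a "move gripper left" motion primitive, amplifying it this way would bias the decoded action toward leftward motion; that the behavioral change is predictable is what makes the feature causally steerable rather than merely correlated.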

The ratio depends on training. Supervised fine-tuning on small datasets amplifies memorization — the model has few demonstrations to generalize from, so it memorizes each one. Training on larger, more diverse datasets or using knowledge insulation promotes general features. The architecture doesn’t determine the split between memory and generalization; the training data does.

The implication for robot reliability is direct. A model that mostly memorizes will fail in any situation not covered by its demonstrations. The general features are the ones that matter for deployment, but they’re a minority extracted from a sea of memorized replay. Interpretability reveals not just what the model knows, but how little of what it knows is transferable.

