"The Difference Between Seeing and Learning"
Normalizing neural activations during inference—scaling signals to keep them in a stable range—does essentially nothing to improve performance on image recognition tasks with variable lighting. This surprised the researchers, because normalization is widespread in biological vision systems. But when they extended normalization to the backpropagated error signals, performance improved dramatically. The mechanism that does nothing during forward inference becomes critical during learning.
The difference is this: normalizing activations stabilizes what you see right now. Normalizing errors stabilizes what you learn from what you saw. These are different operations with different effects. The first controls signal magnitude during perception; the second controls gradient magnitude during plasticity. Biological neural circuits use inhibition-mediated normalization extensively, but if that machinery improves learning—and not just perception—it must be acting on the error signals, not just the activity patterns.
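The distinction can be sketched numerically. The following is a minimal illustration, not the researchers' actual model: it assumes simple divisive (unit-norm) normalization and models variable lighting as a pure gain change on a made-up activation vector. The names (`a_dim`, `a_bright`, the stand-in "error signals") are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, eps=1e-6):
    """Divisive normalization: rescale a signal to roughly unit norm.
    The same operation is applied to two different signals below."""
    return x / (np.linalg.norm(x) + eps)

# Hypothetical activations under two lighting conditions:
# same pattern, 10x difference in gain.
a_dim = rng.normal(size=8)
a_bright = 10.0 * a_dim

# 1) Normalizing ACTIVATIONS stabilizes what you see: both inputs
#    map to the same representation...
assert np.allclose(normalize(a_dim), normalize(a_bright))

# ...but a scale-invariant readout (cosine similarity to a template)
# already ignored the gain, so inference gains nothing.
template = rng.normal(size=8)
def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cosine(a_dim, template), cosine(a_bright, template))

# 2) Normalizing ERRORS stabilizes what you learn: without it, the
#    backpropagated gradient from the bright trial is 10x larger, so a
#    fixed learning rate over- or under-shoots depending on lighting.
grad_dim, grad_bright = a_dim, a_bright  # stand-in error signals
step_raw = [np.linalg.norm(g) for g in (grad_dim, grad_bright)]
step_norm = [np.linalg.norm(normalize(g)) for g in (grad_dim, grad_bright)]
print("raw gradient step sizes:       ", step_raw)   # differ by 10x
print("normalized gradient step sizes:", step_norm)  # both close to 1
```

The same `normalize` function appears twice, which is the point: applied in the forward pass it is redundant with a gain-invariant readout, while applied to the error signal it equalizes update sizes across conditions.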
This splits the normalization question into two parts. One is about maintaining representational stability in the face of input variation. The other is about maintaining learning stability in the face of gradient variation. Conflating them treats learning as a direct extension of inference, but both the math and the experiments say the two problems call for different normalizations, applied to different signals, for different reasons.
The broader implication is that systems designed for real-time processing and systems designed for adaptation may need fundamentally different normalizations, even if they operate on the same substrate. What stabilizes current behavior might destabilize learning, and vice versa. Organizational structures face this constantly: the processes that keep daily operations smooth can be exactly what prevents adaptation to new conditions. The normalization that works for inference fails for plasticity. They’re not the same problem.