"The Waste Channel"

When a neural network perfectly fits noisy training data, it has memorized the noise. By classical learning theory, this should catastrophically degrade test performance. The model has learned the noise as if it were signal, and its predictions on new data should reflect this confusion.

Double descent shows this does not happen. Test performance initially degrades as the model begins to overfit, then recovers and improves as capacity grows past the interpolation threshold or as training continues. The network memorizes the noise and generalizes well, simultaneously.

The mechanism is not tolerance. The network does not “put up with” noise or average it out. It actively segregates signal from noise using extreme activations.

A single very large activation emerges in a shallow layer across all tested models during the double descent regime. This activation grows as training progresses through the overfitting phase, and its growth correlates with improved generalization. The activation is not a pathology; it is a structural feature. The network creates an internal waste channel: a representational pathway that absorbs the noise component of the data, leaving the remaining capacity available for clean signal representation.
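One way to look for such a channel, as a hedged sketch: flag units whose peak magnitude dwarfs the typical unit's. The function name and the threshold ratio below are assumptions for illustration, not a method stated in the text.

```python
import numpy as np

def find_waste_channels(acts, ratio=50.0):
    """Return indices of units whose peak magnitude dwarfs the typical unit.

    acts: (n_samples, n_units) activations from one layer.
    ratio: how many times the median per-unit peak a unit must exceed.
    """
    peaks = np.abs(acts).max(axis=0)   # per-unit peak magnitude
    typical = np.median(peaks)
    return np.flatnonzero(peaks > ratio * typical)

# Synthetic check: ordinary Gaussian units plus one extreme unit at index 7.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
acts[:, 7] *= 500.0
print(find_waste_channels(acts))  # → [7]
```

A ratio-to-median criterion is deliberately crude; it only asks whether one unit sits orders of magnitude above the rest, which is the signature the essay describes.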

The decomposition is functional. The noise flows into the extreme activation channel; the signal flows through the remaining activations. The two streams are separated not by explicit architecture but by the training dynamics themselves. Gradient descent, continuing past the overfitting threshold, discovers that the cheapest way to fit noise while maintaining generalization is to partition its representational capacity: one direction is dedicated to waste, and the rest is used for signal.
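Mechanically, the partition described here is a rank-one split of activation space. As an illustrative sketch (the direction `u` and the helper name are assumptions), the two streams fall out of projecting onto the waste direction and its orthogonal complement:

```python
import numpy as np

def split_streams(H, u):
    """Decompose activations into a waste component along u and the rest.

    H: (n, d) hidden activations; u: (d,) direction of the waste channel.
    Returns (waste, signal) with waste + signal == H exactly.
    """
    u = u / np.linalg.norm(u)
    waste = np.outer(H @ u, u)   # rank-1 component along the waste direction
    signal = H - waste           # everything orthogonal to it
    return waste, signal

rng = np.random.default_rng(1)
H = rng.normal(size=(8, 16))
u = np.zeros(16)
u[3] = 1.0                        # pretend unit 3 is the waste channel
waste, signal = split_streams(H, u)
assert np.allclose(waste + signal, H)     # the split loses nothing
assert np.allclose(signal[:, 3], 0.0)     # signal stream is empty on the waste axis
```

The point of the sketch is that a downstream readout wired to `signal` never sees what was routed into `waste`, even though the full activations fit everything.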

This reframes what “overfitting noise” means. The network is not confused about which data points are noisy. It has identified the noisy component and routed it into a disposal system. The memorization of noise and the generalization of signal are not contradictory objectives but complementary parts of a learned decomposition. The model that fits everything correctly has not failed to distinguish signal from noise. It has distinguished them and filed them separately.

Overfitting is not the failure mode. It is the mechanism by which the network builds its filter.
