The Tunnel Wall
Residual networks (ResNets) add skip connections that bypass layers, allowing the network to learn residual corrections to an identity mapping. Skip connections mitigate the degradation problem, in which deeper plain networks train worse than shallower ones, and have become a standard building block in deep learning. But the expressivity of narrow ResNets (networks whose hidden layer width is smaller than the input dimension) has remained poorly understood.
Kuehn, Kuntz, and Wohrer (arXiv:2603.28591, March 2026) identify a fundamental expressivity barrier. Narrow ResNets cannot represent critical points of input-output mappings without augmenting the input space. A critical point is where the Jacobian of the mapping drops rank — a fold, a cusp, a place where the output topology changes. These are the structurally important features of any nonlinear mapping.
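To make the definition of a critical point concrete, here is a minimal sketch that checks where a Jacobian drops rank. The map `fold` and the helpers `jacobian` and `jacobian_rank` are illustrative names chosen for this example, not anything from the paper, and the Jacobian is approximated by central differences rather than computed exactly.

```python
import numpy as np

def fold(p):
    """Toy fold map (x, y) -> (x**2, y); it creases the plane along x = 0."""
    x, y = p
    return np.array([x**2, y])

def jacobian(f, p, h=1e-6):
    """Central-difference approximation of the Jacobian of f at point p."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        cols.append((f(p + step) - f(p - step)) / (2 * h))
    return np.stack(cols, axis=1)

def jacobian_rank(f, p, tol=1e-4):
    """Numerical rank of the Jacobian; singular values below tol count as zero."""
    return int(np.linalg.matrix_rank(jacobian(f, p), tol=tol))

rank_regular = jacobian_rank(fold, [1.0, 0.5])   # away from the crease
rank_critical = jacobian_rank(fold, [0.0, 0.5])  # on the crease x = 0

print(rank_regular, rank_critical)
```

Away from the line x = 0 the Jacobian has full rank 2; on that line it has rank 1, which is the kind of fold the paragraph above describes.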
The mechanism is the “tunnel effect.” In a narrow ResNet, information passes through a bottleneck whose width is the number of residual channels. The skip connection preserves the original input alongside the residual, but the two must recombine at each layer. When the residual channels are narrower than the input, the network cannot fold the input-output mapping in the directions that the skip connection preserves. Critical points are pushed to infinity — they exist in the limit but never in finite depth.
The ratio of skip to residual channels creates two qualitatively different operating regimes. When the ratio is small (weak skip, wide residual), the network behaves like a multilayer perceptron — it can represent arbitrary critical points. When the ratio is large (strong skip, narrow residual), the network approaches a neural ODE — smooth, diffeomorphic, unable to fold. The transition between regimes is controlled by a single architectural parameter, and the paper provides explicit bounds showing where each architecture fundamentally fails.
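The two regimes can be illustrated with a one-dimensional toy, offered as an analogy and not as the paper's construction. It assumes a scalar residual map g(x) = x + scale * tanh(w * x), where `scale` and `w` are hypothetical parameters standing in for the strength of the residual branch relative to the skip path; they are not the channel ratio or the bounds from the paper. The map folds exactly where its derivative reaches zero.

```python
import numpy as np

def residual_derivative(x, scale, w=-1.0):
    """Derivative of g(x) = x + scale * tanh(w * x) with respect to x."""
    return 1.0 + scale * w * (1.0 - np.tanh(w * x) ** 2)

def has_critical_point(scale, xs=np.linspace(-5.0, 5.0, 2001)):
    """True if g'(x) is zero or negative somewhere on the grid, so g folds."""
    d = residual_derivative(xs, scale)
    return bool(d.min() <= 0.0)

# Assumed toy settings: w = -1, so g'(0) = 1 - scale is the minimum slope.
weak_residual = has_critical_point(scale=0.5)    # skip path dominates
strong_residual = has_critical_point(scale=2.0)  # residual path dominates

print(weak_residual, strong_residual)
```

With a small `scale` the identity term keeps the derivative positive everywhere, so the map is monotone and cannot fold, which mirrors the ODE-like regime. With a large `scale` the residual term can cancel the identity and a critical point appears, which mirrors the perceptron-like regime.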
The structural observation: the skip connection that makes ResNets trainable also limits their expressivity. The same mechanism that prevents gradient degradation — the shortcut that preserves the input — prevents the network from representing the topological changes that complex mappings require. The training benefit and the expressivity cost are two sides of the same architectural choice.