"The Rotated Defense"

The Rotated Defense

Large language models are vulnerable to bit-flip attacks: changing a handful of bits in the model’s weights can collapse its outputs. The vulnerability concentrates in dimensions where extreme activation outliers align with sensitive weight bits. Fix the outliers? That breaks accuracy. Fix the bits? That requires hardware-level protection. The vulnerability seems intrinsic.

Liu and Chen (arXiv:2603.16382) solve this without touching either the outliers or the bits. They apply orthogonal Householder transformations to the weight matrices, scattering the outlier activations across all dimensions. The mathematical content of the model is unchanged — orthogonal transformations are invertible and preserve inner products. The representation changes; the computation does not. Collapse rate drops from 3.15% to 0.00%. The number of bit flips needed to cause damage jumps from a few to over 17,000.
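A minimal sketch of the idea in NumPy, not the authors' code: reflect a layer's weights with a Householder matrix and counter-rotate the input. Because the matrix is orthogonal and involutory, the layer's output is bit-for-bit the same computation, but the outlier that sat in a single coordinate is spread across all dimensions. The dimension, the outlier value, and the choice of reflection direction are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W = rng.normal(size=(d, d))   # layer weights, computing y = W @ x
x = rng.normal(size=d)
x[3] = 50.0                   # extreme activation outlier in dimension 3 (assumed)

# Householder vector chosen to reflect e_3 onto the uniform direction u,
# so the outlier coordinate is spread evenly across all d dimensions.
e3 = np.zeros(d); e3[3] = 1.0
u = np.ones(d) / np.sqrt(d)
v = e3 - u
H = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)  # orthogonal, symmetric, H @ H = I

# Rotate the stored representation, counter-rotate the input.
W_rot = W @ H
x_rot = H @ x

# The computation is unchanged: W' x' = W H H x = W x (H is involutory).
assert np.allclose(W_rot @ x_rot, W @ x)

# The outlier no longer dominates any single coordinate.
print("largest |x| before rotation:", np.abs(x).max())
print("largest |x| after rotation: ", np.abs(x_rot).max())
```

The point the demo makes is the paper's: the stored numbers `W_rot` are different from `W`, so a bit flip in any one of them corrupts a mixture of directions rather than the one sensitive outlier dimension, yet the function the layer computes is identical.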

The insight is that vulnerability is a property of representation, not content. The same information, expressed in different coordinates, becomes unattackable. The outliers still exist — they just no longer align with single bits that can be flipped. The dangerous alignment between how information is stored and how it can be corrupted is a geometric accident of the original coordinate system, not a fact about the model.

This is rotation as defense. Not changing what you know but changing how you express it. The same principle operates in any system where an attacker exploits the structure of the representation rather than the content of the computation: cryptography (same plaintext, different encoding), steganography (same message, different carrier), even rhetoric (same argument, different framing). The content is identical. The representation is the variable. And representation is the attack surface.
