"The Routing Layer"
Alignment evaluation typically tests whether a language model refuses dangerous requests. The model is given a harmful prompt. If it refuses, it passes. If it complies, it fails. The refusal rate across a benchmark of harmful prompts becomes the safety score. A higher refusal rate is read as better alignment.
This evaluation framework misses the layer where alignment actually lives.
Within a single model family, hard refusal can drop to zero while narrative steering rises to its maximum. The model stops saying “I can’t help with that” and starts redirecting conversations, changing topics, and embedding safety messages within helpful-seeming responses. The refusal rate plummets. The benchmark says alignment has degraded. But when humans judge conversation outcomes rather than count refusals, the model’s actual behavior has not become more dangerous. It has become more sophisticated.
The mechanism has three layers: detection, routing, and generation. Detection identifies whether a request is harmful. Generation produces the output. Routing connects detection to generation — it determines what the model does with the information that a request is harmful.
Detection is trivial. Models detect harmful content with near-perfect accuracy regardless of alignment training. Generation is flexible. Models can produce refusals, redirections, warnings, partial compliance, or full compliance. The interesting layer is routing: given that the model has detected harm, how does it choose which generative pathway to activate?
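The three layers can be sketched as three small functions. This is a toy illustration under stated assumptions, not a description of how any real model is built: the keyword list, the Route names, and the canned responses are all hypothetical stand-ins. The point is only that detection and generation stay fixed while the routing step decides which generative pathway runs.

```python
from enum import Enum


class Route(Enum):
    """Generative pathways the router can pick (names are illustrative)."""
    REFUSE = "refuse"
    REDIRECT = "redirect"
    EMBED_WARNING = "embed_warning"
    COMPLY = "comply"


# Hypothetical marker phrases standing in for a real harm classifier.
HARM_MARKERS = ("build a bomb", "steal a password")


def detect(prompt: str) -> bool:
    """Detection: flag the request as harmful or not (toy keyword match)."""
    text = prompt.lower()
    return any(marker in text for marker in HARM_MARKERS)


def route(is_harmful: bool, harm_policy: Route) -> Route:
    """Routing: decide what to do with the detection signal."""
    return harm_policy if is_harmful else Route.COMPLY


def generate(prompt: str, chosen: Route) -> str:
    """Generation: produce the output for the chosen pathway (canned text)."""
    if chosen is Route.REFUSE:
        return "I can't help with that."
    if chosen is Route.REDIRECT:
        return "Let's talk about what you are trying to achieve instead."
    if chosen is Route.EMBED_WARNING:
        return "Some general background on the topic, with a safety note attached."
    return f"Direct answer to: {prompt}"


def respond(prompt: str, harm_policy: Route) -> str:
    """Full pipeline: detect, then route, then generate."""
    return generate(prompt, route(detect(prompt), harm_policy))
```

Calling respond("How do I build a bomb?", Route.REFUSE) and respond("How do I build a bomb?", Route.REDIRECT) runs the same detector on the same prompt and gets the same detection result; only the routing choice, and therefore the output, differs.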
Refusal-based evaluation tests only one routing pathway: detect → refuse. When the model learns alternative pathways — detect → redirect, detect → steer, detect → embed-warning — the refusal rate drops without the model becoming less safe. The evaluation measures the wrong output of the right process.
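A small sketch makes the measurement gap concrete. Everything here is an assumption made for illustration: the Transcript record, the judged_harmful label (standing in for a human judgment of the conversation outcome), and the two hand-written sets of transcripts are invented toy inputs, not results from any benchmark.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Transcript:
    """One response to a harmful prompt (hypothetical record format)."""
    route: str            # "refuse", "redirect", "steer", "embed-warning", "comply"
    judged_harmful: bool  # stand-in for a human judgment of the outcome


def refusal_rate(transcripts: Sequence[Transcript]) -> float:
    """The benchmark metric: fraction of responses that are hard refusals."""
    return sum(t.route == "refuse" for t in transcripts) / len(transcripts)


def safe_outcome_rate(transcripts: Sequence[Transcript]) -> float:
    """Outcome metric: fraction of responses not judged harmful."""
    return sum(not t.judged_harmful for t in transcripts) / len(transcripts)


# Toy data, invented for illustration only.
always_refuses = [Transcript("refuse", False) for _ in range(4)]
never_refuses = [
    Transcript("redirect", False),
    Transcript("steer", False),
    Transcript("embed-warning", False),
    Transcript("redirect", False),
]

for name, runs in [("always_refuses", always_refuses), ("never_refuses", never_refuses)]:
    print(name, refusal_rate(runs), safe_outcome_rate(runs))
```

On this toy input the refusal rate falls from 1.0 to 0.0 while the safe-outcome rate stays at 1.0 for both sets, which is the pattern the paragraph above describes: the metric moves, the judged outcome does not.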
The deeper implication: alignment and censorship are the same mechanism. Both involve detection of content categories and routing to predetermined behavioral policies. The difference is not structural but normative — which routing table you install. Evaluating only the endpoints (can it detect? does it refuse?) systematically misses the routing layer where the actual behavioral decision lives. The mechanism is invisible to the measurement.
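The claim that the difference lies in the installed table rather than in the structure can be illustrated by treating the routing table as plain data. This is a hypothetical sketch: the keyword detector, the category names, and both tables are made up for the example. The code path is identical in both cases, and only the mapping from detected category to behavior changes.

```python
from typing import Mapping, Optional

# Hypothetical keyword-to-category map standing in for a learned detector.
KEYWORDS = {"explosive": "weapons", "protest": "protest"}


def detect_category(prompt: str) -> Optional[str]:
    """Return the first matching content category, or None if nothing matches."""
    text = prompt.lower()
    for keyword, category in KEYWORDS.items():
        if keyword in text:
            return category
    return None


def route(prompt: str, table: Mapping[str, str]) -> str:
    """Look up the behavior for the detected category; default is to comply."""
    category = detect_category(prompt)
    if category is None:
        return "comply"
    return table.get(category, "comply")


# Two illustrative routing tables; same mechanism, different norms.
SAFETY_TABLE = {"weapons": "refuse"}
CENSORSHIP_TABLE = {"protest": "redirect"}
```

With these tables, route("how to organize a protest", SAFETY_TABLE) returns "comply" and route("how to organize a protest", CENSORSHIP_TABLE) returns "redirect". Nothing about detect_category or route changes between the two calls, so a test that inspects only the detector or only the refusal output cannot tell which table is installed.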