"The Geometric Hallucination"

The Geometric Hallucination

Hallucinations in language models are usually treated as a single failure mode — the model says something wrong. Korun tests whether different types of hallucination are geometrically distinguishable in the model’s embedding space.

Using GPT-2 and controlled induction prompts, he classifies hallucinations into three geometric types: coverage gaps (the model has no representation for what's needed), center drift (the representation exists but the model converges to the wrong center), and wrong-well convergence (the model falls into a nearby attractor). Of these three, only coverage gaps show robust geometric separation — distinguishable in 18 out of 20 runs, and carried by magnitude rather than direction: the token embeddings for coverage-gap hallucinations differ measurably in norm, not in where they point. The other two types look geometrically identical to correct outputs.
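A minimal sketch of what a magnitude-versus-direction separation looks like. This is synthetic data, not the paper's embeddings or method: two groups of vectors share the same directional distribution but differ in scale, so vector norms separate them while cosine similarity does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token embeddings (illustrative, not from the paper):
# "correct" and "coverage-gap" groups drawn from the same directional
# distribution, with the gap group scaled up in magnitude.
dim = 768
correct = rng.normal(size=(200, dim))
gap = rng.normal(size=(200, dim)) * 1.3  # same directions, larger norms

def norms(x):
    return np.linalg.norm(x, axis=1)

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Magnitude separates the groups...
print("mean norm, correct:", norms(correct).mean())
print("mean norm, gap:    ", norms(gap).mean())

# ...while direction does not: mean cosine similarity to a shared
# reference direction is essentially identical for both groups.
center = unit(rng.normal(size=(1, dim)))
print("mean cosine, correct:", float((unit(correct) @ center.T).mean()))
print("mean cosine, gap:    ", float((unit(gap) @ center.T).mean()))
```

A detector built on this signal would threshold the norm, ignoring direction entirely — which is why, per the summary, only coverage gaps are detectable from the embedding alone.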

The through-line: not all hallucinations are the same failure, and only one type leaves a geometric signature. Coverage gaps — where the model literally lacks the right representation — produce hallucinations you could, in principle, detect from the embedding alone. Center drift and wrong-well convergence produce hallucinations that are geometrically indistinguishable from truth.

A caveat the paper surfaces: token-level statistical tests inflate significance by 4-16x through pseudoreplication — tokens within a generation aren’t independent. The genuine result holds only at the prompt level.
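The pseudoreplication caveat can be illustrated with a toy simulation (my own sketch, not the paper's analysis): when tokens within a generation share a per-prompt effect, a token-level standard error treats 1,000 correlated observations as independent and understates uncertainty relative to the honest prompt-level estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 20 prompts, 50 tokens each, with a strong
# per-prompt random effect so tokens within a generation are
# correlated rather than independent.
n_prompts, n_tokens = 20, 50
prompt_effect = rng.normal(0.0, 1.0, size=(n_prompts, 1))
token_noise = rng.normal(0.0, 1.0, size=(n_prompts, n_tokens))
scores = prompt_effect + token_noise

# Token-level standard error pretends all 1,000 tokens are independent.
se_token = scores.std(ddof=1) / np.sqrt(scores.size)

# Prompt-level standard error uses one mean per generation (n = 20).
prompt_means = scores.mean(axis=1)
se_prompt = prompt_means.std(ddof=1) / np.sqrt(n_prompts)

# The token-level test understates uncertainty by roughly this factor,
# which is what inflates apparent significance.
print("SE ratio (prompt / token):", se_prompt / se_token)
```

The stronger the within-prompt correlation, the larger the ratio — which is why the genuine result only holds when tested at the prompt level.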

Some lies are detectable. Others look exactly like truth. The geometry tells you which is which — when it works.
