The Misalignment Mirage

CLIP trains image and text embeddings jointly — but when you use the image embeddings alone (for retrieval, few-shot classification), the performance seems suboptimal. The prevailing explanation: contrastive language-image training neglects intra-modal structure. Images that should be close in embedding space aren’t, because the training objective only cares about image-text pairs, not image-image relationships.
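The objective in question is a symmetric contrastive (InfoNCE-style) loss. A minimal pure-Python sketch with toy embeddings makes the structural point concrete: every term in the loss is an image-text similarity; no image-image or text-text similarity ever appears.

```python
import math

def softmax_xent(logits, target):
    """Cross-entropy of a softmax over `logits` against index `target`."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive loss over matched (image, text) pairs.
    Note: logits[i][j] is always <img_i, txt_j> -- cross-modal only."""
    n = len(img_embs)
    logits = [[sum(a * b for a, b in zip(img_embs[i], txt_embs[j])) / temperature
               for j in range(n)] for i in range(n)]
    img_to_txt = sum(softmax_xent(logits[i], i) for i in range(n)) / n
    txt_to_img = sum(softmax_xent([logits[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

# Toy unit vectors: matched pairs point the same way, so loss is near zero.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(imgs, txts))
```

This is a sketch, not CLIP's actual implementation (no batching, no learned temperature), but it shows why one might expect "degrees of freedom" in image-image distances: they are never directly penalized.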

Herzog and Wang dismantle this explanation. They reexamine both the theoretical argument and the empirical evidence, finding that the supposed “degrees of freedom” for image embedding distances don’t actually exist. The contrastive loss constrains inter-modal alignment, yes — but those constraints propagate to intra-modal distances more than the original argument acknowledged.
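One way to see the propagation (my gloss, not the paper's derivation): on the unit sphere, cross-modal distances bound intra-modal ones via the triangle inequality, so two images pulled toward the same caption cannot end up far from each other. A numeric check with made-up embeddings:

```python
import math

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# One caption and two images, all unit-normalized; training pulls both
# images toward the caption even though no image-image term exists.
text  = normalize([1.0, 0.2, 0.0])
img_a = normalize([1.0, 0.0, 0.1])
img_b = normalize([1.0, 0.3, -0.1])

# Triangle inequality: d(img_a, img_b) <= d(img_a, text) + d(text, img_b).
# Shrinking the two cross-modal terms constrains the intra-modal one.
bound = dist(img_a, text) + dist(text, img_b)
print(dist(img_a, img_b), "<=", bound)
```

The hypothesized extra freedom in image-image distances would have to live outside bounds like this one, which is roughly the gap the paper argues does not exist.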

The central claim: the problem attributed to intra-modal misalignment is actually task ambiguity. When you add methods that appear to “fix” misalignment and get better results, what you’re actually doing is resolving ambiguity — providing the model with additional signal about what the task requires. The improvement isn’t because the embeddings were misaligned; it’s because the task was underspecified.
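A toy illustration of the distinction (hypothetical embeddings and task names, not from the paper): a query image can be equally similar to two candidates because the image alone does not say *which* notion of similarity the task wants. Adding a task signal breaks the tie without changing the image embedding at all.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

# Hypothetical setup: the query image is equally similar to both candidates.
# The embedding is not "misaligned" -- the task is underspecified.
query_img = [1.0, 1.0]
candidates = {"same_breed": [1.0, 0.0], "same_pose": [0.0, 1.0]}
print({k: round(cos(query_img, v), 3) for k, v in candidates.items()})

# Extra signal -- here, crudely, adding a task-prompt embedding to the
# query -- resolves the ambiguity. Purely illustrative, not a real method.
task_prompt = [1.0, 0.0]  # e.g. "retrieve images of the same breed"
query = [i + t for i, t in zip(query_img, task_prompt)]
best = max(candidates, key=lambda k: cos(query, candidates[k]))
print(best)  # "same_breed"
```

On this reading, methods that inject such signals improve results by specifying the task, which looks indistinguishable from "fixing" the representation if you only watch the benchmark number.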

The evidence: image-only models (DINO, SigLIP2) show the same performance patterns as language-image models (CLIP, SigLIP). If misalignment from cross-modal training were the cause, image-only models should be immune. They’re not.

This is a correction that matters. The narrative that cross-modal training damages intra-modal structure was becoming a design principle — motivating architectural choices and training modifications. If the real issue is task specification, the interventions should target the interface, not the representation.

The diagnosis was wrong. The treatment worked for other reasons.
