"The Unnecessary Rewrite"
Query rewriting is widely assumed to improve retrieval. The user types a vague query; the system reformulates it into something more specific, adds synonyms, resolves ambiguities. For sparse retrieval (BM25, TF-IDF), this usually helps recall: more terms mean more matching opportunities. The mechanism is additive: additional words in the query match additional documents.
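The additive mechanism can be illustrated with a toy term-overlap matcher. This is a minimal sketch, not BM25 or TF-IDF: it only checks whether a query and a document share at least one word, and the documents, queries, and function names are made up for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns a set of terms."""
    return set(text.lower().split())


def matching_docs(query, docs):
    """Return the ids of documents sharing at least one term with the query."""
    query_terms = tokenize(query)
    return {doc_id for doc_id, text in docs.items() if query_terms & tokenize(text)}


# Hypothetical mini-corpus: two documents on the same topic with different vocabulary.
docs = {
    "d1": "how to fix a flat bicycle tire",
    "d2": "repairing a punctured bike tyre at home",
    "d3": "baking sourdough bread",
}

original = "fix bicycle tire"
rewritten = "fix repair bicycle bike tire tyre punctured"  # original plus synonyms

before = matching_docs(original, docs)
after = matching_docs(rewritten, docs)

print(sorted(before))  # ['d1']
print(sorted(after))   # ['d1', 'd2']
```

Because the rewritten query keeps every original term, the set of matched documents can only grow. A real sparse ranker also reweights scores, which this sketch does not model.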
For dense retrieval — where queries and documents are embedded in a shared vector space and matched by proximity — rewriting can hurt. The expanded query maps to a different region of the embedding space, potentially farther from the relevant documents than the original query. The mechanism is geometric: the rewrite moves the query vector, and the movement is not guaranteed to be toward the correct documents.
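The geometric mechanism can be sketched with hand-picked two-dimensional vectors and cosine similarity. The vectors below are assumptions chosen for illustration, not the output of any embedding model; they only show that moving the query vector can lower its similarity to the relevant document.

```python
import math


def cosine(a, b):
    """Cosine similarity between two 2-D vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical embeddings (assumed values, not produced by a real encoder).
relevant_doc = (1.0, 0.0)
original_query = (0.9, 0.1)   # already close to the relevant document
rewritten_query = (0.6, 0.8)  # added terms pull the vector toward another region

sim_before = cosine(original_query, relevant_doc)
sim_after = cosine(rewritten_query, relevant_doc)

print(f"before rewrite: {sim_before:.3f}")
print(f"after rewrite:  {sim_after:.3f}")
```

In this toy setup the rewrite lowers the similarity, since nothing constrains the displacement to point toward the relevant document.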
The finding is that not all queries need rewriting, and the queries that benefit are identifiable in advance. Queries that are already well-formed — specific, unambiguous, close to the vocabulary of the target documents — degrade under rewriting because the original embedding is already near the relevant cluster. Rewriting pushes them away. Queries that are vague or misaligned benefit because the original embedding is far from everything, and rewriting moves it closer to something.
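The asymmetry can be made concrete by applying the same displacement to two different starting points. In this sketch a rewrite is modelled, as a simplifying assumption, by adding a fixed expansion vector to the query embedding; all vectors and names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two 2-D vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rewrite_gain(query, expansion, target):
    """Change in similarity to the target after adding the expansion vector."""
    rewritten = tuple(q + e for q, e in zip(query, expansion))
    return cosine(rewritten, target) - cosine(query, target)


relevant_cluster = (1.0, 0.0)   # assumed centroid of the relevant documents
expansion = (0.7, 0.7)          # assumed shift: only partly on-topic

well_formed_query = (1.0, 0.05)  # starts near the relevant cluster
vague_query = (0.1, 1.0)         # starts far from the relevant cluster

gain_well_formed = rewrite_gain(well_formed_query, expansion, relevant_cluster)
gain_vague = rewrite_gain(vague_query, expansion, relevant_cluster)

print(f"well-formed query gain: {gain_well_formed:+.3f}")  # negative
print(f"vague query gain:       {gain_vague:+.3f}")        # positive
```

The same shift hurts the query that started near the cluster and helps the one that started far away, which is the pattern described above.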
The practical system selectively rewrites: a classifier predicts whether each query will benefit, and only those queries are expanded. The selective approach outperforms both always-rewriting and never-rewriting.
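A selective pipeline of this shape might be wired up as follows. This is a sketch under stated assumptions: the text does not specify the classifier or the rewriter, so `toy_predicts_benefit` (a word-count heuristic) and `toy_rewrite` are hypothetical stand-ins for trained components.

```python
from typing import Callable


def selective_rewrite(
    query: str,
    predicts_benefit: Callable[[str], bool],
    rewrite: Callable[[str], str],
) -> str:
    """Rewrite the query only when the gate predicts that rewriting will help."""
    return rewrite(query) if predicts_benefit(query) else query


def toy_predicts_benefit(query: str) -> bool:
    """Hypothetical stand-in for a classifier: treat very short queries as vague."""
    return len(query.split()) < 3


def toy_rewrite(query: str) -> str:
    """Hypothetical stand-in for a rewriter: append generic clarifying terms."""
    return query + " troubleshooting guide"


vague = selective_rewrite("printer", toy_predicts_benefit, toy_rewrite)
specific = selective_rewrite(
    "HP LaserJet 1020 paper jam error", toy_predicts_benefit, toy_rewrite
)

print(vague)     # printer troubleshooting guide
print(specific)  # HP LaserJet 1020 paper jam error
```

Passing the gate and the rewriter as functions keeps the control flow independent of how either is implemented, so a learned classifier could replace the heuristic without changing the pipeline.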
The structural lesson: optimization techniques are not universally beneficial even within a single system. The same operation (query expansion) helps in one retrieval paradigm (sparse) and hurts in another (dense) for the same query, because the mechanisms are different (additive matching vs. geometric proximity). The improvement is not in the technique but in knowing when to apply it. Meta-optimization — choosing when to optimize — outperforms optimization itself.