The Contaminated Control

Recent work claimed that reinforcement learning with verifiable rewards (RLVR) is robust to noisy data — that models trained even on 100% incorrect labels still learn effectively. The implication was reassuring: data quality doesn’t matter much when the reward signal can be programmatically verified.

Zhu and Kang show the claim is wrong. The “100% noisy” training data was contaminated with clean data — the experimental control wasn’t clean. After rigorous re-verification, genuinely incorrect data degrades performance by 8-10% on mathematical reasoning tasks. Existing algorithmic improvements designed to handle noise provide no advantage over basic GRPO when the noise is real.

The through-line: the robustness was an artifact of data leakage, not a property of the method. When the control condition is broken — when “noisy” data secretly contains signal — any method appears robust. The noise tolerance was never there.
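A minimal sketch of the audit that exposes this kind of leakage: check every supposedly wrong label against a ground-truth verifier and measure how many are secretly correct. The `verify` function and the toy integer-addition dataset below are hypothetical stand-ins, not the paper's actual setup.

```python
# Sketch: auditing a supposedly "100% noisy" label set, assuming a
# hypothetical verify(question, answer) -> bool ground-truth checker.
def contamination_rate(dataset, verify):
    """Fraction of 'noisy' labels that are actually correct."""
    clean = sum(1 for q, noisy_answer in dataset if verify(q, noisy_answer))
    return clean / len(dataset)

# Toy stand-in: integer-addition "questions" with a trivial verifier.
def verify(question, answer):
    a, b = question
    return a + b == answer

# Labels intended to be 100% wrong -- but one is accidentally correct.
dataset = [((1, 2), 4), ((2, 2), 4), ((3, 5), 9)]
print(contamination_rate(dataset, verify))  # 1 of 3 "noisy" labels is clean
```

A nonzero rate here means the "fully noisy" condition is no such thing, and any robustness measured against it is suspect.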

The finding extends to real-world annotation: on Text2SQL tasks, human annotation errors (which are genuinely noisy, with no accidental clean data mixed in) yield 5-12% lower accuracy than clean data. Current methods cannot compensate for poor data quality; the gap between clean and noisy is real and cannot be closed by algorithmic means alone.

This is a methodological correction, not just an empirical one. It demonstrates that claims about noise robustness require verifying the noise itself — that the control condition is as controlled as it claims to be. When the noise is fake, the robustness is fake.
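Verifying the noise cuts both ways: when constructing a noisy condition, the incorrectness should be enforced, not assumed. A hedged sketch of one way to do that, reusing the same hypothetical toy verifier as above — perturb each answer until the verifier rejects it, guaranteeing zero accidental clean data:

```python
import random

def corrupt(question, true_answer, verify, rng):
    """Perturb the answer until the verifier rejects it,
    guaranteeing a genuinely incorrect label."""
    while True:
        candidate = true_answer + rng.choice([-3, -2, -1, 1, 2, 3])
        if not verify(question, candidate):
            return candidate

# Toy stand-in verifier, as before.
def verify(question, answer):
    a, b = question
    return a + b == answer

rng = random.Random(0)
questions = [(1, 2), (2, 2), (3, 5)]
noisy = [(q, corrupt(q, sum(q), verify, rng)) for q in questions]
assert all(not verify(q, a) for q, a in noisy)  # 0% contamination, by construction
```

Under a control built this way, any measured robustness would actually be robustness to noise, not to disguised signal.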

The method was robust to clean data disguised as noise.
