"The Calibration Bottleneck"

The Calibration Bottleneck

Fourteen document forgery detection methods, eight datasets, zero fine-tuning: Zhao et al. evaluate each published method in the deployment condition nobody tests, zero-shot and out of the box. The results expose a universal failure that has nothing to do with detection capability.

Every method achieves moderate Pixel-AUC scores (0.76+). Every method produces near-zero Pixel-F1 scores. The methods can distinguish tampered from untampered pixels. They cannot set a threshold that captures the distinction.
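The mechanism is easy to reproduce. The sketch below uses synthetic scores, not the paper's data: a detector that ranks tampered pixels almost perfectly but outputs low absolute scores, so Pixel-AUC is high while Pixel-F1 at the default threshold of 0.5 collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical zero-shot scores: tampered pixels are 1% of the image,
# and the detector separates the classes but scores them well below 0.5
# (a common effect of domain shift on score scale).
neg = rng.normal(0.05, 0.05, 99_000)   # untampered pixel scores
pos = rng.normal(0.30, 0.05, 1_000)    # tampered pixel scores: clearly higher

# Pixel-AUC via the rank-sum (Mann-Whitney) formulation.
scores = np.concatenate([neg, pos])
ranks = scores.argsort().argsort() + 1            # ranks 1..N
auc = (ranks[len(neg):].mean() - (len(pos) + 1) / 2) / len(neg)

# Pixel-F1 at the default threshold of 0.5.
tp = (pos > 0.5).sum()
fp = (neg > 0.5).sum()
fn = (pos <= 0.5).sum()
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(f"AUC={auc:.3f}  F1@0.5={f1:.3f}")  # high AUC, near-zero F1
```

The ranking metric only sees the order of the scores; the thresholded metric sees their absolute values, and at 0.5 almost nothing is flagged.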

The cause is geometric: tampered regions comprise only 0.27-4.17% of document pixels. The default threshold of 0.5, calibrated for roughly balanced classes, is catastrophically wrong when the positive class occupies less than 5% of the image. The methods aren’t failing to detect forgery. They’re failing to threshold their own detections.

The through-line: the bottleneck is calibration, not capability. Adjusting the threshold using just 10 sample images from the target domain recovers 39-55% of the performance gap. Ten images. The architecture, the training, the feature extraction: all secondary to one scalar parameter that nobody tunes for deployment.
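That calibration step is simple enough to sketch. Assuming per-pixel score maps and ground-truth masks for a handful of target-domain images (the function and its names are mine, not the paper's; the exact procedure in the paper may differ), a brute-force sweep picks the F1-maximizing threshold:

```python
import numpy as np

def calibrate_threshold(score_maps, masks, grid=np.linspace(0.0, 1.0, 101)):
    """Sweep thresholds over a few calibration images, return the F1-best one.

    score_maps: list of 2-D float arrays, per-pixel forgery scores in [0, 1]
    masks:      list of 2-D binary arrays, 1 = tampered pixel
    """
    scores = np.concatenate([s.ravel() for s in score_maps])
    labels = np.concatenate([m.ravel() for m in masks]).astype(bool)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = scores > t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1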

This matters because real-world forensic deployment is always zero-shot. The forged documents you encounter weren’t in your training set. If your detector’s accuracy depends on threshold calibration, and calibration requires domain-specific samples, then the 10-image calibration step isn’t optional — it’s the entire deployment protocol.

The model knows more than its threshold reveals.

