"The Forensic Token"
The Forensic Token
Vision-language models can detect image forgeries and explain what they found. The cost: processing every token in a high-resolution image. For forensic applications at scale — screening millions of images for deepfakes — this is prohibitive.
Lai et al. show that 90% of the tokens are unnecessary. ForensicZip retains only 10% of visual tokens while maintaining detection accuracy, achieving a 2.97x speedup with over 90% compute reduction. The key: which tokens to keep. Standard semantic compression preserves the most “meaningful” regions — faces, objects, text. But manipulation artifacts hide in the opposite places: high-frequency anomalies in backgrounds, subtle inconsistencies in texture regions, the forensically boring parts of the image.
The through-line: forensic evidence lives where semantic content doesn’t. The tokens a human would discard — featureless backgrounds, uniform textures — are exactly where forgery leaves traces. ForensicZip uses optimal transport to track how tokens evolve across model layers (manipulation artifacts shift differently than genuine content) and high-frequency analysis to score forensic relevance independently of semantic importance.
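To make the idea concrete, here is a minimal sketch of frequency-based token scoring: split an image into patches ("tokens"), score each by its high-frequency spectral energy, and keep the top 10%. The function names, patch size, and scoring rule are assumptions for illustration, not ForensicZip's actual pipeline (which also uses optimal transport across layers).

```python
import numpy as np

def high_freq_score(patch: np.ndarray, cutoff: int = 2) -> float:
    """Energy outside the lowest spatial frequencies of a square patch.

    Smooth regions concentrate energy near DC; noise-like manipulation
    residue spreads energy into high frequencies.
    """
    spec = np.fft.fftshift(np.fft.fft2(patch))
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    mask = np.ones(spec.shape, dtype=bool)
    mask[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1] = False
    return float(np.sum(np.abs(spec[mask]) ** 2))

def select_tokens(image: np.ndarray, patch: int = 16,
                  keep_ratio: float = 0.10) -> np.ndarray:
    """Tile a grayscale image into patches; return indices of the
    top-scoring keep_ratio fraction (hypothetical selection rule)."""
    H, W = image.shape
    scores = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            scores.append(high_freq_score(image[i:i + patch, j:j + patch]))
    scores = np.array(scores)
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:k]

# Toy check: a smooth image with one noise-injected ("tampered") patch.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[16:32, 16:32] += rng.normal(0.0, 1.0, (16, 16))
kept = select_tokens(img, patch=16, keep_ratio=0.10)
# With 16 patches and keep_ratio=0.10, exactly one token survives — the
# noisy one — even though it is semantically empty.
```

Note what a semantic-saliency selector would do with the same image: the tampered region carries no object or face, so it would be the first token dropped.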
This inverts the normal compression heuristic. In object detection, you keep the salient tokens. In forensic detection, you keep the non-salient ones. The evidence is in the residuals — what the forger didn’t think to match because they were focused on the semantically important regions.
The forger’s attention is the detector’s advantage.