"The Gender Mask"

The Gender Mask

Audio deepfake detection systems report aggregate accuracy — a single number for how well they distinguish real from synthetic speech. Fursule, Kshirsagar, and Avila show this single number hides a structured failure.

When they disaggregate Equal Error Rate by gender, systems that appear equitable by standard benchmarks reveal significant disparities. The detection performance differs between male and female voices — sometimes favoring one, sometimes the other, depending on the system and the type of synthetic speech. The aggregate EER masks these differences because it averages over a bimodal distribution.

The through-claim: the fairness problem is a measurement problem. Standard evaluation metrics are designed to summarize performance, not to reveal structure. When the underlying performance has structure — consistent differences across demographic groups — the summary statistic actively conceals the problem. A system with 5% EER overall could have 2% on male voices and 8% on female voices. The aggregate number is accurate. It is also misleading.
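A minimal sketch of that measurement gap, using synthetic Gaussian detector scores rather than any real system or the paper's data: the aggregate EER lands between the per-group numbers and says nothing about the gap between them. The eer helper, the score distributions, the separation values, and the group sizes are all illustrative assumptions; only scikit-learn's roc_curve is a real API.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """Equal Error Rate: operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)

def make_group(n, separation):
    """Toy detector scores: 1 = bonafide, 0 = spoof; larger separation = easier to detect."""
    labels = np.r_[np.ones(n), np.zeros(n)]
    scores = np.r_[rng.normal(separation, 1.0, n), rng.normal(0.0, 1.0, n)]
    return labels, scores

# Hypothetical setup: the detector separates male voices better than female voices.
m_labels, m_scores = make_group(5000, separation=4.0)   # well separated
f_labels, f_scores = make_group(5000, separation=2.8)   # less separated

all_labels = np.r_[m_labels, f_labels]
all_scores = np.r_[m_scores, f_scores]

print(f"aggregate EER: {eer(all_labels, all_scores):.3f}")  # a single, reasonable-looking number
print(f"male EER:      {eer(m_labels, m_scores):.3f}")      # noticeably lower
print(f"female EER:    {eer(f_labels, f_scores):.3f}")      # noticeably higher
```

The disaggregated report is the whole point: pooling the scores before computing EER produces one defensible-looking number, while the same scores split by group show the disparity directly.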

The fix isn’t better algorithms — it’s better evaluation. Disaggregated metrics, demographic-specific benchmarks, and fairness-aware training objectives require knowing that the problem exists. The aggregate metrics make the problem invisible, which makes the fix impossible to justify.

Deepfake detection works. It just works differently for different people. The aggregate score is a mask.

