"The Acoustic Reference"

The Acoustic Reference

Text-to-audio generation has a vocabulary problem. The word “crash” covers a ceramic plate, a metal tray, a glass window, and a car wreck: acoustically unrelated sounds sharing a single label. Existing video-to-audio systems condition on text captions, inheriting this ambiguity. The generated sounds are semantically correct but acoustically vague.

Fang et al. bypass text entirely. AC-Foley conditions audio generation on a reference audio clip rather than a text description. Show the system a video of something hitting a metal surface and give it the sound of a different metal impact: it generates audio synchronized to the video, carrying the timbre of the reference. The text caption is unnecessary.
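
The paper’s exact architecture isn’t reproduced here, but the conditioning idea can be sketched. Below is a minimal PyTorch stand-in, assuming a generic cross-attention design in which video features supply the timing and a reference mel spectrogram supplies the timbre; the module names, dimensions, and the architecture itself are illustrative placeholders, not AC-Foley’s actual design.

```python
# Minimal sketch of reference-audio conditioning; all shapes and module
# names are placeholders, not AC-Foley's actual architecture.
import torch
import torch.nn as nn

class ReferenceConditionedGenerator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(512, dim)  # assumed video-feature size
        self.ref_proj = nn.Linear(128, dim)    # assumed mel-band count
        # Cross-attention: each video frame attends over the reference
        # clip, pulling in timbre while the video drives the timing.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.head = nn.Linear(dim, 128)        # predict mel frames

    def forward(self, video_feats, ref_mel):
        # video_feats: (batch, video_frames, 512), timing and semantics
        # ref_mel:     (batch, ref_frames, 128), timbre and spectral detail
        q = self.video_proj(video_feats)
        kv = self.ref_proj(ref_mel)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.head(fused)                # (batch, video_frames, 128)

gen = ReferenceConditionedGenerator()
video = torch.randn(1, 120, 512)  # ~4 s of video features at 30 fps
ref = torch.randn(1, 80, 128)     # reference clip as a mel spectrogram
print(gen(video, ref).shape)      # torch.Size([1, 120, 128])
```

Swapping `ref` for a different clip changes the timbre of the output without touching the timing, which is the behavior described above.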

The central claim: audio is its own best description. A text prompt compresses acoustic detail into categorical labels. A reference audio sample preserves the exact spectral characteristics, temporal envelope, and timbral quality that text discards. Conditioning on audio rather than text doesn’t just add precision; it removes a lossy compression step from the pipeline.
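
Those properties are measurable. A small numpy sketch, with synthetic signals standing in for two recordings a caption would both label “crash”, makes the point: the temporal envelope and spectral centroid separate the sounds cleanly even though the label cannot.

```python
# Two synthetic "crash"-labeled sounds: a bright glass-like burst and a
# duller metal-like burst. Same label, different acoustics.
import numpy as np

sr = 16000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
glass = np.exp(-t * 20) * np.sin(2 * np.pi * 4000 * t)
metal = np.exp(-t * 6) * np.sin(2 * np.pi * 600 * t)

def describe(x, sr, frame=512):
    # Temporal envelope as per-frame RMS; timbre proxy as spectral centroid.
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame, d=1 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-9)
    return rms, centroid

for name, x in [("glass", glass), ("metal", metal)]:
    rms, centroid = describe(x, sr)
    print(f"{name}: centroid ~{centroid.mean():.0f} Hz, "
          f"below 10% RMS by frame {int(np.argmax(rms < 0.1 * rms[0]))}")
```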

The capabilities this unlocks — timbre transfer, zero-shot generation for sounds the model hasn’t seen labeled, fine-grained control over acoustic attributes — are consequences of removing the bottleneck, not new features built on top. The bottleneck was text.
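
The timbre-transfer case can even be faked at the signal level, which clarifies what the model is being asked to do. Here is a crude numpy analog, not the model’s mechanism: keep one clip’s temporal envelope (the event’s timing and decay) and another clip’s waveform (its timbre). The `envelope` helper is a throwaway written for this sketch.

```python
# Crude signal-level analog of timbre transfer; not how the model works.
import numpy as np

sr = 16000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
event = np.exp(-t * 20) * np.random.randn(t.size)  # sharp, noisy hit
reference = np.sin(2 * np.pi * 600 * t)            # sustained metallic tone

def envelope(x, frame=256):
    # Per-frame RMS, interpolated back up to the sample rate.
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return np.interp(np.arange(len(x)), np.arange(len(rms)) * frame, rms)

# The reference supplies the timbre; the event supplies when it happens
# and how it decays.
transferred = reference * envelope(event)
```

A learned model does this in feature space rather than by direct multiplication, but the division of labor is the same: timing from one source, timbre from the other.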

This is the Foley artist’s method made computational. A Foley artist doesn’t describe the sound they want in words. They find a reference — a coconut for hooves, a celery stalk for breaking bones — and match the recording to the picture. AC-Foley does the same: the reference IS the specification.

