Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

ECCV 2026

Akio Hayakawa1, Masato Ishii1, Takashi Shibuya1, Yuki Mitsufuji1,2

1Sony AI    2Sony Group Corporation

Abstract

We propose a step-by-step video-to-audio (V2A) generation method that provides finer control over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach enables incremental generation of complementary sounds, allowing users to author multiple sound events induced by a video. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of sounds already present in previously generated tracks. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from non-overlapping segments of the same video, encouraging it to leverage acoustic context while remaining visually grounded, and enabling training with standard single-reference audiovisual datasets. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.

Step-by-step video-to-audio generation teaser
Step-by-step video-to-audio generation for compositional sound effect creation.

Generated Samples

The linked pages contain videos with audio. All videos have been visually compressed to reduce file size for demonstration purposes and may differ slightly from the actual inputs to the model.

Quick Demo

One representative composite audio example is embedded below for quick comparison. See the other sample links below for detailed individual-track and composite comparisons.

MMAudio + NAG (Ours)

MMAudio

MMAudio + Negative Prompting

Individual Audio Track Comparisons

Per-track comparisons between MMAudio, negative prompting, and our Negative Audio Guidance.

Composite Audio Comparisons

Composite audio comparisons with state-of-the-art video-to-audio baselines.

BibTeX

@inproceedings{hayakawa2026step,
  title     = {Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance},
  author    = {Hayakawa, Akio and Ishii, Masato and Shibuya, Takashi and Mitsufuji, Yuki},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}