Audio samples from "Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis"

Author: Hubert Siuzdak

Paper: arXiv
Code: GitHub



Added examples generated with the Bark text-to-audio model. See the "Audio reconstruction from Bark tokens" section below.

Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that addresses the key challenges of modeling spectral coefficients. Vocos demonstrates improved computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. As shown by objective evaluation, Vocos not only matches state-of-the-art audio quality but, thanks to its frequency-aware generator, also effectively mitigates the periodicity issues frequently associated with time-domain GANs. The source code and model weights have been open-sourced at

Figure 1: Comparison of a typical time-domain GAN vocoder (a) with the proposed Vocos architecture (b), which maintains the same temporal resolution across all layers. Time-domain vocoders use transposed convolutions to sequentially upsample the signal to the desired sample rate. In contrast, Vocos achieves this upsampling with a computationally efficient inverse Fourier transform.
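To make the idea in the caption concrete, the following toy sketch shows how frame-rate magnitude and phase values can be turned into a waveform by an inverse short-time Fourier transform with overlap-add, so that the upsampling happens inside the transform rather than in learned layers. It is a pure-Python illustration under stated assumptions, not the Vocos implementation: the function names, the Hann window, the tiny `n_fft` and `hop` values, and the hand-written magnitude and phase arrays (standing in for network outputs) are all hypothetical.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def irfft_frame(spec, n_fft):
    """Inverse real DFT of one frame given its n_fft // 2 + 1 complex bins."""
    # Rebuild the full spectrum using Hermitian symmetry of real signals.
    full = list(spec) + [spec[k].conjugate() for k in range(n_fft // 2 - 1, 0, -1)]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)).real
        / n_fft
        for n in range(n_fft)
    ]


def istft(magnitude, phase, n_fft, hop):
    """Overlap-add ISTFT: T frames of (magnitude, phase) -> n_fft + hop * (T - 1) samples."""
    n_frames = len(magnitude)
    window = hann(n_fft)
    out_len = n_fft + hop * (n_frames - 1)
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for t in range(n_frames):
        # Combine magnitude and phase into complex spectral coefficients.
        spec = [m * cmath.exp(1j * p) for m, p in zip(magnitude[t], phase[t])]
        frame = irfft_frame(spec, n_fft)
        for n in range(n_fft):
            out[t * hop + n] += frame[n] * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Normalize by the summed squared window; leave uncovered samples at zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Hypothetical stand-in for network outputs: energy in bin 3 only,
# with a phase that advances consistently from frame to frame.
n_fft, hop, n_frames = 16, 4, 10
n_bins = n_fft // 2 + 1
magnitude = [[1.0 if k == 3 else 0.0 for k in range(n_bins)] for _ in range(n_frames)]
phase = [[2 * math.pi * k * hop * t / n_fft for k in range(n_bins)] for t in range(n_frames)]

audio = istft(magnitude, phase, n_fft, hop)
print(n_frames, "frames ->", len(audio), "samples")  # 10 frames -> 52 samples
```

Each additional frame contributes `hop` new output samples, which is the sense in which the transform itself brings frame-rate features up to the audio sample rate.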

Resynthesis from neural audio codec (EnCodec)

1.5 kbps

Ground truth EnCodec Vocos

3 kbps

Ground truth EnCodec Vocos

6 kbps

Ground truth EnCodec Vocos

12 kbps

Ground truth EnCodec Vocos
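For readers wondering what the four bitrates above mean in terms of codec tokens, the arithmetic can be sketched as follows. This assumes the 24 kHz EnCodec model, which produces 75 frames per second (a hop of 320 samples) and uses residual codebooks of 1024 entries, that is 10 bits each; the constant and function names are hypothetical and chosen only for this illustration.

```python
import math

# Assumed parameters of the 24 kHz EnCodec model (not stated on this page).
SAMPLE_RATE = 24_000   # audio samples per second
HOP = 320              # audio samples per codec frame
CODEBOOK_SIZE = 1024   # entries per residual codebook


def num_codebooks(kbps):
    """Number of residual codebooks needed to reach a target bitrate in kbps."""
    frame_rate = SAMPLE_RATE / HOP                # 75 frames per second
    bits_per_codebook = math.log2(CODEBOOK_SIZE)  # 10 bits per frame per codebook
    return round(kbps * 1000 / (frame_rate * bits_per_codebook))


for kbps in (1.5, 3, 6, 12):
    print(kbps, "kbps ->", num_codebooks(kbps), "codebooks")
```

Under these assumptions each codebook contributes 750 bits per second, so the four settings correspond to 2, 4, 8, and 16 codebooks per frame.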

Resynthesis from mel-spectrograms

Ground truth HiFi-GAN BigVGAN iSTFTNet Vocos

Audio reconstruction from Bark tokens

Each token sequence below was generated by the Bark text-to-audio model from the text prompt shown, then decoded to audio with EnCodec and with Vocos:

Text prompt EnCodec Vocos
So, you've heard about neural vocoding? [laughs] We've been messing around with this new model called Vocos.
Ok [clears throat] let's compare the audio outputs. Listen carefully to the differences in each sample's quality and artifacts.
My friend’s bakery burned down last night. [sighs] Now his business is toast.
Schweinsteiger ist ein nationales kulturgut. Wir müssen ihn um jeden preis schützen.
Polecam odwiedzenie Starego Miasta w Szczecinie! Architektura jest piękna, a lokalna kuchnia doskonała!
Bonjour. Aujourd’hui, nous sommes içi pour manger trop de glace.
हॉटस्टार पर रुद्र सबसे बेहतरीन शो है! कहानी बेहद शानदार है, और अजय देवगन बहुत खूबसूरत लगते हैं।
¿Estos payasos llamaron a su modelo como un ladrido de perro? [laughs] ¿En serio?
추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다