Audio samples from "Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis"

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Author: Hubert Siuzdak

Paper: arXiv
Code: GitHub


Updates

2023-06-12

Added examples generated with the Bark text-to-audio model. See the "Audio reconstruction from Bark tokens" section below.


Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that addresses the key challenges of modeling spectral coefficients. Vocos demonstrates improved computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. As shown by objective evaluation, Vocos not only matches state-of-the-art audio quality but, thanks to its frequency-aware generator, also effectively mitigates the periodicity issues frequently associated with time-domain GANs. The source code and model weights have been open-sourced at https://github.com/charactr-platform/vocos.


Figure 1: Comparison of a typical time-domain GAN vocoder (a) with the proposed Vocos architecture (b), which maintains the same temporal resolution across all layers. Time-domain vocoders use transposed convolutions to sequentially upsample the signal to the desired sample rate. In contrast, Vocos reaches the target sample rate with a single, computationally efficient inverse Fourier transform.

Resynthesis from neural audio codec (EnCodec)

1.5 kbps

Ground truth EnCodec Vocos

3 kbps

Ground truth EnCodec Vocos

6 kbps

Ground truth EnCodec Vocos

12 kbps

Ground truth EnCodec Vocos
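As a rough guide to what the bitrates above mean in terms of tokens, the sketch below converts a target bitrate into a number of residual codebooks. The constants are assumptions that do not come from this page: they follow the publicly described 24 kHz EnCodec configuration (75 frames per second, 1024 entries per codebook), and `codebooks_for_bitrate` is a hypothetical helper name.

```python
import math

FRAME_RATE_HZ = 75    # assumed: token frames per second for the 24 kHz codec
CODEBOOK_SIZE = 1024  # assumed: entries per codebook, i.e. 10 bits per token


def codebooks_for_bitrate(kbps):
    """Number of codebooks needed to reach the given bitrate."""
    bits_per_token = math.log2(CODEBOOK_SIZE)
    bits_per_second_per_codebook = bits_per_token * FRAME_RATE_HZ  # 750 bits/s
    return round(kbps * 1000 / bits_per_second_per_codebook)


for kbps in (1.5, 3, 6, 12):
    print(kbps, "kbps ->", codebooks_for_bitrate(kbps), "codebooks")
```

Under these assumptions, each doubling of the bitrate doubles the number of codebooks available to the decoder, which is why the lowest-bitrate samples are the hardest to reconstruct well.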

Resynthesis from mel-spectrograms

Ground truth HiFi-GAN BigVGAN iSTFTNet Vocos

Audio reconstruction from Bark tokens

Sequences of tokens generated with the Bark text-to-audio model: https://github.com/suno-ai/bark

Text prompt EnCodec Vocos
So, you've heard about neural vocoding? [laughs] We've been messing around with this new model called Vocos.
Ok [clears throat] let's compare the audio outputs. Listen carefully to the differences in each sample's quality and artifacts.
My friend’s bakery burned down last night. [sighs] Now his business is toast.
Schweinsteiger ist ein nationales kulturgut. Wir müssen ihn um jeden preis schützen.
Polecam odwiedzenie Starego Miasta w Szczecinie! Architektura jest piękna, a lokalna kuchnia doskonała!
我计划在下周的游泳比赛中和我的朋友托尼比赛。他认为自己可以打败我,但他不知道我一直在浴缸里偷偷练习游泳。我不敢说我会赢,但我很确定我会搞出一片浪花。
Bonjour. Aujourd’hui, nous sommes içi pour manger trop de glace.
हॉटस्टार पर रुद्र सबसे बेहतरीन शो है! कहानी बेहद शानदार है, और अजय देवगन बहुत खूबसूरत लगते हैं।
¿Estos payasos llamaron a su modelo como un ladrido de perro? [laughs] ¿En serio?
추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다