Introduction to Vocos: Fast Neural Vocoder

Ng Wai Foong
6 min read · Jun 28, 2023

Integrate with 🐶 Bark to synthesize high-quality audio waveforms from acoustic features

Image by the author (AI generated)

By reading this piece, you will learn to reconstruct high-quality audio from mel-spectrograms or EnCodec tokens. According to the official repository, Vocos is

… a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.
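To make the last sentence of that description more concrete, here is a minimal sketch of what "reconstruction through inverse Fourier transform" means for a single frame. It is not Vocos code: the function name `frame_from_spectrum` is hypothetical, and it assumes the spectral coefficients arrive as one magnitude and one phase per frequency bin. A full inverse STFT would also window the frames and overlap-add them, which this sketch leaves out.

```python
import cmath
import math


def frame_from_spectrum(magnitudes, phases):
    """Rebuild one frame of real samples from one-sided magnitude/phase bins.

    Hypothetical helper for illustration only (not part of Vocos).
    Expects n_fft // 2 + 1 bins and returns n_fft samples.
    """
    n_bins = len(magnitudes)
    n_fft = 2 * (n_bins - 1)

    # Polar form -> complex coefficients for bins 0 .. n_fft/2.
    half = [m * cmath.exp(1j * p) for m, p in zip(magnitudes, phases)]

    # A real signal has a conjugate-symmetric spectrum: X[N - k] = conj(X[k]).
    full = half + [half[k].conjugate() for k in range(n_bins - 2, 0, -1)]

    # Naive inverse DFT, O(n_fft^2). Real systems use an FFT instead.
    samples = []
    for n in range(n_fft):
        acc = sum(
            full[k] * cmath.exp(2j * math.pi * k * n / n_fft)
            for k in range(n_fft)
        )
        samples.append(acc.real / n_fft)
    return samples


# One non-zero bin (bin 1, magnitude n_fft / 2, zero phase) yields one cosine cycle.
frame = frame_from_spectrum([0.0, 4.0, 0.0, 0.0, 0.0], [0.0] * 5)
print([round(x, 3) for x in frame])
```

The point of the sketch is that the transform itself has no learned parameters: once a model has predicted the coefficients for every frame, turning them into samples is a fixed, cheap operation.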

Vocos is capable of producing high-quality audio samples. The official project page provides audio samples and comparisons against the following state-of-the-art vocoders:

  • HiFi-GAN
  • BigVGAN
  • iSTFTNet

One main advantage of Vocos is that it can reconstruct audio from EnCodec tokens. Hence, Vocos can be easily integrated into most of the latest TTS pipelines that use EnCodec for neural audio compression. For example, you can generate audio tokens using 🐶 Bark, a transformer-based text-to-audio model, and then reconstruct the final audio…
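To get a feel for what a grid of EnCodec tokens looks like before it reaches a vocoder, the sketch below works out its shape and bitrate. The constants are assumptions that this article does not state: they are the figures commonly quoted for the 24 kHz EnCodec model (75 token frames per second, 1024 entries per codebook). The helper names `token_grid` and `bitrate_kbps` are hypothetical, and the 8-codebook example is an assumption about a Bark-style output.

```python
import math

# Assumed figures for the 24 kHz EnCodec model (not taken from this article).
FRAME_RATE_HZ = 75      # token frames per second of audio
CODEBOOK_SIZE = 1024    # entries per codebook, i.e. 10 bits per token


def token_grid(duration_s, n_codebooks):
    """Return (n_codebooks, n_frames) for a clip of the given duration."""
    return n_codebooks, math.ceil(duration_s * FRAME_RATE_HZ)


def bitrate_kbps(n_codebooks):
    """Bitrate implied by emitting n_codebooks tokens for every frame."""
    bits_per_token = math.log2(CODEBOOK_SIZE)
    return n_codebooks * FRAME_RATE_HZ * bits_per_token / 1000


# Assumed example: a 2-second clip encoded with 8 codebooks.
print(token_grid(2.0, 8))   # (8, 150)
print(bitrate_kbps(8))      # 6.0
```

Under these assumptions, every codebook adds 0.75 kbps, so the number of codebooks in the token grid tells you which bandwidth the tokens correspond to.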

Ng Wai Foong

Senior AI Engineer@Yoozoo | Content Writer #NLP #datascience #programming #machinelearning | Linkedin: https://www.linkedin.com/in/wai-foong-ng-694619185/