HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Last update: Jan 02, 2023

Overview

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.
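
To make the speed figures in the abstract concrete, the sketch below converts a real-time factor into samples generated per wall-clock second and into the time needed per second of audio. The function names are illustrative and not part of this repository.

```python
SAMPLE_RATE_HZ = 22_050  # 22.05 kHz, as stated in the abstract


def samples_per_second(real_time_factor: float,
                       sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Audio samples generated per wall-clock second."""
    return real_time_factor * sample_rate_hz


def ms_per_audio_second(real_time_factor: float) -> float:
    """Wall-clock milliseconds needed to synthesize one second of audio."""
    return 1000.0 / real_time_factor


if __name__ == "__main__":
    # Real-time factors quoted in the abstract.
    for label, rtf in [("V100 GPU", 167.9), ("CPU, small footprint", 13.4)]:
        print(f"{label}: {samples_per_second(rtf):,.0f} samples/s, "
              f"{ms_per_audio_second(rtf):.2f} ms per second of audio")
```

Under these figures, the GPU setting corresponds to roughly 3.7 million samples per second and the CPU setting to roughly 0.3 million.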

Visit our demo website for audio samples.

Pre-requisites

Python >= 3.6
Clone this repository.
Install the Python requirements. Please refer to requirements.txt.
Download and extract the LJ Speech dataset, then move all wav files to LJSpeech-1.1/wavs.
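
The last step can be scripted. The following is a minimal sketch, assuming the archive was extracted somewhere under a directory you pass in; `collect_wavs` is a hypothetical helper, not part of this repository.

```python
import shutil
from pathlib import Path


def collect_wavs(source_dir: str, target_dir: str = "LJSpeech-1.1/wavs") -> int:
    """Move every .wav file found under source_dir into target_dir.

    Returns the number of files moved. Files that are already in target_dir,
    or whose name already exists there, are left untouched.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = 0
    # Build the full list first so that moving files does not disturb the walk.
    for wav in sorted(source.rglob("*.wav")):
        if wav.parent.resolve() == target.resolve():
            continue  # already in place
        destination = target / wav.name
        if destination.exists():
            continue  # do not overwrite a file with the same name
        shutil.move(str(wav), str(destination))
        moved += 1
    return moved
```

Calling `collect_wavs("downloads")` from the repository root would gather the files into the default `LJSpeech-1.1/wavs` location; `downloads` is only an example path.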

Training

python train.py --config config_v1.json

To train the V2 or V3 generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path with the --checkpoint_path option.
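
To pick up the most recent generator checkpoint for inference, a small lookup like the one below may help. It is a sketch under one assumption that this README does not confirm: generator checkpoints are written with a `g_` prefix followed by a numeric step (for example `g_00005000`). Adjust `prefix` if your files are named differently; `latest_checkpoint` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str = "cp_hifigan",
                      prefix: str = "g_") -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if none exist.

    The "<prefix><digits>" naming scheme is an assumption, not documented here.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None

    candidates = []
    for path in directory.glob(prefix + "*"):
        step = path.name[len(prefix):]
        if step.isdigit():  # ignore other files, such as the copied config
            candidates.append((int(step), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```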

Validation loss during training with V1 generator.

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

Folder Name	Generator	Dataset	Fine-Tuned
LJ_V1	V1	LJSpeech	No
LJ_V2	V2	LJSpeech	No
LJ_V3	V3	LJSpeech	No
LJ_FT_T2_V1	V1	LJSpeech	Yes (Tacotron2)
LJ_FT_T2_V2	V2	LJSpeech	Yes (Tacotron2)
LJ_FT_T2_V3	V3	LJSpeech	Yes (Tacotron2)
VCTK_V1	V1	VCTK	No
VCTK_V2	V2	VCTK	No
VCTK_V3	V3	VCTK	No
UNIVERSAL_V1	V1	Universal	No

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
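
When scripting downloads or evaluations, the table above can be held as a small lookup structure. The sketch below encodes the rows exactly as listed; `PretrainedModel` and `find_models` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Optional


class PretrainedModel(NamedTuple):
    folder: str
    generator: str
    dataset: str
    fine_tuned_with: Optional[str]  # None when the model is not fine-tuned


MODELS = [
    PretrainedModel("LJ_V1", "V1", "LJSpeech", None),
    PretrainedModel("LJ_V2", "V2", "LJSpeech", None),
    PretrainedModel("LJ_V3", "V3", "LJSpeech", None),
    PretrainedModel("LJ_FT_T2_V1", "V1", "LJSpeech", "Tacotron2"),
    PretrainedModel("LJ_FT_T2_V2", "V2", "LJSpeech", "Tacotron2"),
    PretrainedModel("LJ_FT_T2_V3", "V3", "LJSpeech", "Tacotron2"),
    PretrainedModel("VCTK_V1", "V1", "VCTK", None),
    PretrainedModel("VCTK_V2", "V2", "VCTK", None),
    PretrainedModel("VCTK_V3", "V3", "VCTK", None),
    PretrainedModel("UNIVERSAL_V1", "V1", "Universal", None),
]


def find_models(generator: Optional[str] = None,
                dataset: Optional[str] = None,
                fine_tuned: Optional[bool] = None) -> List[str]:
    """Return folder names matching every criterion that is not None."""
    matches = []
    for model in MODELS:
        if generator is not None and model.generator != generator:
            continue
        if dataset is not None and model.dataset != dataset:
            continue
        if fine_tuned is not None and (model.fine_tuned_with is not None) != fine_tuned:
            continue
        matches.append(model.folder)
    return matches
```

For example, `find_models(generator="V1", dataset="LJSpeech", fine_tuned=True)` selects the Tacotron2 fine-tuned V1 folder.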

Fine-Tuning

Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.
Example:
```
Audio File : LJ001-0001.wav
Mel-Spectrogram File : LJ001-0001.npy
```
Create an ft_dataset folder and copy the generated mel-spectrogram files into it.
Run the following command.
```
python train.py --fine_tuning True --config config_v1.json
```
For other command line options, please refer to the training section.
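
Because the mel-spectrogram file names must match the audio file names, it can be worth checking the pairing before starting a run. The following is a minimal sketch; `check_fine_tuning_pairs` is a hypothetical helper, and the default `wav_dir` assumes the audio lives in LJSpeech-1.1/wavs as described in the pre-requisites.

```python
from pathlib import Path
from typing import List, Tuple


def check_fine_tuning_pairs(wav_dir: str = "LJSpeech-1.1/wavs",
                            mel_dir: str = "ft_dataset") -> Tuple[List[str], List[str]]:
    """Compare file stems of *.wav in wav_dir with *.npy in mel_dir.

    Returns (wavs_without_mel, mels_without_wav), each sorted by name.
    """
    wav_stems = {path.stem for path in Path(wav_dir).glob("*.wav")}
    mel_stems = {path.stem for path in Path(mel_dir).glob("*.npy")}
    return sorted(wav_stems - mel_stems), sorted(mel_stems - wav_stems)


if __name__ == "__main__":
    missing_mels, orphan_mels = check_fine_tuning_pairs()
    print(f"{len(missing_mels)} wav files have no mel-spectrogram")
    print(f"{len(orphan_mels)} mel-spectrograms have no wav file")
```

A mel-spectrogram without a matching wav file most likely points to a naming mistake. Wav files without a mel-spectrogram may be harmless if you only generated mel-spectrograms for part of the dataset.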

Inference from wav file

Make a test_files directory and copy the wav files into it.

Run the following command.

python inference.py --checkpoint_file [generator checkpoint file path]

Generated wav files are saved in generated_files by default.
You can change the path with the --output_dir option.
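
Input audio at a different sample rate than the model expects is a common source of poor results. The sketch below lists files in test_files whose sample rate differs from 22.05 kHz, the rate quoted in the abstract. Whether your checkpoint expects that rate is an assumption; check the configuration file saved alongside it. `find_sample_rate_mismatches` is a hypothetical helper.

```python
import wave
from pathlib import Path
from typing import Dict


def find_sample_rate_mismatches(wav_dir: str = "test_files",
                                expected_hz: int = 22_050) -> Dict[str, int]:
    """Return {file name: sample rate} for wav files not sampled at expected_hz.

    Uses the standard-library wave module, which reads PCM wav files only
    and raises wave.Error for other encodings.
    """
    mismatches = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_hz:
            mismatches[path.name] = rate
    return mismatches
```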

Inference for end-to-end speech synthesis

Make a test_mel_files directory and copy the generated mel-spectrogram files into it.
You can generate mel-spectrograms using Tacotron2, Glow-TTS, and so forth.

Run the following command.

python inference_e2e.py --checkpoint_file [generator checkpoint file path]

Generated wav files are saved in generated_files_from_mel by default.
You can change the path with the --output_dir option.
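
Both inference scripts take the same two options described in this README, so they can be driven from Python in the same way. The sketch below only builds the argument list; `build_inference_command` is a hypothetical helper, and the checkpoint path shown in the usage note is a placeholder.

```python
import sys
from typing import List, Optional


def build_inference_command(script: str, checkpoint_file: str,
                            output_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for inference.py or inference_e2e.py.

    Only the options described in this README are used.
    """
    if script not in ("inference.py", "inference_e2e.py"):
        raise ValueError(f"unexpected script: {script}")
    command = [sys.executable, script, "--checkpoint_file", checkpoint_file]
    if output_dir is not None:
        command += ["--output_dir", output_dir]
    return command
```

From the repository root, the result can be passed to `subprocess.run(command, check=True)`, for example with `build_inference_command("inference_e2e.py", "path/to/generator_checkpoint", "my_outputs")`.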

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.
