Conformer ASR

A minimal Conformer ASR implementation adapted from ESPnet.

Introduction

I want to use the pre-trained English ASR model provided by ESPnet. However, ESPnet is relatively heavy for my use case, so here I extract only the Conformer ASR part from ESPnet, which makes customization much easier. Let's do it.

There are a bunch of pre-trained ASR models listed here. I chose the one named:

kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave
Its performance, as reported [here](https://zenodo.org/record/4604066#.YbxsX5FByV4), is summarized below.
  • WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 97.9 | 1.9 | 0.2 | 0.2 | 2.3 | 28.6 |
| decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 94.5 | 5.1 | 0.5 | 0.6 | 6.1 | 48.3 |
| decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 97.7 | 2.1 | 0.2 | 0.3 | 2.6 | 31.4 |
| decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 94.7 | 4.9 | 0.5 | 0.7 | 6.0 | 49.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 98.3 | 1.5 | 0.2 | 0.2 | 1.9 | 25.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 95.8 | 3.7 | 0.4 | 0.5 | 4.6 | 40.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 98.1 | 1.7 | 0.2 | 0.3 | 2.1 | 26.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 95.8 | 3.7 | 0.5 | 0.5 | 4.7 | 42.4 |

  • CER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 288456 | 99.4 | 0.3 | 0.2 | 0.2 | 0.8 | 28.6 |
| decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 265951 | 98.0 | 1.2 | 0.8 | 0.7 | 2.7 | 48.3 |
| decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 281530 | 99.4 | 0.3 | 0.3 | 0.3 | 0.9 | 31.4 |
| decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 272758 | 98.2 | 1.0 | 0.7 | 0.7 | 2.5 | 49.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 288456 | 99.5 | 0.3 | 0.2 | 0.2 | 0.7 | 25.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 265951 | 98.3 | 1.0 | 0.7 | 0.5 | 2.2 | 40.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 281530 | 99.5 | 0.3 | 0.3 | 0.2 | 0.7 | 26.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 272758 | 98.5 | 0.8 | 0.7 | 0.5 | 2.1 | 42.4 |

  • TER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 68010 | 97.5 | 1.9 | 0.7 | 0.4 | 2.9 | 28.6 |
| decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 63110 | 93.4 | 5.0 | 1.6 | 1.0 | 7.6 | 48.3 |
| decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 65818 | 97.2 | 2.0 | 0.8 | 0.4 | 3.3 | 31.4 |
| decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 65101 | 93.7 | 4.5 | 1.8 | 0.9 | 7.2 | 49.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 68010 | 97.8 | 1.5 | 0.7 | 0.3 | 2.5 | 25.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 63110 | 94.6 | 3.8 | 1.6 | 0.7 | 6.1 | 40.0 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 65818 | 97.6 | 1.6 | 0.8 | 0.3 | 2.7 | 26.2 |
| decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 65101 | 94.7 | 3.5 | 1.8 | 0.7 | 6.0 | 42.4 |
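In these tables, Corr, Sub, Del, and Ins are the percentages of correct, substituted, deleted, and inserted units (words, characters, or BPE tokens), Err is their sum Sub + Del + Ins, and S.Err is the percentage of sentences containing at least one error. As a reference for how such an error rate is computed, here is a minimal word-error-rate sketch based on edit distance (my own illustration, not code from this repo):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)

print(word_error_rate("this is a test", "this is test"))  # 0.25 (one deletion)
```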

ASR step by step

1. Set up the code

```bash
pip install .
```
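Since the whole point of this extraction is customization, an editable install may be handier than a plain one (assuming the package uses a standard setup.py/pyproject layout):

```bash
pip install -e .
```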

2. Download the model and unzip it

```bash
wget "https://zenodo.org/record/4604066/files/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip?download=1" -O conformer.zip
unzip conformer.zip
```

3. Run an example

```python
import torch
import librosa
from mmds.utils.spectrogram import MelSpectrogram
from conformer_asr import Conformer, Tokenizer

sample_rate = 16000
cfg_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/config.yaml"
bpe_path = "./data/en_unnorm_token_list/bpe_unigram5000/bpe.model"
ckpt_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/valid.acc.ave_10best.pth"

tokenizer = Tokenizer(cfg_path, bpe_path)
conformer = Conformer(tokenizer, ckpt_path=ckpt_path)
conformer.eval()

# Mel spectrogram front-end; hop/window sizes match the model name.
spec_fn = MelSpectrogram(
    sample_rate,
    hop_length=256,
    f_min=0,
    f_max=8000,
    win_length=512,
    power=2,
)

# librosa resamples to 16 kHz on load.
w0, _ = librosa.load("./example.m4a", sr=sample_rate)
w0 = torch.from_numpy(w0)
m0 = spec_fn(w0).t()  # (frames, mel bins)

n_frames = len(m0)

# Create a batch of different-length audio (yes, this is supported):
# the full clip, its first half, and its first quarter.
x = [m0, m0[: n_frames // 2], m0[: n_frames // 4]]

ref = "This is a test video for youtube-dl. For more information, contact [email protected]".lower()
hyps = conformer.decode(x, beam_width=20)

print("REF", ref)
for hyp in hyps:
    print("HYP", hyp.lower())
```
  • Results

```
REF this is a test video for youtube-dl. for more information, contact [email protected]
HYP this is a test video for you do bl for more information -- contact the hih aging at the hihaging, not the
HYP this is a test for you d bl for more information
HYP this is a testim for you to
```
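The example above batches clips of different lengths in a single decode call. To reuse the same pipeline for several audio files, it can be wrapped into a small helper (a sketch that assumes the `conformer`, `spec_fn`, and `sample_rate` objects from the example above; `transcribe` is a hypothetical name, not part of this repo's API):

```python
def transcribe(paths, beam_width=20):
    """Load each file, extract mel features, and decode all of them in one batch."""
    feats = []
    for path in paths:
        wav, _ = librosa.load(path, sr=sample_rate)       # resamples to 16 kHz
        feats.append(spec_fn(torch.from_numpy(wav)).t())  # (frames, mel bins)
    return conformer.decode(feats, beam_width=beam_width)

for hyp in transcribe(["./example.m4a"]):
    print("HYP", hyp.lower())
```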

Features

Supported

  • Batched decoding

Not supported yet

  • Transformer language model
  • Other checkpoints