Wav2Vec for speech recognition, classification, and audio classification

Last update: Dec 15, 2022

Overview

Soxan

در زبان پارسی به نام سخن

This repository consists of models, scripts, and notebooks that help you to use all the benefits of Wav2Vec 2.0 in your research. In the following, I'll show you how to train speech tasks in your dataset and how to use the pretrained models.

How to train

I'm just at the beginning of all the possible speech tasks. To start, we continue the training script with the speech emotion recognition problem.

Training - Notebook

Task	Notebook
Speech Emotion Recognition (Wav2Vec 2.0)
Speech Emotion Recognition (Hubert)
Audio Classification (Wav2Vec 2.0)

Training - CMD

python3 run_wav2vec_clf.py \
    --pooling_mode="mean" \
    --model_name_or_path="lighteternal/wav2vec2-large-xlsr-53-greek" \
    --model_mode="wav2vec2" \ # or you can use hubert
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --train_file=/path/to/train.csv \
    --validation_file=/path/to/dev.csv \
    --test_file=/path/to/test.csv \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-4 \
    --num_train_epochs=5.0 \
    --evaluation_strategy="steps"\
    --save_steps=100 \
    --eval_steps=100 \
    --logging_steps=100 \
    --save_total_limit=2 \
    --do_eval \
    --do_train \
    --fp16 \
    --freeze_feature_extractor

Prediction

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import Wav2Vec2ForSpeechClassification, HubertForSpeechClassification

model_name_or_path = "path/to/your-pretrained-model"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

# for wav2vec
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

# for hubert
model = HubertForSpeechClassification.from_pretrained(model_name_or_path).to(device)


def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in
               enumerate(scores)]
    return outputs


path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)

Output:

[
    {'Emotion': 'anger', 'Score': '0.0%'},
    {'Emotion': 'disgust', 'Score': '99.2%'},
    {'Emotion': 'fear', 'Score': '0.1%'},
    {'Emotion': 'happiness', 'Score': '0.3%'},
    {'Emotion': 'sadness', 'Score': '0.5%'}
]

Demos

Demo	Link
Speech To Text With Emotion Recognition (Persian) - soon	huggingface.co/spaces/m3hrdadfi/speech-text-emotion

Models

Dataset	Model
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/wav2vec2-xlsr-persian-speech-emotion-recognition
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/hubert-base-persian-speech-emotion-recognition
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/hubert-base-persian-speech-gender-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/hubert-large-greek-speech-emotion-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/hubert-base-greek-speech-emotion-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition
Eating Sound Collection	m3hrdadfi/wav2vec2-base-100k-eating-sound-collection
GTZAN Dataset - Music Genre Classification	m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres

Wav2Vec for speech recognition, classification, and audio classification

Related tags

Overview

Soxan

How to train

Training - Notebook

Training - CMD

Prediction

Demos

Models

Owner

Mehrdad Farahani

Official PyTorch implementation of Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations

Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision (ICCV 2021)

TransPrompt - Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

An implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

Supplementary materials to "Spin-optomechanical quantum interface enabled by an ultrasmall mechanical and optical mode volume cavity" by H. Raniwala, S. Krastanov, M. Eichenfield, and D. R. Englund, 2022

Code for testing various M1 Chip benchmarks with TensorFlow.

Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

PCAM: Product of Cross-Attention Matrices for Rigid Registration of Point Clouds

Kernel Point Convolutions

Augmented CLIP - Training simple models to predict CLIP image embeddings from text embeddings, and vice versa.

Implementation of FitVid video prediction model in JAX/Flax.

Official PyTorch implementation of "Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-supervised Action Recognition" in AAAI2022.

Visualization toolkit for neural networks in PyTorch! Demo -->

Dynamic vae - Dynamic VAE algorithm is used for anomaly detection of battery data

Simulating Sycamore quantum circuits classically using tensor network algorithm.

A Python training and inference implementation of Yolov5 helmet detection in Jetson Xavier nx and Jetson nano

This repository contains a re-implementation of the code for the CVPR 2021 paper "Omnimatte: Associating Objects and Their Effects in Video."