Who calls the shots? Rethinking Few-Shot Learning for Audio (WASPAA 2021)

Overview

rethink-audio-fsl

This repo contains the source code for the paper "Who calls the shots? Rethinking Few-Shot Learning for Audio." (WASPAA 2021)

Table of Contents

Dataset

Models in this work are trained on FSD-MIX-CLIPS, an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer this (intermediate) dataset of 10s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes where n is ranging from 1 to 5. We then extract 614,533 1s clips centered on each sound event in the soundscapes in FSD-MIX-SED to produce FSD-MIX-CLIPS.

Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material, a subset of FSD50K, and soundscape annotations in JAMS format which can be used to reproduce FSD-MIX-SED using Scaper. All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in second) of each 1-second clip.

To reproduce FSD-MIX-SED:

  1. Download all files from Zenodo.
  2. Extract .tar.gz files. You will get
  • FSD_MIX_SED.annotations: 281,039 annotation files, 35GB
  • FSD_MIX_SED.source: 10,296 single-labeled audio clips, 1.9GB
  • FSD_MIX_CLIPS.annotations: 5 annotation files for each class/data split
  • vocab.json: 89 classes, each class is then labeled by its index in the list in following experiments. 0-58: base, 59-73: novel-val, 74-88: novel-test.

We will use FSD_MIX_SED.annotations and FSD_MIX_SED.source to reproduce the audio data in FSD_MIX_SED, and use the audio with FSD_MIX_CLIPS.annotation for the following training and evaluation.

  1. Install Scaper
  2. Generate soundscapes from jams files by running the command. Set annpaths and audiopath to the extracted folders, and savepath to the desired path to save output audio files.
python ./data/generate_soundscapes.py \
--annpath PATH-TO-FSD_MIX_SED.annotations \
--audiopath PATH-TO-FSD_MIX_SED.source \
--savepath PATH-TO-SAVE-OUTPUT

Note that this will generate 281,039 audio files with a size of ~450GB to the folder FSD_MIX_SED.audio at the set savepath.

If you want to get the foreground material (FSD-MIX-SED.source) directly from FSD50K instead of downloading them, run

python ./data/preprocess_foreground_sounds.py \
--fsdpath PATH-TO-FSD50K \
--outpath PATH_TO_SAVE_OUTPUT

Experiment

We provide source code to train the best performing embedding model (pretrained OpenL3 + FC) and three different few-shot methods to predict both base and novel class data.

Preprocessing

Once audio files are reproduced, we pre-compute OpenL3 embeddings of clips in FSD-MIX-CLIPS and save them.

  1. Install OpenL3
  2. Set paths of the downloaded FSD_MIX_CLIPS.annotations and generated FSD_MIX_SED.audio, and run
python get_openl3emb_and_filelist.py \
--annpath PATH-TO-FSD_MIX_CLIPS.annotations \
--audiopath PATH-TO-FSD_MIX_SED.audio \
--savepath PATH_TO_SAVE_OUTPUT

This generates 614,533 .pkl files where each file contains an embedding. A set of filelists will also be saved under current folder.

Environment

Create conda environment from the environment.yml file and activate it.

Note that you only need the environment if you want to train/evaluate the models. For reproducing the dataset, see Dataset.

conda env create -f environment.yml
conda activate dfsl

Training

  • Training configuration can be specified using config files in ./config
  • Model checkpoints will be saved in the folder ./experiments, and tensorboard data will be saved in the folder ./run

1. Base classifier

First, to train the base classifier on base classes, run

python train.py --config openl3CosineClassifier --openl3

2. Few-shot weight generator for DFSL

Once the base model is trained, we can train the few-shot weight generator for DFSL by running

python train.py --config openl3CosineClassifierGenWeightAttN5 --openl3

By default, DFSL is trained with 5 support examples: n=5, to train DFSL with different n, run

# n=10
python train.py --config openl3CosineClassifierGenWeightAttN10 --openl3

# n=20
python train.py --config openl3CosineClassifierGenWeightAttN20 --openl3

# n=30
python train.py --config openl3CosineClassifierGenWeightAttN30 --openl3

Evaluation

We evaluate the trained models on test data from both base and novel classes. For each novel class, we need to sample a support set. Run the command below to split the original filelist for test classes to test_support_filelist.pkl and test_query_filelist.pkl.

python get_test_support_and_query.py
  • Here we consider monophonic support examples with mixed(random) SNR. Code to run evaluation with polyphonic support examples with specific low/high SNR will be released soon.

For evaluation, we compute features for both base and novel test data, then make predictions and compute metrics in a joint label space. The computed features, model predictions, and metrics will be saved in the folder ./experiments. We consider 3 few-shot methods to predict novel classes. To test different number of support examples, set different n_pos in the following commands.

1. Prototype

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Get and save model prediction, run this multiple time (niter) to count for random selection of novel examples.
python pred.py --config=openl3CosineClassifier --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# compute and save evaluation metrics based on model prediction
python metrics.py --config=audioset_pannCosineClassifier --openl3 --n_base 59 --n_novel 15 --n_pos 5

2. DFSL

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifierGenWeightAttN5 --openl3

# Get and save model prediction, run this multiple time (niter) to count for random selection of novel examples.
python pred.py --config=openl3CosineClassifierGenWeightAttN5 --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# compute and save evaluation metrics based on model prediction
python metrics.py --config=audioset_pannCosineClassifierGenWeightAttN5 --openl3 --n_base 59 --n_novel 15 --n_pos 5

3. Logistic regression

Train a binary logistic regression model for each novel class. Note that we need to sample n_neg of examples from the base training data as the negative examples. Default n_neg is 100. We also did a hyperparameter search on n_neg based on the validation data while n_pos changing from 5 to 30:

  • n_pos=5, n_neg=100
  • n_pos=10, n_neg=500
  • n_pos=20, n_neg=1000
  • n_pos=30, n_neg=5000
# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Train binary logistic regression models, predict test data, and compute metrics
python logistic_regression.py --config=openl3CosineClassifier --openl3 --niter 10 --n_base 59 --n_novel 15 --n_pos 5 --n_neg 100

Reference

This code is built upon the implementation from FewShotWithoutForgetting

Citation

Please cite our paper if you find the code or dataset useful for your research.

Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

Owner
Yu Wang
Ph.D. Candidate
Yu Wang
Make your master artistic punk avatar through machine learning world famous paintings.

Master-art-punk Make your master artistic punk avatar through machine learning world famous paintings. 通过机器学习世界名画制作属于你的大师级艺术朋克头像 Nowadays, NFT is beco

Philipjhc 53 Dec 27, 2022
Implementation of our paper "DMT: Dynamic Mutual Training for Semi-Supervised Learning"

DMT: Dynamic Mutual Training for Semi-Supervised Learning This repository contains the code for our paper DMT: Dynamic Mutual Training for Semi-Superv

Zhengyang Feng 120 Dec 30, 2022
Code for approximate graph reduction techniques for cardinality-based DSFM, from paper

SparseCard Code for approximate graph reduction techniques for cardinality-based DSFM, from paper "Approximate Decomposable Submodular Function Minimi

Nate Veldt 1 Nov 25, 2022
Suite of 500 procedurally-generated NLP tasks to study language model adaptability

TaskBench500 The TaskBench500 dataset and code for generating tasks. Data The TaskBench dataset is available under wget http://web.mit.edu/bzl/www/Tas

Belinda Li 20 May 17, 2022
A Python 3 package for state-of-the-art statistical dimension reduction methods

direpack: a Python 3 library for state-of-the-art statistical dimension reduction techniques This package delivers a scikit-learn compatible Python 3

Sven Serneels 32 Dec 14, 2022
Official Implementation of LARGE: Latent-Based Regression through GAN Semantics

LARGE: Latent-Based Regression through GAN Semantics [Project Website] [Google Colab] [Paper] LARGE: Latent-Based Regression through GAN Semantics Yot

83 Dec 06, 2022
The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining Concept-Oriented Shared Information".

The HIST framework for stock trend forecasting The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining C

Wentao Xu 110 Dec 27, 2022
Stock-history-display - something like a easy yearly review for your stock performance

Stock History Display Available on Heroku: https://stock-history-display.herokua

LiaoJJ 1 Jan 07, 2022
Code for Paper Predicting Osteoarthritis Progression via Unsupervised Adversarial Representation Learning

Predicting Osteoarthritis Progression via Unsupervised Adversarial Representation Learning (c) Tianyu Han and Daniel Truhn, RWTH Aachen University, 20

Tianyu Han 7 Nov 22, 2022
Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation

OSCAR Project Page | Paper This repository contains the codebase used in OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Ma

NVIDIA Research Projects 74 Dec 22, 2022
Gym Threat Defense

Gym Threat Defense The Threat Defense environment is an OpenAI Gym implementation of the environment defined as the toy example in Optimal Defense Pol

Hampus Ramström 5 Dec 08, 2022
Bunch of different tools which helps visualizing and annotating images for semantic/instance segmentation tasks

Data Framework for Semantic/Instance Segmentation Bunch of different tools which helps visualizing, transforming and annotating images for semantic/in

Bruno Fernandes Carvalho 5 Dec 21, 2022
Synthesize photos from PhotoDNA using machine learning 🌱

Ribosome Synthesize photos from PhotoDNA. See the blog post for more information. Installation Dependencies You can install Python dependencies using

Anish Athalye 112 Nov 23, 2022
Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021

DIFFNet This repo is for Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021 A new backbone for self-supervised de

Hang 94 Dec 25, 2022
Multi-Objective Loss Balancing for Physics-Informed Deep Learning

Multi-Objective Loss Balancing for Physics-Informed Deep Learning Code for ReLoBRaLo. Abstract Physics Informed Neural Networks (PINN) are algorithms

Rafael Bischof 16 Dec 12, 2022
This repository contains the code and models for the following paper.

DC-ShadowNet Introduction This is an implementation of the following paper DC-ShadowNet: Single-Image Hard and Soft Shadow Removal Using Unsupervised

AuAgCu 65 Dec 27, 2022
Scheme for training and applying a label propagation framework

Factorisation-based Image Labelling Overview This is a scheme for training and applying the factorisation-based image labelling (FIL) framework. Some

Wellcome Centre for Human Neuroimaging 2 Dec 17, 2021
An extremely simple, intuitive, hardware-friendly, and well-performing network structure for LiDAR semantic segmentation on 2D range image. IROS21

FIDNet_SemanticKITTI Motivation Implementing complicated network modules with only one or two points improvement on hardware is tedious. So here we pr

YimingZhao 54 Dec 12, 2022
Taming Transformers for High-Resolution Image Synthesis

Taming Transformers for High-Resolution Image Synthesis CVPR 2021 (Oral) Taming Transformers for High-Resolution Image Synthesis Patrick Esser*, Robin

CompVis Heidelberg 3.5k Jan 03, 2023
Dialect classification

Dialect-Classification This repository presents the data that was used in a talk at ICKL-5 (5th International Conference on Kurdish Linguistics) at th

Kurdish-BLARK 0 Nov 12, 2021