ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.

Last update: Nov 08, 2022

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

This repository contains the code for our ICCV 2021 paper:

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Sangho Lee*, Jiwan Chung*, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song (*: equal contribution)
[paper]

@inproceedings{lee2021acav100m,
    title="{ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning}",
    author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
    booktitle={ICCV},
    year=2021
}

System Requirements

Python >= 3.8.5
FFmpeg 4.3.1

Installation

Install PyTorch 1.6.0, torchvision 0.7.0, and torchaudio 0.6.0 for your environment, following the official PyTorch installation instructions.
Install the other required packages.

pip install -r requirements.txt
python -m nltk.downloader 'punkt'
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/<cuda version>/torch1.6/index.html
pip install git+https://github.com/jiwanchung/slowfast
pip install torch-scatter==2.0.5 -f https://pytorch-geometric.com/whl/torch-1.6.0+<cuda version>.html

Replace <cuda version> with the tag for your CUDA version, e.g. cu102 for CUDA 10.2.
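The placeholder substitution can be sketched in Python. The helper names `cuda_tag` and `wheel_index_urls` are hypothetical and not part of this repository; the URL patterns are copied from the commands above. The sketch only fills in the placeholder and does not check that wheels exist for a given tag.

```python
def cuda_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as "10.2" into a wheel tag such as "cu102"."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def wheel_index_urls(cuda_version: str, torch_version: str = "1.6.0") -> dict:
    """Fill <cuda version> into the two wheel index URLs used above."""
    tag = cuda_tag(cuda_version)
    # detectron2 uses the short torch version (e.g. "1.6") in its URL.
    torch_short = ".".join(torch_version.split(".")[:2])
    return {
        "detectron2": (
            "https://dl.fbaipublicfiles.com/detectron2/wheels/"
            f"{tag}/torch{torch_short}/index.html"
        ),
        "torch-scatter": (
            f"https://pytorch-geometric.com/whl/torch-{torch_version}+{tag}.html"
        ),
    }


urls = wheel_index_urls("10.2")
```

Pass each resulting URL to the matching `pip install ... -f` command.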

Input File Structure

Create the data directory

mkdir data

Prepare the input file.

data/metadata.tsv should contain one video per line, structured as follows. We provide an example input file at examples/metadata.tsv.

YOUTUBE_ID\t{"LatestDAFeature": {"Title": TITLE, "Description": DESCRIPTION, "YouTubeCategory": YOUTUBE_CATEGORY, "VideoLength": VIDEO_LENGTH}, "MediaVersionList": [{"Duration": DURATION}]}
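As a minimal sketch of reading this format, the snippet below splits one line on the first tab and parses the remainder as JSON. The function name `parse_metadata_line` and all sample values are made up for illustration; the value types of `VideoLength` and `Duration` are not specified above, so the sample uses placeholder numbers.

```python
import json


def parse_metadata_line(line: str) -> dict:
    """Parse one `YOUTUBE_ID<TAB>JSON` line into a flat dictionary."""
    youtube_id, payload = line.rstrip("\n").split("\t", 1)
    record = json.loads(payload)
    feature = record["LatestDAFeature"]
    return {
        "youtube_id": youtube_id,
        "title": feature["Title"],
        "description": feature["Description"],
        "category": feature["YouTubeCategory"],
        "video_length": feature["VideoLength"],
        "durations": [v["Duration"] for v in record["MediaVersionList"]],
    }


# Hypothetical sample line; every value here is a placeholder.
sample_line = "abc123XYZ_0\t" + json.dumps(
    {
        "LatestDAFeature": {
            "Title": "Example title",
            "Description": "Example description",
            "YouTubeCategory": "Music",
            "VideoLength": 120,
        },
        "MediaVersionList": [{"Duration": 120}],
    }
)
parsed = parse_metadata_line(sample_line)
```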

Data Curation Pipeline

One-Liner

bash ./run.sh

To enable GPU computation, modify the CUDA_VISIBLE_DEVICES environment variable accordingly. For example, run the above command as export CUDA_VISIBLE_DEVICES=2,3; bash ./run.sh.
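The same invocation can be sketched from Python, which may be convenient when launching the pipeline from another script. The helpers `gpu_env` and `run_pipeline` are hypothetical names, not part of this repository; only `CUDA_VISIBLE_DEVICES` and `run.sh` come from the text above.

```python
import os
import subprocess


def gpu_env(devices, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return env


def run_pipeline(devices, script="./run.sh"):
    """Run the pipeline script with only the given GPUs visible."""
    return subprocess.run(["bash", script], env=gpu_env(devices), check=True)


# Equivalent to: export CUDA_VISIBLE_DEVICES=2,3; bash ./run.sh
# run_pipeline([2, 3])
```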

Step-by-Step

Filter the videos using their metadata.

bash ./metadata_filtering/code/run.sh

The above command will build the data/filtered.tsv file.

Download the actual video files from YouTube.

bash ./video_download/code/run.sh

The above command will download the files to the data/videos/raw directory.

Although we provide a simple download script, we recommend a more scalable solution for downloading large-scale data.

Segment the videos into 10-second clips.

bash ./clip_segmentation/code/run.sh

The above command will save the segmented clips to the data/videos directory.
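To illustrate what segmenting into 10-second clips means, the sketch below computes consecutive clip boundaries for a video of a given duration. It assumes clips start at a fixed offset and that a trailing remainder shorter than one clip is dropped; the repository's script may choose boundaries differently. The function name `segment_bounds` is hypothetical.

```python
def segment_bounds(duration, clip_length=10.0, start=0.0):
    """Return (start, end) pairs for consecutive fixed-length clips.

    Assumption: a trailing remainder shorter than `clip_length` is dropped.
    """
    n_clips = int((duration - start) // clip_length)
    return [
        (start + i * clip_length, start + (i + 1) * clip_length)
        for i in range(n_clips)
    ]


bounds = segment_bounds(35.0)
```

For a 35-second video this yields three clips and discards the final 5 seconds.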

Extract features from the clips.

bash ./feature_extraction/code/run.sh

The above command will save the extracted features to the data/features directory.

This step requires a GPU for faster computation.

Perform clustering with the extracted features.

bash ./clustering/code/run.sh

The above command will save the clustering results to the data/clusters directory.

This step requires a GPU for faster computation.

Select a subset with high audio-visual correspondence using the clustering results.

bash ./subset_selection/code/run.sh

The above command will save the selected clip indices to the data/datasets directory.

This step requires a GPU for faster computation.
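This README does not spell out how audio-visual correspondence is scored from the clustering results. One common way to measure agreement between two sets of cluster assignments is mutual information, and the sketch below computes it for a list of audio cluster labels and a list of visual cluster labels. It is an illustration under that assumption, not the repository's implementation; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def mutual_information(audio_labels, visual_labels):
    """Mutual information (in nats) between two cluster assignments."""
    n = len(audio_labels)
    joint = Counter(zip(audio_labels, visual_labels))
    audio_counts = Counter(audio_labels)
    visual_counts = Counter(visual_labels)
    mi = 0.0
    for (a, v), count in joint.items():
        # p(a, v) * log( p(a, v) / (p(a) * p(v)) ), written with raw counts.
        mi += (count / n) * math.log(
            count * n / (audio_counts[a] * visual_counts[v])
        )
    return mi


# Toy example: audio and visual clusters agree up to a relabeling.
aligned = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
# Toy example: audio and visual clusters are statistically independent.
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A subset whose audio and visual cluster labels agree scores higher than one where they are unrelated, so a score of this kind can rank candidate subsets.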

The final output is saved to the data/output.csv file.

Output File Structure

output.csv is structured as follows. We provide an example output file at examples/output.csv.

# SHARD_NAME,FILENAME,YOUTUBE_ID,SEGMENT
shard-000009,qpxektwhzra_292.mp4,qpxektwhzra,"[292.3329999997, 302.3329999997]"
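A minimal sketch for reading this file with Python's standard library follows. It assumes the header line starts with `#` as shown above and that the SEGMENT column is a JSON-style `[start, end]` list in seconds. The function name `read_output_rows` is hypothetical.

```python
import csv
import io
import json


def read_output_rows(text):
    """Parse output.csv content into a list of dictionaries."""
    # Skip blank lines and the commented header line.
    lines = (
        line for line in io.StringIO(text)
        if line.strip() and not line.startswith("#")
    )
    rows = []
    for shard_name, filename, youtube_id, segment in csv.reader(lines):
        start, end = json.loads(segment)
        rows.append(
            {
                "shard_name": shard_name,
                "filename": filename,
                "youtube_id": youtube_id,
                "start": start,
                "end": end,
            }
        )
    return rows


sample = (
    "# SHARD_NAME,FILENAME,YOUTUBE_ID,SEGMENT\n"
    'shard-000009,qpxektwhzra_292.mp4,qpxektwhzra,'
    '"[292.3329999997, 302.3329999997]"\n'
)
rows = read_output_rows(sample)
```

To read the real file, pass the text of data/output.csv, for example `read_output_rows(open("data/output.csv").read())`.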

Evaluation

Instructions on downstream evaluation are provided in Evaluation.

Correspondence Retrieval

Instructions on correspondence retrieval experiments are provided in Correspondence Retrieval.
