PyTorch implementation of a collection of scalable Video Transformer Benchmarks.

Last update: Jan 08, 2023

Overview

PyTorch implementation of Video Transformer Benchmarks

This repository is mainly built upon PyTorch and PyTorch Lightning. We aim to maintain a collection of scalable video transformer benchmarks and to discuss training recipes for large video transformer models.

So far, we have implemented TimeSformer and ViViT. We have pre-trained TimeSformer-B on Kinetics-600, but cannot yet guarantee the performance reported in the paper. However, we have found some relevant hyper-parameters that may help reach the target performance.

Difference
TODO
Setup
Usage
Result
Acknowledge
Contribution

Difference

To share the basic divided spatial-temporal attention module across different video transformers, we made the following changes.

1. Position embedding

We split the position embedding from R^(n_t*n_h*n_w × d) mentioned in the ViViT paper into R^(n_h*n_w × d) and R^(n_t × d), so that it stays the same as in TimeSformer.
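The split can be illustrated with a minimal sketch. It uses NumPy rather than the repository's PyTorch modules, and the sizes, the names `pos_spatial` and `pos_temporal`, and the function `add_factorized_pos_embed` are assumptions made for illustration, not the repository's actual code. It shows one spatial table shared by all frames and one temporal table shared by all patch locations, combined by broadcasting.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_t, n_h, n_w, d = 8, 14, 14, 768

rng = np.random.default_rng(0)
# Spatial table, shared by every frame: shape (n_h * n_w, d).
pos_spatial = rng.normal(scale=0.02, size=(n_h * n_w, d))
# Temporal table, shared by every patch location: shape (n_t, d).
pos_temporal = rng.normal(scale=0.02, size=(n_t, d))


def add_factorized_pos_embed(tokens, pos_spatial, pos_temporal):
    """Add both tables to patch tokens of shape (batch, n_t, n_h * n_w, d)."""
    # The spatial table broadcasts over time, the temporal table over space.
    return tokens + pos_spatial[None, None, :, :] + pos_temporal[None, :, None, :]


tokens = np.zeros((2, n_t, n_h * n_w, d))
out = add_factorized_pos_embed(tokens, pos_spatial, pos_temporal)

# Parameter counts of a joint table versus the two split tables.
joint_params = n_t * n_h * n_w * d
factorized_params = (n_h * n_w + n_t) * d
```

With these example sizes the two split tables hold far fewer parameters than a single joint table, and the spatial table has the same shape as an image ViT position embedding (without the class token).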

2. Class token

To make explicit whether the class_token takes part in a module's forward computation, we compute the interaction between the class_token and the queries only in the last attention layer (that is, the last layer before the FFN) of each transformer block.

3. Initialize from the pre-trained model

Tokenization: the token embedding filter can be either Conv2D or Conv3D. Conv3D filters can be initialized from Conv2D weights either by replicating them along the temporal dimension and averaging, or by setting them to zero at all temporal positions except the center t/2.
Temporal MSA module weights: one can either copy the weights from the spatial MSA module or initialize all weights with zeros.
Initialize from the MAE pre-trained model provided by ZhiLiang; the class_token, which does not appear in the MAE pre-trained model, is initialized from a truncated normal distribution.
Initialize from the ViT pre-trained model, which can be found here.
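The two Conv2D-to-Conv3D initializations described under Tokenization can be sketched as follows. This is a NumPy illustration under assumptions: the function names are hypothetical, and the weight layout `(out, in, kT, kH, kW)` is assumed to match PyTorch's `Conv3d` convention. It is not the repository's own loading code.

```python
import numpy as np


def inflate_by_averaging(w2d, t):
    """Replicate a (out, in, kH, kW) kernel t times along time and divide by t."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t


def inflate_central_frame(w2d, t):
    """Zeros at every temporal position except the center index t // 2."""
    out_ch, in_ch = w2d.shape[:2]
    w3d = np.zeros((out_ch, in_ch, t) + w2d.shape[2:], dtype=w2d.dtype)
    w3d[:, :, t // 2] = w2d
    return w3d


# Hypothetical patch-embedding kernel: 4 output channels, 3 input channels, 16x16 patches.
w2d = np.random.default_rng(0).normal(size=(4, 3, 16, 16))
w3d_avg = inflate_by_averaging(w2d, t=2)
w3d_center = inflate_central_frame(w2d, t=3)
```

In both cases the 3D kernel summed over its temporal axis equals the original 2D kernel, so a clip made of identical frames produces the same token embeddings as the 2D model would on a single frame.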

TODO

Add pre-trained weights for more TimeSformer and ViViT variants.
- A larger version and other operation types.
Add linear probing and partial fine-tuning.
- Make it possible to transfer the pre-trained model to downstream tasks.
Add more scalable video transformer benchmarks.
- We will also extend to multi-modal versions, e.g. Perceiver is coming soon.
Add more diverse objective functions.
- Pre-train on larger datasets with the dominant self-supervised methods, e.g. contrastive learning and MAE.

Setup

pip install -r requirements.txt

Usage

Training

# path to Kinetics600 train set
TRAIN_DATA_PATH='/path/to/Kinetics600/train_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'

python model_pretrain.py \
	-lr 0.005 \
	-pretrain 'vit' \
	-epoch 15 \
	-batch_size 8 \
	-num_class 600 \
	-frame_interval 32 \
	-root_dir "$ROOT_DIR" \
	-train_data_path "$TRAIN_DATA_PATH"

The minimal folder structure looks like this.

root_dir
├── pretrain_model
│   ├── pretrain_mae_vit_base_mask_0.75_400e.pth
│   ├── vit_base_patch16_224.pth
├── results
│   ├── experiment_tag
│   │   ├── ckpt
│   │   ├── log

Inference

# path to Kinetics600 pre-trained model
PRETRAIN_PATH='/path/to/pre-trained model'
# path to the test video sample
VIDEO_PATH='/path/to/video sample'

python model_inference.py \
	-pretrain "$PRETRAIN_PATH" \
	-video_path "$VIDEO_PATH" \
	-num_frames 8 \
	-frame_interval 32

Result

Kinetics-600

1. Model Zoo

name	pretrain	epochs	num frames	spatial crop	top1_acc	top5_acc	weight	log
TimeSformer-B	ImageNet-21K	15e	8	224	78.4	93.6	Google drive or BaiduYun(code: yr4j)	log

2. Train Recipe(ablation study)

2.1 Acc

operation	top1_acc	top5_acc	top1_acc (three crop)
base	68.2	87.6	-
+ `frame_interval` 4 -> 16 (span more time)	72.9(+4.7)	91.0(+3.4)	-
+ RandomCrop, flip (overcome overfit)	75.7(+2.8)	92.5(+1.5)	-
+ `batch size` 16 -> 8 (more iterations)	75.8(+0.1)	92.4(-0.1)	-
+ `frame_interval` 16 -> 24 (span more time)	77.7(+1.9)	93.3(+0.9)	78.4
+ `frame_interval` 24 -> 32 (span more time)	78.4(+0.7)	94.0(+0.7)	79.1

Tip: frame_interval and data augmentation both matter for the validation accuracy.
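For readers unfamiliar with the top1_acc and top5_acc columns, the metric can be sketched as below. This is a generic NumPy illustration with a hypothetical function name and made-up scores; it is not the repository's evaluation code.

```python
import numpy as np


def topk_accuracy(logits, labels, k):
    """Percentage of samples whose label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]      # indices of the k largest scores
    hits = (topk == labels[:, None]).any(axis=1)   # True where the label is in the top k
    return 100.0 * hits.mean()


# Made-up scores for 3 samples and 3 classes.
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = topk_accuracy(logits, labels, k=1)
top2 = topk_accuracy(logits, labels, k=2)
```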

2.2 Time

operation	epoch_time
base (start with DDP)	9h+
+ `speed up training recipes`	1h+
+ switch from `get_batch first` to `sample_Indice first`	0.5h
+ `batch size` 16 -> 8	33.32m
+ `num_workers` 8 -> 4	35.52m
+ `frame_interval` 16 -> 24	44.35m

Tip: increasing frame_interval noticeably increases the time per epoch.

1. speed up training recipes:

Use more GPU devices.
Set pin_memory=True.
Avoid unnecessary GPU->CPU transfers and synchronization (such as .item(), .numpy(), .cpu() operations on tensors, or logging to disk).

2. get_batch first means that we first read all frames through the video reader and then take the target slice of frames, which greatly slows down data loading. sample_Indice first instead samples the target frame indices first and reads only those frames.
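Sampling the indices before decoding can be sketched as follows. The function name, the clamping of short videos, and the random start position are assumptions for illustration; the repository's own sampler may differ. Only the returned indices would then be requested from the video reader, so the remaining frames never need to be decoded.

```python
import random


def sample_frame_indices(total_frames, num_frames, frame_interval, rng=random):
    """Pick num_frames indices spaced frame_interval apart from a random start."""
    # Number of source frames covered by one clip.
    span = (num_frames - 1) * frame_interval + 1
    max_start = max(total_frames - span, 0)
    start = rng.randint(0, max_start)  # randint is inclusive on both ends
    # Clamp so that videos shorter than the span repeat their last frame.
    return [min(start + i * frame_interval, total_frames - 1)
            for i in range(num_frames)]


# 8 frames, 32 apart, from a hypothetical 300-frame video.
indices = sample_frame_indices(300, 8, 32, random.Random(0))
# A 10-frame video is shorter than the span, so later indices are clamped to 9.
short_indices = sample_frame_indices(10, 8, 32)
```

With num_frames 8 and frame_interval 32, one clip covers 225 source frames, which is consistent with the observation above that a larger frame_interval spans more time but costs more per epoch.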

Acknowledge

This repo is built on top of PyTorch Lightning, decord, and kornia. I also learned many code designs from MMAction2. I thank the authors for releasing their code.

Contribution

I look forward to hearing ideas about this repo. Please feel free to report them in an issue or, even better, submit a pull request.

Your star is my motivation, thank you!

Owner

Xin Ma
