This is an official implementation for "Video Swin Transformers".

Overview

Video Swin Transformer

PWC PWC PWC

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).

teaser

Results and Models

Kinetics 400

Backbone Pretrain Lr Schd spatial crop [email protected] [email protected] #params FLOPs config model
Swin-T ImageNet-1K 30ep 224 78.8 93.6 28M 87.9G config github/baidu
Swin-S ImageNet-1K 30ep 224 80.6 94.5 50M 165.9G config github/baidu
Swin-B ImageNet-1K 30ep 224 80.6 94.6 88M 281.6G config github/baidu
Swin-B ImageNet-22K 30ep 224 82.7 95.5 88M 281.6G config github/baidu

Kinetics 600

Backbone Pretrain Lr Schd spatial crop [email protected] [email protected] #params FLOPs config model
Swin-B ImageNet-22K 30ep 224 84.0 96.5 88M 281.6G config github/baidu

Something-Something V2

Backbone Pretrain Lr Schd spatial crop [email protected] [email protected] #params FLOPs config model
Swin-B Kinetics 400 60ep 224 69.6 92.7 89M 320.6G config github/baidu

Notes:

Usage

Installation

Please refer to install.md for installation.

We also provide docker file cuda10.1 (image url) and cuda11.0 (image url) for convenient usage.

Data Preparation

Please refer to data_preparation.md for a general knowledge of data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy

Training

To train a video recognition model with pre-trained image models (for Kinetics-400 and Kineticc-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model for Kinetics-400 dataset with 8 gpus, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> 

To train a video recognizer with pre-trained video models (for Something-Something v2 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model for SSv2 dataset with 8 gpus, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Other Links

Image Classification: See Swin Transformer for Image Classification.

Object Detection: See Swin Transformer for Object Detection.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Self-Supervised Learning: See MoBY with Swin Transformer.

Owner
Swin Transformer
This organization maintains repositories built on Swin Transformers. The pretrained models locate at https://github.com/microsoft/Swin-Transformer
Swin Transformer
[ICCV'2021] "SSH: A Self-Supervised Framework for Image Harmonization", Yifan Jiang, He Zhang, Jianming Zhang, Yilin Wang, Zhe Lin, Kalyan Sunkavalli, Simon Chen, Sohrab Amirghodsi, Sarah Kong, Zhangyang Wang

SSH: A Self-Supervised Framework for Image Harmonization (ICCV 2021) code for SSH Representative Examples Main Pipeline RealHM DataSet Google Drive Pr

VITA 86 Dec 02, 2022
Official implementation of EfficientPose

EfficientPose This is the official implementation of EfficientPose. We based our work on the Keras EfficientDet implementation xuannianz/EfficientDet

2 May 17, 2022
RMNA: A Neighbor Aggregation-Based Knowledge Graph Representation Learning Model Using Rule Mining

RMNA: A Neighbor Aggregation-Based Knowledge Graph Representation Learning Model Using Rule Mining Our code is based on Learning Attention-based Embed

宋朝都 4 Aug 07, 2022
Performant, differentiable reinforcement learning

deluca Performant, differentiable reinforcement learning Notes This is pre-alpha software and is undergoing a number of core changes. Updates to follo

Google 114 Dec 27, 2022
Using VapourSynth with super resolution models and speeding them up with TensorRT.

VSGAN-tensorrt-docker Using image super resolution models with vapoursynth and speeding them up with TensorRT. Using NVIDIA/Torch-TensorRT combined wi

111 Jan 05, 2023
Laplacian Score-regularized Concrete Autoencoders

Laplacian Score-regularized Concrete Autoencoders Requirements: torch = 1.9 scikit-learn = 0.24 omegaconf = 2.0.6 scipy = 1.6.0 matplotlib How to

JS 6 Dec 07, 2022
Probabilistic Tensor Decomposition of Neural Population Spiking Activity

Probabilistic Tensor Decomposition of Neural Population Spiking Activity Matlab (recommended) and Python (in developement) implementations of Soulat e

Hugo Soulat 6 Nov 30, 2022
On Nonlinear Latent Transformations for GAN-based Image Editing - PyTorch implementation

On Nonlinear Latent Transformations for GAN-based Image Editing - PyTorch implementation On Nonlinear Latent Transformations for GAN-based Image Editi

Valentin Khrulkov 22 Oct 24, 2022
Implementation of gaze tracking and demo

Predicting Customer Demand by Using Gaze Detecting and Object Tracking This project is the integration of gaze detecting and object tracking. Predict

2 Oct 20, 2022
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Squirrel Core Share, load, and transform data in a collaborative, flexible, and efficient way What is Squirrel? Squirrel is a Python library that enab

Merantix Momentum 249 Dec 07, 2022
Continuous Time LiDAR odometry

CT-ICP: Elastic SLAM for LiDAR sensors This repository implements the SLAM CT-ICP (see our article), a lightweight, precise and versatile pure LiDAR o

385 Dec 29, 2022
Code Repository for The Kaggle Book, Published by Packt Publishing

The Kaggle Book Data analysis and machine learning for competitive data science Code Repository for The Kaggle Book, Published by Packt Publishing "Lu

Packt 1.6k Jan 07, 2023
Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
A real-time speech emotion recognition application using Scikit-learn and gradio

Speech-Emotion-Recognition-App A real-time speech emotion recognition application using Scikit-learn and gradio. Requirements librosa==0.6.3 numpy sou

Son Tran 6 Oct 04, 2022
Joint learning of images and text via maximization of mutual information

mutual_info_img_txt Joint learning of images and text via maximization of mutual information. This repository incorporates the algorithms presented in

Ruizhi Liao 10 Dec 22, 2022
A minimalist environment for decision-making in autonomous driving

highway-env A collection of environments for autonomous driving and tactical decision-making tasks An episode of one of the environments available in

Edouard Leurent 1.6k Jan 07, 2023
scalingscattering

Scaling The Scattering Transform : Deep Hybrid Networks This repository contains the experiments found in the paper: https://arxiv.org/abs/1703.08961

Edouard Oyallon 78 Dec 21, 2022
"Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback"

This is code repo for our EMNLP 2017 paper "Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback", which implements the A2C algorithm on top of a neural encoder-

Khanh Nguyen 131 Oct 21, 2022
Repository for the semantic WMI loss

Installation: pip install -e . Installing DL2: First clone DL2 in a separate directory and install it using the following commands: git clone https:/

Nick Hoernle 4 Sep 15, 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code

Learning the Beauty in Songs: Neural Singing Voice Beautifier Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao Zhejiang University ACL 2022 Mai

Jinglin Liu 257 Dec 30, 2022