PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Last update: Jan 02, 2023


Overview

Cross-Covariance Image Transformer (XCiT)


Linear complexity in time and memory

Our XCiT models have linear complexity with respect to the number of patches/tokens $N$: $\mathcal{O}(N d^2)$
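As a rough illustration of what linear scaling means in practice, the sketch below compares the $\mathcal{O}(N d^2)$ term above with the $\mathcal{O}(N^2 d)$ term of standard token self-attention. It only counts the dominant term and ignores constants and attention heads; the function names and the embedding dimension of 384 are illustrative assumptions, not values taken from this repository.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def attention_cost(image_size: int, patch_size: int, dim: int) -> dict:
    """Dominant-term cost estimates (constants and heads ignored)."""
    n = num_tokens(image_size, patch_size)
    return {
        "tokens": n,
        # Token self-attention builds an N x N attention map: O(N^2 d).
        "token_attention": n * n * dim,
        # Cross-covariance attention builds a d x d map: O(N d^2).
        "cross_covariance": n * dim * dim,
    }


# dim=384 is an assumed, illustrative embedding size.
for size in (224, 448, 896):
    cost = attention_cost(size, patch_size=16, dim=384)
    print(size, cost["tokens"], cost["token_attention"], cost["cross_covariance"])
```

Doubling the image side multiplies the number of tokens by 4, so the cross-covariance estimate grows by 4x while the token self-attention estimate grows by 16x.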


[Figures: Peak Memory (inference); Millisecond/Image (inference)]

Scaling to high resolution inputs

XCiT scales to high-resolution inputs thanks to both its lower compute requirements and its better adaptability to higher resolutions at test time (see Figure 3 in the paper).

Detection and Instance Segmentation for Ultra high resolution images (6000x4000)


XCiT+DINO: High Res. Self-Attention Visualization 🦖

XCiT models trained with DINO self-supervision can produce high-resolution attention maps.

xcit_dino.mp4

Self-Attention visualization per head

Below we show the attention maps for each of the 8 heads separately. We can observe that every head specializes in a different semantic aspect of the scene, for both the foreground and the background.

Multi_head.mp4

Getting Started

First, clone the repo

git clone https://github.com/facebookresearch/XCiT.git

Then install the required packages, which include PyTorch 1.7.1, torchvision 0.8.2 and timm 0.4.8:

pip install -r requirements.txt
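To confirm that the environment matches the pinned versions listed above, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, and it assumes the packages are installed under their usual PyPI names (torch, torchvision, timm).

```python
from importlib import metadata

# Versions listed in this README.
PINNED = {"torch": "1.7.1", "torchvision": "0.8.2", "timm": "0.4.8"}


def installed_versions(names):
    """Look up installed versions; None means the package was not found."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in PINNED.items():
        found = installed.get(name)
        # Ignore local build suffixes such as "+cu110".
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(installed_versions(PINNED)).items():
        print(f"{name}: expected {expected}, found {found}")
```

The script prints nothing when all three pins are satisfied.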

Download and extract the ImageNet dataset. Afterwards, set the --data-path argument to the corresponding extracted ImageNet path.

For full details about all the available arguments, you can use

python main.py --help

For detection and segmentation downstream tasks, please check:

COCO Object detection and Instance segmentation: XCiT Detection
ADE20k Semantic segmentation: XCiT Semantic Segmentation

Model Zoo

We provide XCiT models with weights pre-trained on ImageNet-1k.

§: distillation

Models with 16x16 patch size

Arch	params	top-1 (224)	weights (224)	top-1 (224 §)	weights (224 §)	top-1 (384 §)	weights (384 §)
xcit_nano_12_p16	3M	69.9%	download	72.2%	download	75.4%	download
xcit_tiny_12_p16	7M	77.1%	download	78.6%	download	80.9%	download
xcit_tiny_24_p16	12M	79.4%	download	80.4%	download	82.6%	download
xcit_small_12_p16	26M	82.0%	download	83.3%	download	84.7%	download
xcit_small_24_p16	48M	82.6%	download	83.9%	download	85.1%	download
xcit_medium_24_p16	84M	82.7%	download	84.3%	download	85.4%	download
xcit_large_24_p16	189M	82.9%	download	84.9%	download	85.8%	download

Models with 8x8 patch size

Arch	params	top-1 (224)	weights (224)	top-1 (224 §)	weights (224 §)	top-1 (384 §)	weights (384 §)
xcit_nano_12_p8	3M	73.8%	download	76.3%	download	77.8%	download
xcit_tiny_12_p8	7M	79.7%	download	81.2%	download	82.4%	download
xcit_tiny_24_p8	12M	81.9%	download	82.6%	download	83.7%	download
xcit_small_12_p8	26M	83.4%	download	84.2%	download	85.1%	download
xcit_small_24_p8	48M	83.9%	download	84.9%	download	85.6%	download
xcit_medium_24_p8	84M	83.7%	download	85.1%	download	85.8%	download
xcit_large_24_p8	189M	84.4%	download	85.4%	download	86.0%	download

XCiT + DINO Self-supervised models

Arch	params	k-nn	linear	download
xcit_small_12_p16	26M	76.0%	77.8%	backbone
xcit_small_12_p8	26M	77.1%	79.2%	backbone
xcit_medium_24_p16	84M	76.4%	78.8%	backbone
xcit_medium_24_p8	84M	77.9%	80.3%	backbone

Training

For training using a single node, use the following command

python -m torch.distributed.launch --nproc_per_node=[NUM_GPUS] --use_env main.py --model [MODEL_KEY] --batch-size [BATCH_SIZE] --drop-path [STOCHASTIC_DEPTH_RATIO] --output_dir [OUTPUT_PATH]

For example, the XCiT-S12/16 model can be trained using the following command

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model xcit_small_12_p16 --batch-size 128 --drop-path 0.05 --output_dir /experiments/xcit_small_12_p16/ --epochs [NUM_EPOCHS]
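The model keys passed to --model appear to follow the pattern xcit_{size}_{depth}_p{patch_size}, which maps onto short names such as XCiT-S12/16. The sketch below parses a key under that assumption; the helper names are hypothetical, and reading the middle number as the network depth follows the paper's naming rather than anything stated in this README.

```python
import re

# Assumed key layout: xcit_{size}_{depth}_p{patch_size}
_KEY_PATTERN = re.compile(r"^xcit_(nano|tiny|small|medium|large)_(\d+)_p(\d+)$")


def parse_model_key(key: str):
    """Split a model key into (size, depth, patch_size)."""
    match = _KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognized XCiT model key: {key!r}")
    size, depth, patch = match.groups()
    return size, int(depth), int(patch)


def short_name(key: str) -> str:
    """Build the short name, e.g. xcit_small_12_p16 -> XCiT-S12/16."""
    size, depth, patch = parse_model_key(key)
    return f"XCiT-{size[0].upper()}{depth}/{patch}"


print(short_name("xcit_small_12_p16"))  # XCiT-S12/16
```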

For multinode training via SLURM you can alternatively use

python run_with_submitit.py --partition [PARTITION_NAME] --nodes 2 --ngpus 8 --model xcit_small_12_p16 --batch-size 64 --drop-path 0.05 --job_dir /experiments/xcit_small_12_p16/ --epochs 400
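The single-node and multinode examples use different --batch-size values. Assuming --batch-size is the per-GPU batch size (as in the DeiT code base this repository builds on; this is an assumption, not something stated here), both commands give the same global batch size, which the following arithmetic sketch checks. The function name is hypothetical.

```python
def global_batch_size(nodes: int, gpus_per_node: int, batch_per_gpu: int) -> int:
    """Total number of images per optimization step across all GPUs."""
    return nodes * gpus_per_node * batch_per_gpu


# Single node: 8 GPUs, --batch-size 128
single_node = global_batch_size(nodes=1, gpus_per_node=8, batch_per_gpu=128)

# SLURM example: 2 nodes x 8 GPUs, --batch-size 64
multi_node = global_batch_size(nodes=2, gpus_per_node=8, batch_per_gpu=64)

print(single_node, multi_node)  # 1024 1024
```

When changing the number of GPUs, keeping this product constant is one simple way to stay close to the example setup.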

More details on the hyper-parameters used to train the different models can be found in Table B.1 in the paper.

Evaluation

To evaluate an XCiT model, using either the checkpoints above or models you trained yourself, use the following command:

python main.py --eval --model <MODEL_KEY> --input-size <IMG_SIZE> [--full_crop] --pretrained <PATH/URL>

By default we use the --full_crop flag, which evaluates the model with a crop ratio of 1.0 instead of 0.875, following CaiT.
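To make the crop ratio concrete: in the common resize-then-center-crop evaluation pipeline, the image is first resized to input_size / crop_ratio and then center-cropped to input_size. The sketch below shows that arithmetic only; it assumes this common convention and a hypothetical function name, and the repository's own preprocessing may differ in detail.

```python
import math


def resize_before_crop(input_size: int, crop_ratio: float) -> int:
    """Side length to resize to before center-cropping to input_size."""
    return math.floor(input_size / crop_ratio)


# Crop ratio 0.875: resize to 256, then center-crop the central 224 pixels.
print(resize_before_crop(224, 0.875))  # 256

# Crop ratio 1.0 (--full_crop): resize to 224, the crop keeps the whole image.
print(resize_before_crop(224, 1.0))  # 224
```

With a crop ratio of 1.0 the model sees the full resized image rather than only its central region.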

For example, the command to evaluate the XCiT-S12/16 using 224x224 images:

python main.py --eval --model xcit_small_12_p16 --input-size 224 --full_crop --pretrained https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224.pth

Acknowledgement

This repository is built using the Timm library and the DeiT repository. The self-supervised training is based on the DINO repository.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Citation

If you find this repository useful, please consider citing our work:

@misc{elnouby2021xcit,
      title={XCiT: Cross-Covariance Image Transformers}, 
      author={Alaaeldin El-Nouby and Hugo Touvron and Mathilde Caron and Piotr Bojanowski and Matthijs Douze and Armand Joulin and Ivan Laptev and Natalia Neverova and Gabriel Synnaeve and Jakob Verbeek and Hervé Jegou},
      year={2021},
      journal={arXiv preprint arXiv:2106.09681},
}


Owner

Facebook Research
