Official implementation of "CAT: Cross Attention in Vision Transformer".

Last update: Dec 15, 2022

CAT: Cross Attention in Vision Transformer

This is the official implementation of "CAT: Cross Attention in Vision Transformer".

Abstract

Since Transformer has found widespread use in NLP, the potential of Transformer in CV has been realized and has inspired many new approaches. However, the computation required for replacing word tokens with image patches for Transformer after the tokenization of the image is vast (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention within the image patch instead of over the whole image to capture local information, and applies attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than standard self-attention in Transformer. By alternately applying attention within patches and between patches, we implement cross attention to maintain performance at lower computational cost and build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone.

CAT achieves strong performance on COCO object detection (implemented with mmdetection) and ADE20K semantic segmentation (implemented with mmsegmentation).
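The abstract's claim that both attention operations need less computation than standard self-attention can be checked with a rough multiply-add count. The sketch below is a back-of-envelope estimate, not code from this repository: the function names are hypothetical, the 56x56 feature map, 96 channels and 7x7 patch size are illustrative assumptions, and the count ignores attention heads, softmax and biases.

```python
def global_attention_cost(h, w, c):
    """Multiply-adds for self-attention over all h*w tokens of width c."""
    tokens = h * w
    projections = 4 * tokens * c * c      # Q, K, V and output projections
    attention = 2 * tokens * tokens * c   # Q @ K^T, then weights @ V
    return projections + attention


def inner_patch_cost(h, w, c, n):
    """Attention restricted to each n x n patch (n*n tokens of width c)."""
    patches = (h // n) * (w // n)
    tokens = n * n
    projections = 4 * tokens * c * c
    attention = 2 * tokens * tokens * c
    return patches * (projections + attention)


def cross_patch_cost(h, w, c, n):
    """Attention between patches of each single-channel feature map.

    Every channel is handled on its own: the map is split into
    (h/n) * (w/n) patches and each flattened patch (width n*n) is a token.
    """
    tokens = (h // n) * (w // n)
    width = n * n
    projections = 4 * tokens * width * width
    attention = 2 * tokens * tokens * width
    return c * (projections + attention)


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the model configs).
    h, w, c, n = 56, 56, 96, 7
    print("global     :", global_attention_cost(h, w, c))
    print("inner-patch:", inner_patch_cost(h, w, c, n))
    print("cross-patch:", cross_patch_cost(h, w, c, n))
```

Under these assumptions the global cost grows with the square of the number of pixels, while the inner-patch cost grows linearly with it and the cross-patch cost has its quadratic term divided by the patch area.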

Pretrained Models and Results on ImageNet-1K

name	resolution	acc@1	acc@5	#params	FLOPs	model	log
CAT-T	224x224	80.3	95.0	17M	2.8G	github	github
CAT-S^*	224x224	81.8	95.6	37M	5.9G	github	github
CAT-B	224x224	82.8	96.1	52M	8.9G	github	github
CAT-T-v2	224x224	81.7	95.5	36M	3.9G	Coming	Coming

Note: ^* indicates a new version of the model and log.

Models and Results on Object Detection (COCO 2017 val)

Backbone	Method	pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	model	log
CAT-S	Mask R-CNN⁺	ImageNet-1K	1x	41.6	38.6	57M	295G	github	github
CAT-B	Mask R-CNN⁺	ImageNet-1K	1x	41.8	38.7	71M	356G	github	github
CAT-S	FCOS	ImageNet-1K	1x	40.0	-	45M	245G	github	github
CAT-B	FCOS	ImageNet-1K	1x	41.0	-	59M	303G	github	github
CAT-S	ATSS	ImageNet-1K	1x	42.0	-	45M	243G	github	github
CAT-B	ATSS	ImageNet-1K	1x	42.5	-	59M	303G	github	github
CAT-S	RetinaNet	ImageNet-1K	1x	40.1	-	47M	276G	github	github
CAT-B	RetinaNet	ImageNet-1K	1x	41.4	-	62M	337G	github	github
CAT-S	Cascade R-CNN	ImageNet-1K	1x	44.1	-	82M	270G	github	github
CAT-B	Cascade R-CNN	ImageNet-1K	1x	44.8	-	96M	330G	github	github
CAT-S	Cascade R-CNN⁺	ImageNet-1K	1x	45.2	-	82M	270G	github	github
CAT-B	Cascade R-CNN⁺	ImageNet-1K	1x	46.3	-	96M	330G	github	github

Note: ⁺ indicates multi-scale training.

Models and Results on Semantic Segmentation (ADE20K val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	mIoU (ms+flip)	#params	FLOPs	model	log
CAT-S	Semantic FPN	ImageNet-1K	512x512	80K	40.6	42.1	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	80K	42.2	43.6	55M	276G	github	github
CAT-S	Semantic FPN	ImageNet-1K	512x512	160K	42.2	42.8	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	160K	43.2	44.9	55M	276G	github	github

Citing CAT

You can cite the paper as:

@article{lin2021cat,
  title={CAT: Cross Attention in Vision Transformer},
  author={Hezheng Lin and Xing Cheng and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Qing Song and Wei Yuan},
  journal={arXiv preprint arXiv:2106.05786},
  year={2021}
}

Getting Started

Please refer to get_started.

Acknowledgement

Our implementation is mainly based on Swin.

Releases(v1.0)

v1.0(Jun 5, 2022)
