"Exploring Vision Transformers for Fine-grained Classification" at CVPRW FGVC8

Last update: Dec 06, 2022

Overview

FGVC8

The paper "Exploring Vision Transformers for Fine-grained Classification" was presented at CVPR 2021, in the Eighth Workshop on Fine-Grained Visual Categorization (FGVC8), on June 25th.

Abstract

Existing computer vision research in categorization struggles with fine-grained attribute recognition due to the inherently high intra-class variance and low inter-class variance. SOTA methods tackle this challenge by locating the most informative image regions and relying on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows strong performance in both traditional and fine-grained classification tasks.

In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities.
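As a rough illustration of the localization idea described above, the sketch below turns a class-token attention map into a crop box around the most-attended patches. It is not the authors' implementation (the official instructions are still listed as upcoming). The function names `attention_bbox` and `crop_to_box`, the mean-based threshold, and the assumption of a square patch grid are all hypothetical choices made for this example.

```python
import numpy as np


def attention_bbox(cls_attention, grid_size, patch_size, threshold_ratio=1.0):
    """Return a (top, left, bottom, right) pixel box around high-attention patches.

    cls_attention: array of shape (num_heads, grid_size * grid_size) holding the
    attention from the class token to every image patch in one layer.
    All parameter names here are illustrative assumptions.
    """
    # Average the heads to get one saliency value per patch.
    attn = np.asarray(cls_attention).mean(axis=0).reshape(grid_size, grid_size)

    # Keep patches whose attention exceeds a multiple of the mean attention.
    mask = attn > threshold_ratio * attn.mean()
    if not mask.any():
        # No patch stands out: fall back to the full image.
        full = grid_size * patch_size
        return 0, 0, full, full

    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]

    # Convert patch indices to pixel coordinates (bottom/right are exclusive).
    top, bottom = rows[0] * patch_size, (rows[-1] + 1) * patch_size
    left, right = cols[0] * patch_size, (cols[-1] + 1) * patch_size
    return int(top), int(left), int(bottom), int(right)


def crop_to_box(image, box):
    """Crop an (H, W, C) image array to the given pixel box."""
    top, left, bottom, right = box
    return image[top:bottom, left:right]
```

In a multi-stage setup of this kind, the cropped region would typically be resized to the model's input resolution and passed through the same ViT again, so the architecture itself stays unchanged. The same mask could also drive an augmentation, for example by erasing the selected patches instead of cropping to them.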

We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also illustrate our model's interpretability through qualitative results.

Instructions

Upcoming

Citation

If you find our results interesting, or you use our code or ideas, please consider citing our work:

@misc{conde2021exploring,
      title={Exploring Vision Transformers for Fine-grained Classification}, 
      author={Marcos V. Conde and Kerem Turgutlu},
      year={2021},
      eprint={2106.10587},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


Owner

Marcos V. Conde
