Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Last update: Dec 28, 2022

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN).

This paper introduces a novel method, the Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.

Datasets

We use LMDB datasets for training, validation, and evaluation. The datasets can be downloaded from clova (for validation and evaluation) and ABINet (for training and evaluation).

Training datasets
- MJSynth (MJ)
- SynthText (ST)
Validation datasets
- The union of the training sets of ICDAR2013, ICDAR2015, IIIT5K, and Street View Text
Evaluation datasets
- Regular datasets
  - IIIT5K (IIIT)
  - Street View Text (SVT)
  - ICDAR2013: IC13_S with 857 images, IC13_L with 1015 images
- Irregular datasets
  - ICDAR2015: IC15_S with 1811 images, IC15_L with 2077 images
  - Street View Text Perspective (SVTP)
  - CUTE80 (CUTE)

Tree structure of data directory

data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IC15_2077
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── validation
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
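
Before training, it can help to confirm that the data directory matches the tree above. The following is a minimal sketch, not part of the repository: the function name `find_missing_entries` and the constant `EXPECTED_ENTRIES` are hypothetical, and the entry list is simply copied from the tree.

```python
from pathlib import Path

# Entries copied from the directory tree above; paths are relative to the data root.
EXPECTED_ENTRIES = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC13_1015",
    "evaluation/IC15_1811",
    "evaluation/IC15_2077",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "validation",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def find_missing_entries(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # "data" is the root directory name shown in the tree above.
    missing = find_missing_entries("data")
    if missing:
        print("Missing entries:")
        for entry in missing:
            print("  " + entry)
    else:
        print("Data directory matches the expected layout.")
```

The check only tests for existence; it does not open the LMDB files or validate their contents.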

Requirements

pip install torch==1.7.1 torchvision==0.8.2 fastai==1.0.60 lmdb pillow opencv-python
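
Because the command above pins exact versions, a quick check of the active environment can catch mismatches early. This is a rough sketch using only the standard library (Python 3.8+); `PINNED` and `find_version_mismatches` are hypothetical names, and stripping a local build suffix such as `+cu110` before comparing is an assumption about how the installed wheels are labeled.

```python
from importlib import metadata

# Version pins copied from the pip command above.
PINNED = {"torch": "1.7.1", "torchvision": "0.8.2", "fastai": "1.0.60"}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ from their pin."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore a local build suffix such as "+cu110" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_version_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```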

Pretrained Models

Download the pretrained MATRN model from this link. The performance of the pretrained model is:

Model	IIIT	SVT	IC13_S	IC13_L	IC15_S	IC15_L	SVTP	CUTE
MATRN	96.7	94.9	97.9	95.8	86.6	82.9	90.5	94.1

If you want to train with pretrained vision and language models, download them from ABINet.

Training and Evaluation

Training

python main.py --config=configs/train_matrn.yaml

Evaluation

python main.py --config=configs/train_matrn.yaml --phase test --image_only

Additional flags:

- --checkpoint /path/to/checkpoint: set the path of the evaluation model
- --test_root /path/to/dataset: set the path of the evaluation dataset
- --model_eval [alignment|vision|language]: which sub-model to evaluate
- --image_only: disable dumping visualization of attention masks
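
To evaluate several benchmark sets in one go, the flags above can be assembled programmatically. The sketch below only builds and prints the commands. It assumes that `--test_root` accepts a single evaluation set directory under `data/evaluation`, which this README does not confirm, and `build_eval_command` and `EVAL_SETS` are hypothetical helper names.

```python
import shlex

# Evaluation set directory names from the data tree in the Datasets section.
EVAL_SETS = ["IIIT5k_3000", "SVT", "IC13_857", "IC13_1015",
             "IC15_1811", "IC15_2077", "SVTP", "CUTE80"]


def build_eval_command(checkpoint, test_root, model_eval,
                       config="configs/train_matrn.yaml"):
    """Assemble the evaluation command as an argument list (usable with subprocess.run)."""
    if model_eval not in ("alignment", "vision", "language"):
        raise ValueError("model_eval must be alignment, vision, or language")
    return [
        "python", "main.py",
        "--config=" + config,
        "--phase", "test",
        "--image_only",
        "--checkpoint", checkpoint,
        "--test_root", test_root,
        "--model_eval", model_eval,
    ]


if __name__ == "__main__":
    # "/path/to/checkpoint" is a placeholder; substitute the downloaded model file.
    for name in EVAL_SETS:
        cmd = build_eval_command("/path/to/checkpoint",
                                 "data/evaluation/" + name, "alignment")
        print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces intact if the command is later passed to `subprocess.run`.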

Acknowledgements

This implementation is based on ABINet.

Citation

Please cite this work in your publications if it helps your research.

@article{na2021multi,
  title={Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features},
  author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae},
  journal={arXiv preprint arXiv:2111.15263},
  year={2021}
}
