A Persian image captioning model based on the Vision Encoder Decoder Models of 🤗 Transformers.

Last update: Aug 25, 2022

Overview

Persian-Image-Captioning

We fine-tune the Vision Encoder Decoder Model for image captioning on the coco-flickr-farsi dataset. The model is implemented in PyTorch with the Hugging Face 🤗 transformers library.

You can use any pretrained vision model as the encoder and any pretrained language model as the decoder in the Vision Encoder Decoder Model. Here we use ViT as the encoder and ParsBERT (v2.0) as the decoder. The encoder and decoder are loaded separately via the from_pretrained() function. Cross-attention layers are randomly initialized and added to the decoder, so the combined model must be fine-tuned before it produces useful captions.
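As a rough illustration of how such an encoder-decoder pair can be assembled with transformers, the sketch below uses VisionEncoderDecoderModel.from_encoder_decoder_pretrained(). The helper name build_captioning_model is hypothetical, and the decoder checkpoint HooshvareLab/bert-fa-base-uncased is an assumption about which ParsBERT v2.0 weights were used; this is not necessarily the exact setup behind the released model.

```python
# Assumed checkpoints: the encoder matches the feature extractor used in this README,
# the decoder is a guess at the ParsBERT v2.0 base weights.
ENCODER_CHECKPOINT = "google/vit-base-patch16-224-in21k"
DECODER_CHECKPOINT = "HooshvareLab/bert-fa-base-uncased"


def build_captioning_model(encoder_checkpoint=ENCODER_CHECKPOINT,
                           decoder_checkpoint=DECODER_CHECKPOINT):
    """Combine a pretrained vision encoder and a pretrained text decoder (hypothetical helper)."""
    # Imported here so the helper can be defined before the heavy library is loaded.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    # Loads both checkpoints; the decoder's cross-attention layers are newly initialized.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_checkpoint, decoder_checkpoint
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)

    # BERT has no dedicated start-of-sequence token, so [CLS] is used to start decoding.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling build_captioning_model() downloads both checkpoints and returns a model that still needs fine-tuning on caption data.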

Refer to the Vision Encoder Decoder Model documentation for more information.
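For the fine-tuning itself, a minimal sketch of one training step might look like the following. It assumes captions have already been tokenized and padded, and that images have been converted to pixel_values tensors. The names mask_padding and training_step are hypothetical helpers, not part of this repository, and the authors' actual training loop may differ.

```python
from typing import List

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def mask_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding token ids with IGNORE_INDEX so padding does not contribute to the loss."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


def training_step(model, optimizer, pixel_values, labels):
    """Run one optimization step; `labels` is a tensor of masked caption token ids."""
    model.train()
    # When labels are given, the model derives the decoder inputs from them and returns the loss.
    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

For example, mask_padding([[2, 15, 4, 0, 0]], pad_token_id=0) keeps the three caption tokens and masks the two padding positions.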

How to use

You can generate a caption for an image with this model using the code below:

import urllib.request

import matplotlib.pyplot as plt
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel


def show_img(image):
    # Display the image without axes.
    plt.axis("off")
    plt.imshow(image)
    plt.show()


# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pass the URL of any image to generate a caption for it.
urllib.request.urlretrieve("https://images.unsplash.com/photo-1628191011227-522c7c3f0af9?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=870&q=80", "sample.png")
# ViT expects three-channel input, so convert the image to RGB.
image = Image.open("sample.png").convert("RGB")

# Load the fine-tuned model for inference.
model_checkpoint = 'MahsaShahidi/Persian-Image-Captioning'
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

# Load the image preprocessor and the tokenizer.
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-fa-base-uncased-clf-persiannews')

# Preprocess the image, generate caption token ids, and decode them to text.
sample = feature_extractor(image, return_tensors="pt").pixel_values.to(device)
caption_ids = model.generate(sample, max_length=30)[0]
caption_text = tokenizer.decode(caption_ids, skip_special_tokens=True)
print(caption_text)
show_img(image)
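To caption several local images at once, the same steps can be wrapped in a small helper. The sketch below is an illustration only: batched and caption_images are hypothetical names, the batch size and the use of beam search (num_beams=4) are assumptions rather than settings taken from this repository, and model, feature_extractor, and tokenizer are the objects loaded above.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def caption_images(paths, model, feature_extractor, tokenizer,
                   device="cpu", batch_size=8, num_beams=4, max_length=30) -> List[str]:
    """Generate one caption per image path (hypothetical helper)."""
    # Imported here so `batched` stays usable without the heavy dependencies.
    import torch
    from PIL import Image

    model.eval()
    captions: List[str] = []
    for batch_paths in batched(paths, batch_size):
        images = []
        for path in batch_paths:
            with Image.open(path) as img:
                images.append(img.convert("RGB"))  # ViT expects three channels
        pixel_values = feature_extractor(images, return_tensors="pt").pixel_values.to(device)
        with torch.no_grad():  # no gradients are needed for inference
            ids = model.generate(pixel_values, max_length=max_length, num_beams=num_beams)
        captions.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return captions
```

A call such as caption_images(["a.png", "b.png"], model, feature_extractor, tokenizer, device=device) would return a list with one caption string per file.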

Inference

The following three captions were generated for free stock photos after 2 epochs of training.

Image	Caption
	Generated Caption: زنی در آشپزخانه در حال اماده کردن غذا است.
	Generated Caption: گروهی از مردم در حال پرواز بادبادک در یک زمین چمنزار.
	Generated Caption: مردی در ماشین نشسته و به ماشین نگاه می کند.

Credits

A huge thanks to Kaggle for providing free GPU access, and to the creators of Hugging Face, ViT, and ParsBERT!

References

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Owner

Hamtech-ai
