Multilingual-CLIP

OpenAI CLIP text encoders for any language

Colab Notebook · Pre-trained Models · Report Bug


Overview


OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder, trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.

We propose a fine-tuning method that replaces the original English text encoder with a pre-trained text model in any language. This method makes it possible to adapt the powerful CLIP model to any language in roughly 24 GPU hours.

This repository contains:

  • PyTorch inference code
  • TensorFlow training code
  • Pre-trained CLIP text encoders for multiple languages
  • Training data and pre-computed CLIP text encodings for a large portion of the image captions of GCC + MSCOCO + VizWiz

Requirements

While other versions may work equally well, we have worked with the following:

  • Python = 3.6.9
  • Transformers = 4.1.1
  • Model Weights

Usage

Download CLIP Model
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version for your machine, or with cpuonly when installing on a machine without a GPU. For more information, please see the official CLIP repository.
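
As a quick sanity check that the CLIP install worked, a sketch like the following should run (RN50x4 is one of the officially released CLIP models):

import torch
import clip

# List the officially released CLIP models and load one of them
print(clip.available_models())
model, preprocess = clip.load('RN50x4', device='cuda' if torch.cuda.is_available() else 'cpu')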

Download Linear Weights
# Linear Model Weights
$ bash get-weights.sh

Inference

from src import multilingual_clip

print(multilingual_clip.AVAILABLE_MODELS.keys())

model = multilingual_clip.load_model('M-BERT-Distil-40')

embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])

For a more elaborate example comparing the textual embeddings to the CLIP image embeddings, see this Colab notebook.
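
As a quick, self-contained alternative, the snippet below is a minimal sketch (not taken from the notebook) of how these text embeddings can be matched against CLIP image embeddings. The image path 'dog.jpg' and the example captions are placeholders; RN50x4 is used because its 640-dimensional image embeddings match the output size of M-BERT-Distil-40.

import clip
import torch
from PIL import Image
from src import multilingual_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# RN50x4 produces 640-dim image embeddings, matching M-BERT-Distil-40's text head
clip_model, preprocess = clip.load('RN50x4', device=device)
text_model = multilingual_clip.load_model('M-BERT-Distil-40')

image = preprocess(Image.open('dog.jpg')).unsqueeze(0).to(device)     # placeholder image
captions = ['En hund som leker i gräset', 'Ein Flugzeug am Himmel']   # placeholder captions

with torch.no_grad():
    image_emb = clip_model.encode_image(image).float().cpu()
    text_emb = text_model(captions)

# Normalize and take dot products to get cosine similarities per caption
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)

The highest-scoring caption is the best match for the image, regardless of which language the caption is written in.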

Pre-trained Models

Every text encoder is a transformer available on Hugging Face, with an additional linear layer on top. None of the models have been extensively tested, but for more information and qualitative test results for a specific model, click the model name to see its model card.

*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. ***

Name                 Model Base      Vision Model   Pre-trained Languages   Target Languages   #Parameters
Multilingual
M-BERT Distil 40     M-BERT Distil   RN50x4         101 Languages           40 Languages       66 M
M-BERT Base 69       M-BERT Base     RN50x4         101 Languages           68 Languages       110 M
M-BERT Base ViT-B    M-BERT Base     ViT-B/32       101 Languages           68 Languages       110 M
Monolingual
Swe-CLIP 500k        KB-BERT         RN50x4         Swedish                 Swedish            110 M
Swe-CLIP 2M          KB-BERT         RN50x4         Swedish                 Swedish            110 M

Training a new model

This folder contains the code used for training the above models. If you wish to train your own model, you must do the following (a minimal sketch of the teacher-learning step is shown after this list):

  • Prepare a set of translated sentence pairs from English -> Your Language(s)
  • Compute regular CLIP-Text embeddings for the English sentences.
  • Edit Training.py to load your data.
  • Train a new CLIP-Text encoder via Teacher Learning
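
The training code shipped in this repository is written in TensorFlow (see Training.py). Purely as an illustration of the teacher-learning idea, here is a minimal PyTorch sketch in which a multilingual student encoder is trained to reproduce the pre-computed CLIP text embeddings of the English originals; the base model, pooling, loss, and hyperparameters are assumptions, not the repository's exact setup.

import torch
import transformers

# Placeholder student model; any multilingual Hugging Face encoder could be used
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')
student = transformers.AutoModel.from_pretrained('distilbert-base-multilingual-cased')
head = torch.nn.Linear(768, 640)  # project to the CLIP text-embedding size (640 for RN50x4)
optimizer = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), lr=5e-5)

def embed(sentences):
    tok = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    hidden = student(**tok)[0]
    mask = tok['attention_mask'].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over tokens
    return head(pooled)

# translated_batch: sentences in the target language(s)
# teacher_batch: pre-computed CLIP text embeddings of the English originals, shape [B, 640]
def training_step(translated_batch, teacher_batch):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(embed(translated_batch), teacher_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

Because the target is simply the teacher's CLIP text embedding, only translated sentence pairs and the pre-computed English embeddings are needed; no images are used during this stage.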

Pre-computed CLIP Embeddings & Translation Data

This Google Drive folder contains pre-computed CLIP-Text embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.

The Google Drive folder also contains the translation data used to train the currently available models. Good luck!

Contribution

If you have trained a CLIP text encoder for your language, or another model covering a language not supported here, please feel free to contact us and we will either upload your model and credit you, or simply link to your already uploaded model.

Contact

If you have questions regarding the code or anything else related to this GitHub page, please open an issue.

For other purposes, feel free to contact me directly at: [email protected]

Acknowledgements

License

Distributed under the MIT License. See LICENSE for more information.

Comments
  • 1024 dim embedding model needed

    Dear authors, thanks for your great work! I'm working with AudioCLIP, whose embeddings are 1024-dimensional, but the models you have released produce at most 768 dimensions. Could you please release a model that produces 1024-dimensional embeddings? Here is AudioCLIP: https://github.com/AndreyGuzhov/AudioCLIP. With my best wishes, looking forward to hearing from you!

    opened by ithanwu 4
  • Release 1.0.0

    Merge this only after doing the following:

    When you create a "Release x.y.z" it will publish a release.

    You need to add a secret called PYPI_PASSWORD in the GitHub repo and put inside it a token you create at https://pypi.org/manage/account/token/

    https://github.com/FreddeFrallan/Multilingual-CLIP/settings/secrets/actions/new

    Choose the option "Squash and merge" on GitHub when merging, to create a single commit.

    opened by rom1504 4
  • Bibtex Citation

    Amazing repo! I'd love to cite it. Do you have a desired BibTeX by chance?

    Perhaps:

    @misc{multilingual-clip,
      author = {Carlsson, Fredrik},
      title = {Multilingual CLIP},
      year = 2021,
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/SajjjadAyobi/CLIPfa}},
    }

    opened by Zasder3 1
  • Training a model for ViT-L/14 image embeddings

    Hey, thanks for providing this awesome multilingual CLIP-aligned text encoder. We used it to filter the 3 billion (image, text) pairs of laion5B https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ and it worked well. I'm also using this model to provide multilingual search in https://rom1504.github.io/clip-retrieval/. For laion400m we used the ViT-B/32 model from OpenAI to produce the index, but for laion5B we went with ViT-L/14, which is much more powerful. To provide the same multilingual search feature, it would be really helpful to have a multilingual text encoder aligned with CLIP ViT-L/14.

    Would you advise running https://github.com/FreddeFrallan/Multilingual-CLIP#training-a-new-model (and now that I'm writing it, I guess I could use a subset of the multilingual set of laion5B for this) to align such a text encoder?

    opened by rom1504 1
  • XLM-Roberta Feature Request

    Hi,

    Great repo! Are you planning to release a model with XLM-R (with ViT-B) anytime soon? It was better for low-resource languages than multilingual BERT.

    opened by mezig351 1
  • Redo packaging

    When you create a "Release x.y.z" it will publish a release.

    You need to add a secret called PYPI_PASSWORD in the GitHub repo and put inside it a token you create at https://pypi.org/manage/account/token/

    https://github.com/FreddeFrallan/Multilingual-CLIP/settings/secrets/actions/new

    opened by rom1504 0
  • Data leak

    Hello! According to the XTD-10 repo, the test set contains 800 images from the MSCOCO train set. During training you also use the MSCOCO train set, so it seems there is a data leak. Or maybe I don't understand something.

    opened by kimihailv 1
  • model_type 'M-CLIP' is not in CONFIG_MAPPING

    from transformers import AutoConfig
    
    kwargs = {'_from_auto': True}
    pretrained_model_name_or_path = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    

    Hi, I installed the required transformers==4.8.1 and ran the above code, getting the following error:

        config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
      File "/anaconda3/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 448, in from_pretrained
        config_class = CONFIG_MAPPING[config_dict["model_type"]]
    KeyError: 'M-CLIP'
    

    It seems that model_type 'M-CLIP' is not in CONFIG_MAPPING. Can anyone help figure it out?

    opened by wxywb 1
  • Issue in M-Bert-Base-ViT-B clip head linear layer size

    I tried the following piece of code, present in the repo at https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/src/multilingual_clip.py

    The only change I made is that I added print statements in between.


    import pickle
    import torch
    import transformers

    AVAILABLE_MODELS = {
        'M-BERT-Distil-40': {
            'model_name': 'M-CLIP/M-BERT-Distil-40',
            'tokenizer_name': 'M-CLIP/M-BERT-Distil-40',
            'head_name': 'M-BERT Distil 40 Linear Weights.pkl'
        },
        'M-BERT-Base-69': {
            'model_name': 'M-CLIP/M-BERT-Base-69',
            'tokenizer_name': 'M-CLIP/M-BERT-Base-69',
            'head_name': 'M-BERT-Base-69 Linear Weights.pkl'
        },
        'Swe-CLIP-500k': {
            'model_name': 'M-CLIP/Swedish-500k',
            'tokenizer_name': 'M-CLIP/Swedish-500k',
            'head_name': 'Swedish-500k Linear Weights.pkl'
        },
        'Swe-CLIP-2M': {
            'model_name': 'M-CLIP/Swedish-2M',
            'tokenizer_name': 'M-CLIP/Swedish-2M',
            'head_name': 'Swedish-2M Linear Weights.pkl'
        },
        'M-BERT-Base-ViT-B': {
            'model_name': 'M-CLIP/M-BERT-Base-ViT-B',
            'tokenizer_name': 'M-CLIP/M-BERT-Base-ViT-B',
            'head_name': 'M-BERT-Base-69-ViT Linear Weights.pkl'
        },
    }

    class MultilingualClip2(torch.nn.Module):
        def __init__(self, model_name, tokenizer_name, head_name, weights_dir='data/weights/'):
            super().__init__()
            self.model_name = model_name
            self.tokenizer_name = tokenizer_name
            self.head_path = weights_dir + head_name

            self.tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)
            self.transformer = transformers.AutoModel.from_pretrained(model_name)
            self.clip_head = torch.nn.Linear(in_features=768, out_features=640)
            self._load_head()

        def forward(self, txt):
            # `device` is defined elsewhere in the user's script
            txt_tok = self.tokenizer(txt, padding=True, return_tensors='pt').to(device)
            embs = self.transformer(**txt_tok)[0]
            print('embs_text')
            print(embs.size())
            att = txt_tok['attention_mask']
            print('att_text')
            print(att.size())
            embs = (embs * att.unsqueeze(2)).sum(dim=1) / att.sum(dim=1)[:, None]
            print('embs_text')
            print(embs.size())
            p = self.clip_head(embs)
            print('clip head obj')
            print(self.clip_head)
            print('cliphed_text')
            print(p.size())
            return p

        def _load_head(self):
            with open(self.head_path, 'rb') as f:
                lin_weights = pickle.loads(f.read())
            self.clip_head.weight = torch.nn.Parameter(torch.tensor(lin_weights[0]).float().t())
            self.clip_head.bias = torch.nn.Parameter(torch.tensor(lin_weights[1]).float())
            print('ok')
            print(self.clip_head.weight.size())
            print(self.clip_head.bias.size())

    def load_model2(name):
        config = AVAILABLE_MODELS[name]
        return MultilingualClip2(**config)

    mod = load_model2('M-BERT-Base-ViT-B')
    z = mod(Query[0])  # `Query` is defined elsewhere in the user's script

    Output for this code:

    ok
    torch.Size([512, 768])
    torch.Size([512])
    embs_text
    torch.Size([1, 6, 768])
    att_text
    torch.Size([1, 6])
    embs_text
    torch.Size([1, 768])
    clip head obj
    Linear(in_features=768, out_features=640, bias=True)
    cliphed_text
    torch.Size([1, 512])


    This output suggests that the file 'M-BERT-Base-69-ViT Linear Weights.pkl' does not have size 640 × 768 but rather 512 × 768.

    Is there an issue with the config, then?

    opened by shreyajain4 2
  • Some questions about fine-tuning

    I have fine-tuned the text encoder using 300,000 texts and their embeddings, but I find the results are quite poor. Could you give me some advice on how to improve them?

    opened by Soulscb 1
  • Some confusion about "Pre-trained CLIP-Text encoders for multiple languages"

    If I have <other language text, image, label> pair data, can I directly use 'distilbert-base-multilingual-cased' to pre-train a CLIP model? Why re-train a model from English to other languages?

    opened by moluchase 1