X-VLM: Multi-Grained Vision Language Pre-Training

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

  • Jan 2022: release official PyTorch implementation and X-VLM-base checkpoints
  • Dec 2021: X-VLM-base (4M) achieves new SoTA
  • Nov 2021: release preprint in arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

  • Support several backbones
    • vision encoder: deit / clip-vit / swin-transformer
    • text encoder: bert / roberta
  • Support apex O1 / O2 for pre-training
  • Read from and write to HDFS
  • Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

  • Install python3 environment
pip3 install -r requirements.txt
  • Download raw images from corresponding websites
  • Download the json files we provided, which contains image read paths and captions and/or bbox annotations
  • If running pre-training scripts:
  • Organize these files like this (% is for pre-training only):
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts. All these will be released in Vision-Language-Data in March. Please feel free to prepare your own datasets by referring the code in dataset/pretrain_dataset.py.

Checkpoints

X-VLM-base (4M)
X-VLM-base 14M, WIP
X-VLM-large 14M, WIP

Finetune

2 nodes for fine-tuning, specify --output_hdfs to save some tmp results. # evaluate python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th" ">
# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th" 

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We set some python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF. We thank the author for opening source their code.

Data

download json files

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that fine-tuning configs are given in "X-VLM/configs/*.yaml"

Citation

If you use this code, please considering citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.

Owner
Yan Zeng
Yan Zeng
OOD Dataset Curator and Benchmark for AI-aided Drug Discovery

🔥 DrugOOD 🔥 : OOD Dataset Curator and Benchmark for AI Aided Drug Discovery This is the official implementation of the DrugOOD project, this is the

108 Dec 17, 2022
An implementation of Geoffrey Hinton's paper "How to represent part-whole hierarchies in a neural network" in Pytorch.

GLOM An implementation of Geoffrey Hinton's paper "How to represent part-whole hierarchies in a neural network" for MNIST Dataset. To understand this

50 Oct 19, 2022
A python library for time-series smoothing and outlier detection in a vectorized way.

tsmoothie A python library for time-series smoothing and outlier detection in a vectorized way. Overview tsmoothie computes, in a fast and efficient w

Marco Cerliani 517 Dec 28, 2022
DecoupledNet is semantic segmentation system which using heterogeneous annotations

DecoupledNet: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation Created by Seunghoon Hong, Hyeonwoo Noh and Bohyung Han at POSTE

Hyeonwoo Noh 74 Sep 22, 2021
Code for paper: "Spinning Language Models for Propaganda-As-A-Service"

Spinning Language Models for Propaganda-As-A-Service This is the source code for the Arxiv version of the paper. You can use this Google Colab to expl

Eugene Bagdasaryan 16 Jan 03, 2023
A pytorch implementation of faster RCNN detection framework (Use detectron2, it's a masterpiece)

Notice(2019.11.2) This repo was built back two years ago when there were no pytorch detection implementation that can achieve reasonable performance.

Ruotian(RT) Luo 1.8k Jan 01, 2023
Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt)

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt) Task Training huge unsupervised deep neural networks yields to strong progress in

Oliver Hahn 1 Jan 26, 2022
Unified learning approach for egocentric hand gesture recognition and fingertip detection

Unified Gesture Recognition and Fingertip Detection A unified convolutional neural network (CNN) algorithm for both hand gesture recognition and finge

Mohammad 227 Dec 25, 2022
Explainable Medical ImageSegmentation via GenerativeAdversarial Networks andLayer-wise Relevance Propagation

MedAI: Transparency in Medical Image Segmentation What is this repo This repo contains the code and experiments that are implemented to contribute in

Awadelrahman M. A. Ahmed 1 Nov 22, 2021
Python package to add text to images, textures and different backgrounds

nider Python package for text images generation and watermarking Free software: MIT license Documentation: https://nider.readthedocs.io. nider is an a

Vladyslav Ovchynnykov 131 Dec 30, 2022
GRF: Learning a General Radiance Field for 3D Representation and Rendering

GRF: Learning a General Radiance Field for 3D Representation and Rendering [Paper] [Video] GRF: Learning a General Radiance Field for 3D Representatio

Alex Trevithick 243 Dec 29, 2022
PyBullet CartPole and Quadrotor environments—with CasADi symbolic a priori dynamics—for learning-based control and reinforcement learning

safe-control-gym Physics-based CartPole and Quadrotor Gym environments (using PyBullet) with symbolic a priori dynamics (using CasADi) for learning-ba

Dynamic Systems Lab 300 Dec 28, 2022
Official code release for 3DV 2021 paper Human Performance Capture from Monocular Video in the Wild.

Official code release for 3DV 2021 paper Human Performance Capture from Monocular Video in the Wild.

Chen Guo 58 Dec 24, 2022
Recommendation algorithms for large graphs

Fast recommendation algorithms for large graphs based on link analysis. License: Apache Software License Author: Emmanouil (Manios) Krasanakis Depende

Multimedia Knowledge and Social Analytics Lab 27 Jan 07, 2023
A Deep learning based streamlit web app which can tell with which bollywood celebrity your face resembles.

Project Name: Which Bollywood Celebrity You look like A Deep learning based streamlit web app which can tell with which bollywood celebrity your face

BAPPY AHMED 20 Dec 28, 2021
YOLOv5 detection interface - PyQt5 implementation

所有代码已上传,直接clone后,运行yolo_win.py即可开启界面。 2021/9/29:加入置信度选择 界面是在ultralytics的yolov5基础上建立的,界面使用pyqt5实现,内容较简单,娱乐而已。 功能: 模型选择 本地文件选择(视频图片均可) 开关摄像头

487 Dec 27, 2022
This program presents convolutional kernel density estimation, a method used to detect intercritical epilpetic spikes (IEDs)

Description This program presents convolutional kernel density estimation, a method used to detect intercritical epilpetic spikes (IEDs) in [Gardy et

Ludovic Gardy 0 Feb 09, 2022
Kaggle competition: Springleaf Marketing Response

PruebaEnel Prueba Kaggle-Springleaf-master Prueba Kaggle-Springleaf Kaggle competition: Springleaf Marketing Response Competencia de Kaggle: Marketing

1 Feb 09, 2022
Multiview 3D object detection on MultiviewC dataset through moft3d.

Voxelized 3D Feature Aggregation for Multiview Detection [arXiv] Multiview 3D object detection on MultiviewC dataset through VFA. Introduction We prop

Jiahao Ma 20 Dec 21, 2022
Code to train models from "Paraphrastic Representations at Scale".

Paraphrastic Representations at Scale Code to train models from "Paraphrastic Representations at Scale". The code is written in Python 3.7 and require

John Wieting 71 Dec 19, 2022