(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Last update: Dec 04, 2022

Overview

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Mingchen Zhuge*, Dehong Gao*, Deng-Ping Fan#, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, Ling Shao.

[Paper][Chinese version][Video][Poster][MSRA_Slide][News1][News2][MSRA_Talking][机器之心_Talking]

Introduction

We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework. It attains state-of-the-art results by large margins on four downstream tasks: text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute improvement), and fashion captioning (Bleu4: 1.2 absolute improvement). We validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites, demonstrating its broader potential in real-world applications.
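
To make the idea of "patches of different scale" concrete, here is a minimal sketch, not taken from the released code. It assumes the image is cut into square grids from 1x1 up to 5x5 (the grid sizes are an assumption for illustration), and it shows a 90-degree rotation as one example of a patch transformation that a self-supervised task could ask the model to recognize. The names `kaleido_boxes` and `rotate_patch_90` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def kaleido_boxes(width: int, height: int, max_scale: int = 5) -> List[List[Box]]:
    """Return one list of patch boxes per scale: a 1x1 grid, a 2x2 grid, and so on.

    max_scale=5 is an assumed default for illustration only.
    """
    scales = []
    for n in range(1, max_scale + 1):
        boxes = []
        for row in range(n):
            for col in range(n):
                # Integer division keeps the boxes contiguous and covers the full image.
                left = col * width // n
                top = row * height // n
                right = (col + 1) * width // n
                bottom = (row + 1) * height // n
                boxes.append((left, top, right, bottom))
        scales.append(boxes)
    return scales


def rotate_patch_90(patch: List[List[int]]) -> List[List[int]]:
    """Rotate a 2-D patch (a list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


if __name__ == "__main__":
    for n, boxes in enumerate(kaleido_boxes(224, 224), start=1):
        print(f"{n}x{n} grid: {len(boxes)} patches")
```

Under these assumptions, each coarser grid gives the model larger regions and each finer grid gives it smaller ones, so different pre-training tasks can operate at different granularities.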

Notes

The code will be released on 2021/4/16.
This is the TensorFlow implementation built on Alibaba/EasyTransfer. We will also release a PyTorch version built on Huggingface/Transformers in the future.
If you have difficulty downloading these datasets, modify /dataset/get_pretrain_data.sh, /dataset/get_finetune_data.sh, and /dataset/get_retrieve_data.sh, and comment out any wget #file_links you do not need. This will not affect the steps that follow.
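
Commenting out individual download lines can also be done programmatically. The following is a minimal sketch under the assumption that each download in the script is a single line starting with `wget`. The function name `comment_out_downloads` and the example URLs are hypothetical and do not come from the repository.

```python
from typing import Iterable


def comment_out_downloads(script_text: str, skip_keywords: Iterable[str]) -> str:
    """Prefix '# ' to every wget line that mentions one of the given keywords."""
    keywords = list(skip_keywords)
    out = []
    for line in script_text.splitlines():
        is_wget = line.lstrip().startswith("wget ")
        if is_wget and any(keyword in line for keyword in keywords):
            out.append("# " + line)  # disable this download
        else:
            out.append(line)  # leave everything else untouched
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical script content; the real file links differ.
    example = (
        "wget https://example.com/train.zip\n"
        "wget https://example.com/valid.zip\n"
    )
    print(comment_out_downloads(example, ["train"]), end="")
```

To apply it to a real script, read the file, pass its text through the function, and write the result back. Keep a backup copy first.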

Get started

Clone this code

git clone git@github.com:mczhuge/Kaleido-BERT.git
cd Kaleido-BERT

Environment setup (details can be found in conda_env.info)

conda create  --name kaleidobert --file conda_env.info
conda activate kaleidobert
conda install tensorflow==1.15.0
pip install boto3 tqdm tensorflow_datasets --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install sentencepiece==0.1.92 sklearn --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install joblib==0.14.1
python setup.py develop

Download pretrained dependencies

cd Kaleido-BERT/scripts/checkpoint
sh get_checkpoint.sh

Finetune

#Download finetune datasets

cd Kaleido-BERT/scripts/dataset
sh get_finetune_dataset.sh
sh get_retrieve_dataset.sh

#Testing CAT/SUB

cd Kaleido-BERT/scripts
sh run_cat.sh
sh run_subcat.sh

#Testing TIR/ITR

cd Kaleido-BERT/scripts
sh run_i2t.sh
sh run_t2i.sh
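
Retrieval quality for tasks like these is commonly summarized with Rank@K: the fraction of queries whose correct match appears among the K highest-scoring candidates. The sketch below illustrates that metric only. It is not the repository's evaluation code, and the function name `rank_at_k` and the score layout (one row of candidate scores per query) are assumptions.

```python
from typing import Sequence


def rank_at_k(scores: Sequence[Sequence[float]], gold: Sequence[int], k: int = 1) -> float:
    """Fraction of queries whose gold candidate is among the k highest-scoring candidates.

    scores[q][c] is the matching score of candidate c for query q;
    gold[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for query_scores, gold_index in zip(scores, gold):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(query_scores)), key=lambda i: query_scores[i], reverse=True)
        if gold_index in ranked[:k]:
            hits += 1
    return hits / len(gold)


if __name__ == "__main__":
    toy_scores = [
        [0.9, 0.1, 0.3],
        [0.2, 0.8, 0.5],
        [0.4, 0.6, 0.7],
    ]
    toy_gold = [0, 2, 2]
    print("Rank@1:", rank_at_k(toy_scores, toy_gold, k=1))
    print("Rank@2:", rank_at_k(toy_scores, toy_gold, k=2))
```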

Pre-training

#Download pre-training datasets

cd Kaleido-BERT/scripts/dataset
sh get_prtrain_dataset.sh

#Remove the existing checkpoint
rm -rf Kaleido-BERT/checkpoint/pretrained

#Run pre-training
cd Kaleido-BERT/scripts/
sh run_pretrain.sh

Acknowledgement

Thanks to the Alibaba ICBU Search Team and the Alibaba PAI Team for technical support.

Citing Kaleido-BERT

@inproceedings{Zhuge2021KaleidoBERT,
  title={Kaleido-BERT: Vision-Language Pre-training on Fashion Domain},
  author={Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={},
  year={2021}
}

Contact

Mingchen Zhuge (email: [email protected] | wechat: tjpxiaoming)
Deng-Ping Fan (email: [email protected])
Dehong Gao (email: [email protected])

Feel free to contact us if you have additional questions.
