Scalable training for dense retrieval models.

Last update: Dec 28, 2022

Related tags

Overview

Scalable implementation of dense retrieval.

Training on cluster

By default it trains locally:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py trainer.gpus=1

SLURM Training

To train the model on SLURM, run:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m trainer=slurm trainer.num_nodes=2 trainer.gpus=2

Reproduce DPR on 8 gpus

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m --config-name nq.yaml  +hydra.launcher.name=dpr_stl_nq_reproduce

Generate embeddings on Wikipedia

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_embeddings.py -m --config-name nq.yaml datamodule=generate datamodule.test_path=psgs_w100.tsv +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH>

Get retrieval results

Currently this runs on 1 GPU. Use CTX_EMBEDDINGS_DIR from above.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/run_retrieval.py --config-name nq.yaml trainer=gpu_1_host trainer.gpus=1 +task.output_path=<PATH_TO_OUTPUT_JSON> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.passages=psgs_w100.tsv datamodule.test_path=<PATH_TO_QUERIES_JSONL>

Generate query embeddings

Alternatively, query embedding generation and retrieval can be separated. After query embeddings are generated using the following command, the run_retrieval_fb.py or run_retrieval_multiset.py script can be used to perform retrieval.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_query_embeddings.py -m --config-name nq.yaml trainer.gpus=1 datamodule.test_path=<PATH_TO_QUERIES_JSONL> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.query_emb_output_path=<OUTPUT_TO_QUERY_EMB>

Get evaluation metrics for a given JSON output file

python dpr_scale/eval_dpr.py --retrieval <PATH_TO_OUTPUT_JSON> --topk 1 5 10 20 50 100

Get evaluation metrics for MSMARCO

python dpr_scale/msmarco_eval.py ~data/msmarco/qrels.dev.small.tsv PATH_TO_OUTPUT_JSON

Domain-matched Pre-training Tasks for Dense Retrieval

Paper: https://arxiv.org/abs/2107.13602

The sections below provide links to datasets and pretrained models, as well as, instructions to prepare datasets, pretrain and fine-tune them.

Q&A Datasets

PAQ

Download the dataset from here

Conversational Datasets

You can download the dataset from the respective tables.

File	Download Link
train	download
dev	download

ConvAI2

File	Download Link
train	download
dev	download

DSTC7

File	Download Link
train	download
dev	download
test	download

Prepare by downloading the tar ball linked here, and using the command below.

DSTC7_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset dstc7 \
    --in_file_path $DSTC7_DATA_ROOT/ubuntu_train_subtask_1_augmented.json \
    --out_file_path $DSTC7_DATA_ROOT/ubuntu_train.jsonl

Ubuntu V2

File	Download Link
train	download
dev	download
test	download

Prepare by downloading the tar ball linked here, and using the command below.

UBUNTUV2_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset ubuntu2 \
    --in_file_path $UBUNTUV2_DATA_ROOT/train.csv \
    --out_file_path $UBUNTUV2_DATA_ROOT/train.jsonl

Pretraining DPR

Pretrained Checkpoints

Pretrained Model	Dataset	Download Link
BERT-base	PAQ	download
BERT-large	PAQ	download
BERT-base	Reddit	download
BERT-large	Reddit	download
RoBERTa-base	Reddit	download
RoBERTa-large	Reddit	download

Pretraining on PAQ dataset

DPR_ROOT=<path_of_your_repo's_root>
MODEL="bert-large-uncased"
NODES=8
BSZ=16
MAX_EPOCHS=20
LR=1e-5
TIMOUT_MINS=4320
EXP_DIR=<path_of_the_experiment_dir>
TRAIN_PATH=<path_of_the_training_data_file>
mkdir -p ${EXP_DIR}/logs
PYTHONPATH=$DPR_ROOT python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.timeout_min=$TIMOUT_MINS \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.train_path=$TRAIN_PATH \
    datamodule.batch_size=${BSZ} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=10 \
    datamodule.num_test_negative=50 > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Pretraining on Reddit dataset

# Use a batch size of 16 for BERT and RoBERTa base models.
BSZ=4
NODES=8
MAX_EPOCHS=5
WARMUP_STEPS=10000
LR=1e-5
MODEL="roberta-large"
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name reddit.yaml \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    task.warmup_steps=${WARMUP_STEPS} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tuning DPR on downstream tasks/datasets

Fine-tune the pretrained PAQ checkpoint

# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
# Use a batch size of 32 for BERT and RoBERTa base models.
BSZ=12
MODEL="bert-large-uncased"
MAX_EPOCHS=40
WARMUP_STEPS=1000
NODES=1
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.name=${NAME} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=25 \
    datamodule.num_test_negative=50 \
    +trainer.val_check_interval=150 \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.model.model_path=${MODEL} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tune the pretrained Reddit checkpoint

Batch sizes that worked on Volta 32GB GPUs for respective model and datasets.

Model	Dataset	Batch Size
BERT/RoBERTa base	ConvAI2	64
RBERT/RoBERTa base	ConvAI2	16
BERT/RoBERTa base	DSTC7	24
BERT/RoBERTa base	DSTC7	8
BERT/RoBERTa base	Ubuntu V2	64
BERT/RoBERTa large	Ubuntu V2	16

# Change the config file name to convai2.yaml or dstc7.yaml for the respective datasets.
CONFIG_FILE_NAME=ubuntuv2.yaml
# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
BSZ=16
NODES=1
MAX_EPOCHS=5
WARMUP_STEPS=10000
MODEL="roberta-large"
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=${DPR_ROOT} python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir=${DPR_ROOT}/dpr_scale/conf \
    --config-name=$CONFIG_FILE_NAME \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    +trainer.val_check_interval=150 \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.model.model_path=$MODEL \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

License

dpr-scale is CC-BY-NC 4.0 licensed as of now.

Scalable training for dense retrieval models.

Related tags

Overview

Training on cluster

SLURM Training

Reproduce DPR on 8 gpus

Generate embeddings on Wikipedia

Get retrieval results

Generate query embeddings

Get evaluation metrics for a given JSON output file

Get evaluation metrics for MSMARCO

Domain-matched Pre-training Tasks for Dense Retrieval

Q&A Datasets

PAQ

Conversational Datasets

Reddit

ConvAI2

DSTC7

Ubuntu V2

Pretraining DPR

Pretrained Checkpoints

Pretraining on PAQ dataset

Pretraining on Reddit dataset

Fine-tuning DPR on downstream tasks/datasets

Fine-tune the pretrained PAQ checkpoint

Fine-tune the pretrained Reddit checkpoint

License

Owner

Facebook Research

Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

Top #1 Submission code for the first https://alphamev.ai MEV competition with best AUC (0.9893) and MSE (0.0982).

Autonomous racing with the Anki Overdrive

Supervised domain-agnostic prediction framework for probabilistic modelling

[ACM MM 2021] TSA-Net: Tube Self-Attention Network for Action Quality Assessment

Adaptive Attention Span for Reinforcement Learning

No-reference Image Quality Assessment(NIQA) Algorithms (BRISQUE, NIQE, PIQE, RankIQA, MetaIQA)

A minimalist tool to display a network graph.

Cancer-and-Tumor-Detection-Using-Inception-model - In this repo i am gonna show you how i did cancer/tumor detection in lungs using deep neural networks, specifically here the Inception model by google.

Computer Vision Paper Reviews with Key Summary of paper, End to End Code Practice and Jupyter Notebook converted papers

Real-time ground filtering algorithm of cloud points acquired using Terrestrial Laser Scanner (TLS)

ICLR 2021, Fair Mixup: Fairness via Interpolation

A simple consistency training framework for semi-supervised image semantic segmentation

Easy-to-use library to boost AI inference leveraging state-of-the-art optimization techniques.

NumPy로 구현한 딥러닝 라이브러리입니다. (자동 미분 지원)

Python 3 module to print out long strings of text with intervals of time inbetween

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

Explore extreme compression for pre-trained language models

Demonstrates how to divide a DL model into multiple IR model files (division) and introduce a simplest way to implement a custom layer works with OpenVINO IR models.

An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.