MERLOT: Multimodal Neural Script Knowledge Models

Last update: Dec 22, 2022

MERLOT is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.

Visit our project page at rowanzellers.com/merlot, or read the full paper to learn more.

What's here

We are releasing the following:

Code for the MERLOT model (in model/, with data processing in data/)
Code for running MERLOT over visual story ordering.

We plan to release:

Information about the videos used in this work
Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is ongoing. We hope to make it easier to adapt MERLOT to other tasks; please follow along if interested!

Environment and setup

There are three different ways of running MERLOT right now:

Pretraining on videos: This requires a TPU pod.
Finetuning on downstream tasks: We did this on TPU v3-8 machines. In theory you can do this on GPUs, but that isn't tested or officially supported right now.
Zero-shot visual story ordering: I have code for this on a TPU, but you should be able to do this on a GPU too.

conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
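
After installing, a quick sanity check can confirm that the interpreter and packages line up with the commands above. The following is a minimal sketch and is not part of the MERLOT repository: the helper names find_missing and python_matches are made up for illustration, and the module list is just the import names for some of the packages installed above.

```python
import importlib.util
import sys

# Import names for some of the packages installed above. A few differ from
# their pip/conda names: pyyaml -> yaml, opencv-python-headless -> cv2,
# Pillow -> PIL.
REQUIRED_MODULES = [
    "tensorflow", "numpy", "yaml", "scipy", "h5py",
    "pandas", "regex", "cv2", "PIL", "tqdm",
]


def find_missing(modules):
    """Return the module names that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_matches(major, minor, version_info=sys.version_info):
    """Check whether the interpreter is the given major.minor version."""
    return (version_info[0], version_info[1]) == (major, minor)


if not python_matches(3, 7):
    print("Warning: the setup above targets Python 3.7, found",
          ".".join(str(part) for part in sys.version_info[:3]))

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found.")
```

This only checks that the modules can be found, not that their versions match the pins above (for example tensorflow 1.15.5 or numpy 1.17.0).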

Pretraining from scratch

This requires a large TPU pod for data-parallelism.

First, you'll need to get a bunch of training data in "tfrecord" format -- see data processing in data/ for that. You'll then need to adjust the configuration of model/configs/merlot.yaml accordingly. You'll also need to add in your output path (where you want your newly pretrained model to be saved).
Next, in the model directory, run python train.py configs/merlot.yaml

Finetuning on downstream tasks

We used the configuration model/configs/merlot.yaml and the checkpoint at gs://merlot/checkpoint_4segments/ for downstream task finetuning. This is slightly different from the checkpoint we used for story unshuffling (which we had to adapt to account for the 5 frame-caption segments for that task), but both should work.
Actual finetuning code TBD. For now, you create a MerlotModel (see model/modeling.py), set up your finetuning task (usually involving an additional output layer), and finetune.
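
As a rough illustration of what "an additional output layer" can look like, here is a minimal NumPy sketch that fits a linear softmax classifier on top of fixed feature vectors. It does not call MerlotModel or use its actual interface: the features are random stand-ins for whatever pooled representation the encoder would produce, and train_head, the hidden size, and the toy labels are all assumptions made for this example.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct class.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


def train_head(features, labels, num_classes, lr=0.5, steps=200, seed=0):
    """Fit a linear softmax head on fixed features with gradient descent."""
    rng = np.random.RandomState(seed)
    n, hidden = features.shape
    W = rng.normal(scale=0.02, size=(hidden, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        probs = softmax(features @ W + b)
        losses.append(cross_entropy(probs, labels))
        # Gradient of the mean cross-entropy with respect to the logits.
        grad_logits = (probs - onehot) / n
        W -= lr * (features.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b, losses


# Stand-in data: 64 examples with 16-dimensional "encoder" features
# (hypothetical sizes) and a toy binary label derived from the first feature.
data_rng = np.random.RandomState(0)
features = data_rng.normal(size=(64, 16))
labels = (features[:, 0] > 0).astype(int)

W, b, losses = train_head(features, labels, num_classes=2)
print("loss before: %.3f, after: %.3f" % (losses[0], losses[-1]))
```

In actual finetuning you would build the head in TensorFlow on top of the model's output and typically update the encoder weights as well, rather than training on frozen features as this sketch does.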

Bibtex

@article{zellersluhessel2021merlot,
    title={MERLOT: Multimodal Neural Script Knowledge Models},
    author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
    journal={arXiv preprint arXiv:2106.02636},
    year={2021}
}

Owner

Rowan Zellers
