[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Last update: Dec 26, 2022

Overview

[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Official Pytorch implementation of Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (AAAI 2022).

Paper is at https://arxiv.org/pdf/2109.04872.pdf.

Paper explanation in Zhihu (in Chinese) is at https://zhuanlan.zhihu.com/p/446203594.

Abstract

Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with the research focus on designing complicated prediction heads or fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN), to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that our MMN achieves highly competitive performance compared with the state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winner solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.

Updates

Dec, 2021 - We uploaded the code and trained weights for Charades-STA, ActivityNet-Captions and TACoS datasets.

Todo: The code for spatio-temporal video grounding (HC-STVG dataset) will be available soon.

Datasets

Download the video feature and the groundtruth provided by 2D-TAN.
Extract and put them in a dataset folder in the same directory as train_net.py. For configurations of feature/groundtruth's paths, please refer to ./mmn/config/paths_catalog.py. (ann_file is annotation, feat_file is the video feature)

Dependencies

Our code is developed on the third-party implementation of 2D-TAN, so we have similar dependencies with it, such as:

yacs h5py terminaltables tqdm pytorch transformers

Quick Start

We provide scripts for simplifying training and inference. For training our model, we provide a script for each dataset (e.g., ./scripts/tacos_train.sh). For evaluating the performance, we provide ./scripts/eval.sh.

For example, for training model in TACoS dataset in tacos_train.sh, we need to select the right config in config and decide the GPU by yourself in gpus (gpu id in your server) and gpun (total number of gpus).

# find all configs in configs/
config=pool_tacos_128x128_k5l8
# set your gpu id
gpus=0,1
# number of gpus
gpun=2
# please modify it with different value (e.g., 127.0.0.2, 29502) when you run multi mmn task on the same machine
master_addr=127.0.0.3
master_port=29511

Similarly, to evaluate the model, just change the information in eval.sh. Our trained weights for three datasets are in the Google Drive.

Citation

If you find our code useful, please generously cite our paper. (AAAI version bibtex will be updated later)

@article{DBLP:journals/corr/abs-2109-04872,
  author    = {Zhenzhi Wang and
               Limin Wang and
               Tao Wu and
               Tianhao Li and
               Gangshan Wu},
  title     = {Negative Sample Matters: {A} Renaissance of Metric Learning for Temporal
               Grounding},
  journal   = {CoRR},
  volume    = {abs/2109.04872},
  year      = {2021}
}

Contact

For any question, please raise an issue (preferred) or contact

Zhenzhi Wang: [email protected]

Acknowledgement

We appreciate 2D-TAN for video feature and configurations, and the third-party implementation of 2D-TAN for its implementation with DistributedDataParallel. Disclaimer: the performance gain of this third-party implementation is due to a tiny mistake of adding val set into training, yet our reproduced result is similar to the reported result in 2D-TAN paper.

[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Related tags

Overview

[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Abstract

Updates

Datasets

Dependencies

Quick Start

Citation

Contact

Acknowledgement

Owner

Multimedia Computing Group, Nanjing University

Generative Modelling of BRDF Textures from Flash Images [SIGGRAPH Asia, 2021]

Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification

Joint project of the duo Hacker Ninjas

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Official Pytorch implementation of C3-GAN

Code for our paper: Online Variational Filtering and Parameter Learning

The Official PyTorch Implementation of DiscoBox.

Multi-robot collaborative exploration and mapping through Voronoi partition and DRL in unknown environment

This is an open solution to the Home Credit Default Risk challenge 🏡

Code for our NeurIPS 2021 paper: Sparsely Changing Latent States for Prediction and Planning in Partially Observable Domains

HyperDict - Self linked dictionary in Python

This repo includes the CUB-GHA (Gaze-based Human Attention) dataset and code of the paper "Human Attention in Fine-grained Classification".

A NSFW content filter.

Simple Tensorflow implementation of Toward Spatially Unbiased Generative Models (ICCV 2021)

ICRA 2021 - Robust Place Recognition using an Imaging Lidar

KoRean based ELECTRA pre-trained models (KR-ELECTRA) for Tensorflow and PyTorch

Model Quantization Benchmark

This is the official source code of "BiCAT: Bi-Chronological Augmentation of Transformer for Sequential Recommendation".

Tutorial: Introduction to Graph Machine Learning, with Jupyter notebooks

PyTorch implementation of "A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."