NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

Related tags

Deep LearningTLM
Overview

NLP From Scratch Without Large-Scale Pretraining

This repository contains the code, pre-trained model checkpoints and curated datasets for our paper: NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework.

In our proposed framework, named TLM (task-driven language modeling), instead of training a language model over the entire general corpus and then finetuning it on task data, we first usetask data as queries to retrieve a tiny subset of the general corpus, and then perform joint learning on both the task objective and self-supervised language modeling objective.

Requirements

We implement our models and training loops based on the opensource products from HuggingFace. The core denpencies of this repository are listed in requirements.txt, which can be installed through:

pip install -r requirements.txt

All our experiments are conducted on a node with 8 A100 40GB SXM gpus. Different computational devices may result slightly different results from the reported ones.

Models and Datasets

We release the trained models on 8 tasks with 3 different scales, together with the task datasets and selected external data. Our released model checkpoints, datasets and the performance of each model for each task are listed in the following table.

AGNews Hyp. Help. IMDB ACL. SciERC Chem. RCT
Small 93.74 93.53 70.54 93.08 69.84 80.51 81.99 86.99
Medium 93.96 94.05 70.90 93.97 72.37 81.88 83.24 87.28
Large 94.36 95.16 72.49 95.77 72.19 83.29 85.12 87.50

The released models and datasets are compatible with HuggingFace's Transformers and Datasets. We provide an example script to evaluate a model checkpoints on a certain task, run

bash example_scripts/evaluate.sh

To get the evaluation results for SciERC with a small-scale model.

Training

We provide two example scripts to train a model from scratch, run

bash example_scripts/train.sh && bash example_scripts/finetune.sh

To train a small-scale model for SciERC. Here example_scripts/train.sh corresponds to the first stage training where the external data ratio and MLM weight are non-zero, and example_scripts/finetune.sh corresponds to the second training stage where no external data or self-supervised loss can be perceived by the model.

Citation

Please cite our paper if you use TLM in your work:

@misc{yao2021tlm,
title={NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework},
author={Yao, Xingcheng and Zheng, Yanan and Yang, Xiaocong and Yang, Zhilin},
year={2021}
}
Owner
Xingcheng Yao
Undergraduate student at IIIS, Tsinghua University
Xingcheng Yao
Official PyTorch implementation of "Improving Face Recognition with Large AgeGaps by Learning to Distinguish Children" (BMVC 2021)

Inter-Prototype (BMVC 2021): Official Project Webpage This repository provides the official PyTorch implementation of the following paper: Improving F

Jungsoo Lee 16 Jun 30, 2022
Implementation for Curriculum DeepSDF

Curriculum-DeepSDF This repository is an implementation for Curriculum DeepSDF. Full paper is available here. Preparation Please follow original setti

Haidong Zhu 69 Dec 29, 2022
GLIP: Grounded Language-Image Pre-training

GLIP: Grounded Language-Image Pre-training Updates 12/06/2021: GLIP paper on arxiv https://arxiv.org/abs/2112.03857. Code and Model are under internal

Microsoft 862 Jan 01, 2023
Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space"

MotionCLIP Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space". Please visit our webpage for mor

Guy Tevet 173 Dec 26, 2022
A really easy-to-use and powerful sudoku solver.

SodukuSolver This is a really useful sudoku solver with a Qt gui. USAGE Enter the numbers in and click "RUN"! If you don't want to wait, simply press

Ujhhgtg Teams 11 Jun 02, 2022
Tensorflow implementation and notebooks for Implicit Maximum Likelihood Estimation

tf-imle Tensorflow 2 and PyTorch implementation and Jupyter notebooks for Implicit Maximum Likelihood Estimation (I-MLE) proposed in the NeurIPS 2021

NEC Laboratories Europe 69 Dec 13, 2022
Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

Neural Retrieval Embedding-based Zero-shot Retrieval through Query Generation leverages query synthesis over large corpuses of unlabeled text (such as

Amazon Web Services - Labs 35 Apr 14, 2022
Solve a Rubiks Cube using Python Opencv and Kociemba module

Rubiks_Cube_Solver Solve a Rubiks Cube using Python Opencv and Kociemba module Main Steps Get the countours of the cube check whether there are tota

Adarsh Badagala 176 Jan 01, 2023
Pmapper is a super-resolution and deconvolution toolkit for python 3.6+

pmapper pmapper is a super-resolution and deconvolution toolkit for python 3.6+. PMAP stands for Poisson Maximum A-Posteriori, a highly flexible and a

NASA Jet Propulsion Laboratory 8 Nov 06, 2022
Tools for computational pathology

A toolkit for computational pathology and machine learning. View documentation Please cite our paper Installation There are several ways to install Pa

254 Dec 12, 2022
Deep Learning Based Fasion Recommendation System for Ecommerce

Project Name: Fasion Recommendation System for Ecommerce A Deep learning based streamlit web app which can recommened you various types of fasion prod

BAPPY AHMED 13 Dec 13, 2022
Dirty Pixels: Towards End-to-End Image Processing and Perception

Dirty Pixels: Towards End-to-End Image Processing and Perception This repository contains the code for the paper Dirty Pixels: Towards End-to-End Imag

50 Nov 18, 2022
[KDD 2021, Research Track] DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks

DiffMG This repository contains the code for our KDD 2021 Research Track paper: DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neura

AutoML Research 24 Nov 29, 2022
Convert openmmlab (not only mmdetection) series model to tensorrt

MMDet to TensorRT This project aims to convert the mmdetection model to TensorRT model end2end. Focus on object detection for now. Mask support is exp

JinTian 4 Dec 17, 2021
Official Pytorch Implementation of: "Semantic Diversity Learning for Zero-Shot Multi-label Classification"(2021) paper

Semantic Diversity Learning for Zero-Shot Multi-label Classification Paper Official PyTorch Implementation Avi Ben-Cohen, Nadav Zamir, Emanuel Ben Bar

28 Aug 29, 2022
Stochastic gradient descent with model building

Stochastic Model Building (SMB) This repository includes a new fast and robust stochastic optimization algorithm for training deep learning models. Th

S. Ilker Birbil 22 Jan 19, 2022
Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

DV Lab 410 Jan 03, 2023
Code for Understanding Pooling in Graph Neural Networks

Select, Reduce, Connect This repository contains the code used for the experiments of: "Understanding Pooling in Graph Neural Networks" Setup Install

Daniele Grattarola 37 Dec 13, 2022
YKKDetector For Python

YKKDetector OpenCVを利用した機械学習データをもとに、VRChatのスクリーンショットなどからYKKさん(もとい「幽狐族のお姉様」)を検出できるソフトウェアです。 マニュアル こちらから実行環境のセットアップから解説する詳細なマニュアルをご覧いただけます。 ライセンス 本ソフトウェア

あんふぃとらいと 5 Dec 07, 2021
Download from Onlyfans.com.

OnlySave: Onlyfans downloader Getting Started: Download the setup executable from the latest release. Install and run. Only works on Windows currently

4 May 30, 2022