Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Last update: Jan 03, 2023

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
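
To make "compositional policy" concrete, here is a minimal sketch. It assumes a policy is a list of sub-policies and that each sub-policy is a sequence of (operation, probability, magnitude) triples, as in AutoAugment-style methods. The two toy operations, the function names, and the hand-written policy are illustrative assumptions, not the operations or data structures used in this repository.

```python
import random


def random_delete(tokens, magnitude, rng):
    """Drop each token with probability `magnitude`, keeping at least one."""
    kept = [t for t in tokens if rng.random() >= magnitude]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, magnitude, rng):
    """Swap round(magnitude * len(tokens)) random pairs of positions."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(round(magnitude * len(tokens))):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


# Hypothetical operation registry (toy stand-ins for real augmentation ops).
OPERATIONS = {"random_delete": random_delete, "random_swap": random_swap}


def apply_sub_policy(text, sub_policy, rng):
    """Apply each (operation, probability, magnitude) triple in order."""
    tokens = text.split()
    for name, prob, magnitude in sub_policy:
        if tokens and rng.random() < prob:
            tokens = OPERATIONS[name](tokens, magnitude, rng)
    return " ".join(tokens)


def augment(text, policy, rng):
    """Pick one sub-policy at random and apply it to the text."""
    return apply_sub_policy(text, rng.choice(policy), rng)


# Hand-written policy for illustration; a searched policy would replace it.
policy = [
    [("random_swap", 0.5, 0.2), ("random_delete", 0.3, 0.1)],
    [("random_delete", 0.8, 0.2)],
]
rng = random.Random(0)
augmented = augment("the movie was surprisingly good", policy, rng)
print(augmented)
```

In this sketch the search would tune which operations appear in each sub-policy together with their probabilities and magnitudes; here they are fixed by hand.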

Getting Started

Prepare environment

```
conda create -n taa python=3.6
conda activate taa
conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
```

Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:
- e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
- change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
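
If there are many files to edit, the path changes above could be scripted. The following is a minimal sketch, assuming every config and script contains the original prefix /home/renshuhuai/TextAutoAugment verbatim; the helper names `replace_root` and `update_files` are hypothetical and not part of this repository.

```python
from pathlib import Path


def replace_root(text, old_root, new_root):
    """Return `text` with every occurrence of `old_root` replaced by `new_root`."""
    return text.replace(old_root, new_root)


def update_files(repo_dir, old_root, new_root,
                 patterns=("confs/*.yaml", "script/*.sh")):
    """Rewrite matching files in place and return the paths that changed."""
    changed = []
    for pattern in patterns:
        for path in sorted(Path(repo_dir).glob(pattern)):
            original = path.read_text(encoding="utf-8")
            updated = replace_root(original, old_root, new_root)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
    return changed


# Example call (run from the repository root; adjust the new path first):
# update_files(".", "/home/renshuhuai/TextAutoAugment", "path-to-your-TextAutoAugment")
```

Files that use a different prefix are left untouched, so it is worth checking the returned list against the files you expected to change.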

Search for the best augmentation policy, e.g., in the low-resource regime for IMDB:
```
sh script/imdb_lowresource.sh
```
Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.
Train a model with a pre-searched policy stored in archive.py, e.g., in the low-resource regime for IMDB:
```
python train.py -c confs/bert_imdb.yaml 
```
Train a model on the full IMDB dataset:
```
python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
```
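
The -1 values in the command above suggest that the --train-npc, --valid-npc, and --test-npc flags control subsampling and that -1 turns it off. The sketch below illustrates that reading. It assumes "npc" abbreviates "number per class", which is a guess from the flag names rather than something this README states, and `subsample_per_class` is a hypothetical helper, not a function from train.py.

```python
import random
from collections import defaultdict


def subsample_per_class(examples, npc, seed=0):
    """Keep at most `npc` examples per label; npc == -1 keeps everything."""
    if npc == -1:
        return list(examples)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        # Sample without replacement, capped at the class size.
        subset.extend(rng.sample(items, min(npc, len(items))))
    return subset


# Toy labelled data standing in for a sentiment dataset.
data = [("great film", 1), ("loved it", 1), ("fine", 1), ("awful", 0), ("boring", 0)]
low_resource = subsample_per_class(data, npc=2)   # 2 examples per class
full = subsample_per_class(data, npc=-1)          # no subsampling
print(len(low_resource), len(full))
```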

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

The code draws on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Owner

LancoPKU
