Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Overview

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview of IAIS

Overview

  1. We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.

  2. In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.

Getting Started

  1. Prepare environment

    conda create -n taa python=3.6
    conda activate taa
    conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    pip install -r requirements.txt 
    python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
  2. Modify dataroot parameter in confs/*yaml and abspath parameter in script/*.sh:

    • e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
    • change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
  3. Search for the best augmentation policy, e.g., low-resource regime for IMDB:

    sh script/imdb_lowresource.sh

    scripts for policy search in the low-resource and class-imbalanced regime for all datasets are provided in the script/ fold.

  4. Train a model with pre-searched policy in archive.py, e.g., train model in low-resource regime for IMDB:

    python train.py -c confs/bert_imdb.yaml 

    train model on full dataset of IMDB:

    python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

Code refers to: fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren, Jinchao Zhang, Lei Li, Xu Sun, Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University
LancoPKU
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Riccardo Orlando 27 Nov 20, 2022
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
A Flask Sentiment Analysis API, with visual implementation

The Sentiment Analysis Api was created using python flask module,it allows users to parse a text or sentence throught the (?text) arguement, then view the sentiment analysis of that sentence. It can

Ifechukwudeni Oweh 10 Jul 17, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

Dani El-Ayyass 23 Sep 05, 2022
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediğiniz yayıncıların, Twitch'den sızdırılan 125 GB'lik veriye dayanarak, 2019-2021 arası aylık gelirlerini

4 Nov 11, 2021
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022
An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Extension - matrix and vocabulary extractor for TF-IDF and Doc2Vec An extension for ASReview that adds a tf-idf extractor that saves the matrix and th

ASReview 4 Jun 17, 2022
Autoregressive Entity Retrieval

The GENRE (Generative ENtity REtrieval) system as presented in Autoregressive Entity Retrieval implemented in pytorch. @inproceedings{decao2020autoreg

Meta Research 611 Dec 16, 2022
Implementation of legal QA system based on SentenceKoBART

LegalQA using SentenceKoBART Implementation of legal QA system based on SentenceKoBART How to train SentenceKoBART Based on Neural Search Engine Jina

Heewon Jeon(gogamza) 75 Dec 27, 2022
This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Twitter COVID-19 Sentiment Analysis Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold Pro

4 Oct 15, 2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

Clova AI Research 94 Dec 30, 2022
Levenshtein and Hamming distance computation

distance - Utilities for comparing sequences This package provides helpers for computing similarities between arbitrary sequences. Included metrics ar

112 Dec 22, 2022
Word Bot for JKLM Bomb Party

Word Bot for JKLM Bomb Party A bot for Bomb Party on https://www.jklm.fun (Only English) Requirements pynput pyperclip pyautogui Usage: Step 1: Run th

Nicolas 7 Oct 30, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022