Unsupervised Language Model Pre-training for French

Overview

FlauBERT and FLUE

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares the pre-trained models (base and large), the data, the code to use the models, and the code to train them.

Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language.

This repository is still under construction and everything will be available soon.

Table of Contents

1. FlauBERT models
2. Using FlauBERT
    2.1. Using FlauBERT with Hugging Face's Transformers
    2.2. Using FlauBERT with Facebook XLM's library
3. Pre-training FlauBERT
    3.1. Data
    3.2. Training
    3.3. Convert an XLM pre-trained model to Hugging Face's Transformers
4. Fine-tuning FlauBERT on the FLUE benchmark
5. Video presentation
6. Citation

1. FlauBERT models

We have released pretrained weights for the following model sizes.

The pretrained models are available for download from here or via Hugging Face's library.

Model name             Number of layers  Attention heads  Embedding dimension  Total parameters
flaubert-small-cased   6                 8                512                  54 M
flaubert-base-uncased  12                12               768                  137 M
flaubert-base-cased    12                12               768                  138 M
flaubert-large-cased   24                16               1024                 373 M

Note: flaubert-small-cased is only partially trained, so its performance is not guaranteed. Consider using it for debugging purposes only.
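To see how the columns of the table relate to the total parameter count, here is a rough estimate for a BERT-style encoder. It is only a sketch under stated assumptions: the vocabulary size (68,729), the 512 positions, and the 4x feed-forward width are not given in this README and are assumed here; the function name is hypothetical, and the prediction head is ignored. The number of attention heads does not affect the count.

```python
def estimate_parameters(n_layers, emb_dim, vocab_size, max_positions=512, ffn_multiplier=4):
    """Rough parameter count for a BERT-style Transformer encoder."""
    d = emb_dim
    ffn = ffn_multiplier * d
    # Token embeddings + position embeddings + embedding layer norm (weight, bias)
    embeddings = vocab_size * d + max_positions * d + 2 * d
    # Self-attention: query, key, value and output projections, each with a bias
    attention = 4 * (d * d + d)
    # Feed-forward block: two linear layers with biases
    feed_forward = (d * ffn + ffn) + (ffn * d + d)
    # Two layer norms per layer, each with a weight and a bias
    layer_norms = 2 * (2 * d)
    return embeddings + n_layers * (attention + feed_forward + layer_norms)


# ASSUMPTION: vocabulary size of the cased models; not stated in this README.
ASSUMED_VOCAB_SIZE = 68_729

for name, n_layers, emb_dim in [
    ("flaubert-small-cased", 6, 512),
    ("flaubert-base-cased", 12, 768),
    ("flaubert-large-cased", 24, 1024),
]:
    total = estimate_parameters(n_layers, emb_dim, ASSUMED_VOCAB_SIZE)
    print(f"{name}: about {total / 1e6:.0f} M parameters")
```

Under these assumptions the estimates land close to the totals in the table, which suggests that most of the difference between model sizes comes from the embedding dimension and the number of layers.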

Checkpoints for the base (cased/uncased) and large (cased) models are also available from here.

2. Using FlauBERT

In this section, we describe two ways to obtain sentence embeddings from pretrained FlauBERT models: via Hugging Face's Transformers library or via Facebook's XLM library. We will integrate FlauBERT into Facebook's fairseq in the near future.

2.1. Using FlauBERT with Hugging Face's Transformers

You can use FlauBERT with Hugging Face's Transformers library as follows.

import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]

Note: if your transformers version is <=2.10.0, modelname should take one of the following values:

['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
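If a script has to run against both old and new versions of Transformers, the naming rule above can be wrapped in a small helper. This is a minimal sketch: the function name `flaubert_model_name` is hypothetical, and it assumes a plain `major.minor.patch` version string.

```python
# Released variants from the model table: (size, cased)
AVAILABLE = {("small", True), ("base", False), ("base", True), ("large", True)}


def flaubert_model_name(size, cased, transformers_version):
    """Return the model identifier to pass to from_pretrained."""
    if (size, cased) not in AVAILABLE:
        raise ValueError(f"No released FlauBERT model for size={size!r}, cased={cased}")
    casing = "cased" if cased else "uncased"
    # Assumes a version string made of at least three numeric parts, e.g. "2.10.0"
    version = tuple(int(part) for part in transformers_version.split(".")[:3])
    if version <= (2, 10, 0):
        return f"flaubert-{size}-{casing}"
    return f"flaubert/flaubert_{size}_{casing}"


print(flaubert_model_name("base", True, "2.10.0"))
print(flaubert_model_name("base", True, "4.30.2"))
```

In a real script, the third argument would typically be `transformers.__version__`.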

2.2. Using FlauBERT with Facebook XLM's library

The pretrained FlauBERT models are available for download from here. Each compressed folder includes 3 files:

  • *.pth: FlauBERT's pretrained model.
  • codes: BPE codes learned on the training data.
  • vocab: BPE vocabulary file.

Note: The following example only works with the modified XLM provided in this repo; it won't work with the original XLM. The code is taken from this tutorial.

import sys
import torch
import fastBPE

# Add Flaubert root to system path (change accordingly)
FLAUBERT_ROOT = '/home/user/Flaubert'
sys.path.append(FLAUBERT_ROOT)

from xlm.model.embedder import SentenceEmbedder
from xlm.data.dictionary import PAD_WORD


# Paths to model files
model_path = '/home/user/flaubert_base_cased/flaubert_base_cased_xlm.pth'
codes_path = '/home/user/flaubert_base_cased/codes'
vocab_path = '/home/user/flaubert_base_cased/vocab'
do_lowercase = False # Change this to True if you use uncased FlauBERT

bpe = fastBPE.fastBPE(codes_path, vocab_path)

sentences = "Le chat mange une pomme ."
if do_lowercase:
    sentences = sentences.lower()

# Apply BPE
sentences = bpe.apply([sentences])
sentences = [(('</s> %s </s>' % sent.strip()).split()) for sent in sentences]
print(sentences)

# Create batch
bs = len(sentences)
slen = max([len(sent) for sent in sentences])

# Reload pretrained model
embedder = SentenceEmbedder.reload(model_path)
embedder.eval()
dico = embedder.dico

# Prepare inputs to model
word_ids = torch.LongTensor(slen, bs).fill_(dico.index(PAD_WORD))
for i in range(len(sentences)):
    sent = torch.LongTensor([dico.index(w) for w in sentences[i]])
    word_ids[:len(sent), i] = sent
lengths = torch.LongTensor([len(sent) for sent in sentences])

# Get sentence embeddings (corresponding to the BERT [CLS] token)
cls_embedding = embedder.get_embeddings(x=word_ids, lengths=lengths)
print(cls_embedding.size())

# Get the entire output tensor for all tokens
# Note that cls_embedding = tensor[0]
tensor = embedder.get_embeddings(x=word_ids, lengths=lengths, all_tokens=True)
print(tensor.size())
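The example above embeds a single sentence, so no padding is needed. With several sentences of different lengths, the layout of `word_ids` matters: it has shape (slen, bs), so each sentence occupies one column and shorter sentences are padded at the bottom. The following pure-Python sketch illustrates that layout only; the function name and the toy vocabulary ids are hypothetical and do not come from the FlauBERT dictionary.

```python
def build_batch(sentences, word_to_id, pad_id):
    """Lay out tokenized sentences as a (slen, bs) grid padded with pad_id."""
    bs = len(sentences)
    slen = max(len(sent) for sent in sentences)
    # One row per position, one column per sentence
    grid = [[pad_id] * bs for _ in range(slen)]
    for col, sent in enumerate(sentences):
        for row, word in enumerate(sent):
            grid[row][col] = word_to_id[word]
    lengths = [len(sent) for sent in sentences]
    return grid, lengths


# Toy vocabulary with made-up ids, for illustration only
word_to_id = {"</s>": 1, "<pad>": 2, "Le": 10, "chat": 11, "mange": 12}
sentences = [
    ["</s>", "Le", "chat", "mange", "</s>"],
    ["</s>", "chat", "</s>"],
]
grid, lengths = build_batch(sentences, word_to_id, pad_id=word_to_id["<pad>"])
for row in grid:
    print(row)
print(lengths)
```

The `lengths` list plays the same role as the `lengths` tensor in the example above: it tells the model where each column stops being real input.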

3. Pre-training FlauBERT

Install dependencies

You should clone this repo and then install WikiExtractor, fastBPE and Moses tokenizer under tools:

git clone https://github.com/getalp/Flaubert.git
cd Flaubert

# Install toolkit
cd tools
git clone https://github.com/attardi/wikiextractor.git
git clone https://github.com/moses-smt/mosesdecoder.git

git clone https://github.com/glample/fastBPE.git
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

3.1. Data

In this section, we describe the pipeline to prepare the data for training FlauBERT. This is based on Facebook XLM's library. The steps are as follows:

  1. Download, clean, and tokenize data using Moses tokenizer.
  2. Split cleaned data into: train, validation, and test sets.
  3. Learn BPE on the training set. Then apply learned BPE codes to train, validation, and test sets.
  4. Binarize data.
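The four steps map onto the scripts described in the rest of this section. As a rough sketch, the commands can be assembled in Python before running them, which makes it easy to check the arguments first. The function name is hypothetical, the cleaned file name `fr_gutenberg.txt` is made up for illustration, and nothing is executed here.

```python
import os
import shlex


def pipeline_commands(data_dir, corpus_name, cleaned_file, bpe_size, lang="fr"):
    """Return the commands for steps 1 to 4 as argument lists, without running them."""
    # The split files are written next to the cleaned file, so that directory
    # is what the BPE/binarization script receives.
    split_dir = os.path.dirname(cleaned_file)
    return [
        ["./download.sh", data_dir, corpus_name, lang],
        ["./preprocess.sh", data_dir, corpus_name, lang],
        ["bash", "tools/split_train_val_test.sh", cleaned_file],
        ["bash", "tools/create_pretraining_data.sh", split_dir, str(bpe_size)],
    ]


# "fr_gutenberg.txt" is a hypothetical file name used only for illustration.
commands = pipeline_commands(
    data_dir="data",
    corpus_name="gutenberg",
    cleaned_file="data/processed/fr_gutenberg/fr_gutenberg.txt",
    bpe_size=50,
)
for command in commands:
    print(shlex.join(command))
```

Each argument list could then be passed to `subprocess.run(command, check=True)` from the directory that contains the corresponding script.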

(1) Download and Preprocess Data

In the following, replace $DATA_DIR with the path to the local directory where the downloaded data should be saved, and $corpus_name with the name of the corpus you want to download (one of the options specified in the scripts).

To download and preprocess the data, execute the following commands:

./download.sh $DATA_DIR $corpus_name fr
./preprocess.sh $DATA_DIR $corpus_name fr

For example:

./download.sh ~/data gutenberg fr
./preprocess.sh ~/data gutenberg fr

The first command downloads the raw data to $DATA_DIR/raw/fr_gutenberg; the second one processes it and saves the result to $DATA_DIR/processed/fr_gutenberg.

(2) Split Data

Run the following command to split the cleaned corpus into train, validation, and test sets. You can modify the train/validation/test ratio in the script.

bash tools/split_train_val_test.sh $DATA_PATH

where $DATA_PATH is the path to the file to be split.

The output files fr.train, fr.valid, and fr.test are saved in the same directory as the original file.
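For readers who want to see what such a split amounts to, here is a minimal Python sketch. It is not the logic of `split_train_val_test.sh`: the ratios (5% validation, 5% test), the optional shuffling, and the function name are all assumptions made for illustration.

```python
import random


def split_corpus(lines, valid_ratio=0.05, test_ratio=0.05, seed=None):
    """Split a list of lines into train, validation and test portions."""
    lines = list(lines)
    if seed is not None:
        # Shuffle reproducibly so the three sets are drawn from the whole corpus
        random.Random(seed).shuffle(lines)
    n_valid = round(len(lines) * valid_ratio)
    n_test = round(len(lines) * test_ratio)
    n_train = len(lines) - n_valid - n_test
    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]
    return train, valid, test


corpus = [f"phrase {i}" for i in range(100)]
train, valid, test = split_corpus(corpus, seed=0)
print(len(train), len(valid), len(test))
```

Writing the three lists to `fr.train`, `fr.valid`, and `fr.test` would reproduce the file layout expected by the next step.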

(3) & (4) Learn BPE and Prepare Data

Run the following command to learn BPE codes on the training set and apply them to the train, validation, and test sets. The data is then binarized and ready for training.

bash tools/create_pretraining_data.sh $DATA_DIR $BPE_size

where $DATA_DIR is the path to the directory where the three files above (fr.train, fr.valid, fr.test) are saved, and $BPE_size is the BPE vocabulary size in thousands, for example 30 for 30k, 50 for 50k, etc. The output files are saved in $DATA_DIR/BPE/30k or $DATA_DIR/BPE/50k, respectively.
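To give an intuition for what "learning BPE codes" means, the toy sketch below implements the basic byte pair encoding idea in pure Python: repeatedly merge the most frequent adjacent pair of symbols. It is not the fastBPE implementation used by the script, and the word counts and the `</w>` end-of-word marker are assumptions chosen for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair by the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda match: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_counts, num_merges):
    """Return the list of merges (the 'codes') learned from word counts."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


# Toy word counts, made up for illustration
merges = learn_bpe({"chat": 5, "chats": 3, "chaton": 2}, num_merges=3)
print(merges)
```

In the real pipeline the number of merges is on the order of tens of thousands (the $BPE_size argument), and the learned codes are then applied to all three data splits.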

3.2. Training

Our codebase for pretraining FlauBERT is largely based on the XLM repo, with some modifications. You can also use the original XLM code to train FlauBERT; it will work just fine.

Execute the following command to train FlauBERT (base) on your preprocessed data:

python train.py \
    --exp_name flaubert_base_cased \
    --dump_path $dump_path \
    --data_path $data_path \
    --amp 1 \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 16 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'                      

where $dump_path is the path where you want to save your pretrained model, and $data_path is the path to the binarized data sets, for example $DATA_DIR/BPE/50k.
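The `--word_pred` and `--word_mask_keep_rand` options control the masked language modelling objective: 15% of the tokens are selected for prediction, and of those, 80% are replaced by a mask token, 10% are kept unchanged, and 10% are replaced by a random token. The sketch below illustrates that scheme in pure Python. It is not the XLM implementation; the function name, the `<mask>` symbol, and the toy vocabulary are assumptions.

```python
import random


def mask_tokens(tokens, vocab, word_pred=0.15, mask_keep_rand=(0.8, 0.1, 0.1),
                mask_token="<mask>", seed=0):
    """Return (inputs, targets); targets is None where no prediction is made."""
    rng = random.Random(seed)
    p_mask, p_keep, _p_rand = mask_keep_rand
    inputs, targets = [], []
    for token in tokens:
        if rng.random() >= word_pred:
            # Token not selected for prediction: leave it untouched
            inputs.append(token)
            targets.append(None)
            continue
        targets.append(token)
        draw = rng.random()
        if draw < p_mask:
            inputs.append(mask_token)          # replace by the mask symbol
        elif draw < p_mask + p_keep:
            inputs.append(token)               # keep the original token
        else:
            inputs.append(rng.choice(vocab))   # replace by a random token
    return inputs, targets


tokens = "le chat mange une pomme dans le jardin".split()
inputs, targets = mask_tokens(tokens, vocab=["chien", "poire", "dort"], word_pred=0.5)
print(inputs)
print(targets)
```

The demonstration uses `word_pred=0.5` only so that a short sentence shows some masking; the training command above uses 0.15.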

Run experiments on multiple GPUs and/or multiple nodes

To run experiments on multiple GPUs on a single machine, you can use the following command (the parameters after train.py are the same as above).

export NGPU=4
export CUDA_VISIBLE_DEVICES=0,1,2,3 # if you only use some of the GPUs in the machine
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

To run experiments on multiple nodes and multiple GPUs on a cluster that uses SLURM as its resource manager, request resources with #SBATCH and then launch training with the following command (the parameters after train.py are the same as above, plus the --master_port parameter).

srun python train.py
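When scaling to several GPUs or nodes, it helps to keep track of how much data goes into each optimizer update. The sketch below does that arithmetic for the settings used above. It assumes that each GPU process handles its own batch of `--batch_size` sequences, that gradients are accumulated over `--accumulate_gradients` batches before each update, and that every sequence is filled to `--bptt` tokens, so the token count is an upper bound. The function name is hypothetical.

```python
def update_size(batch_size, bptt, accumulate_gradients, n_gpus_per_node=1, n_nodes=1):
    """Return (sequences, tokens) contributing to one optimizer update."""
    n_processes = n_gpus_per_node * n_nodes
    sequences = batch_size * accumulate_gradients * n_processes
    tokens = sequences * bptt  # upper bound: assumes full-length sequences
    return sequences, tokens


# Settings from the training command above, on a single node with NGPU=4
sequences, tokens = update_size(
    batch_size=16, bptt=512, accumulate_gradients=16, n_gpus_per_node=4
)
print(f"{sequences} sequences, up to {tokens} tokens per update")
```

If the number of GPUs changes, adjusting `--accumulate_gradients` so that this product stays constant is one way to keep the optimization settings comparable.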

3.3. Convert an XLM pre-trained model to Hugging Face's Transformers

To convert an XLM pre-trained model to Hugging Face's Transformers, you can use the following command.

python tools/use_flaubert_with_transformers/convert_to_transformers.py --inputdir $inputdir --outputdir $outputdir

where $inputdir is the path to the XLM pretrained model directory, and $outputdir is the path to the output directory where you want to save the Hugging Face Transformers model.

4. Fine-tuning FlauBERT on the FLUE benchmark

FLUE (French Language Understanding Evaluation) is a general benchmark for evaluating French NLP systems. Please refer to this page for an example of fine-tuning FlauBERT on this benchmark.

5. Video presentation

You can watch this 7-minute video presentation of FlauBERT: https://www.youtube.com/watch?v=NgLM9GuwSwc

6. Citation

If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:

LREC paper

@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}

TALN paper

@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}

Owner: GETALP (Study Group for Machine Translation and Automated Processing of Languages and Speech)