GULAG: GUessing LAnGuages with neural networks

Related tags

Deep Learninggulag
Overview

GULAG: GUessing LAnGuages with neural networks

Main Code style: black Checked with mypy GitHub license GitHub stars

cannon on sparrows

Classify languages in text via neural networks.

> Привет! My name is Egor. Was für ein herrliches Frühlingswetter, хутка расцвітуць дрэвы.
ru -- Привет
en -- My name is Egor
de -- Was für ein herrliches Frühlingswetter
be -- хутка расцвітуць дрэвы

Usage

Use requirements.txt to install necessary dependencies:

pip install -r requirements.txt

After that you can either train model:

python -m src.main train --gin-file config/train.gin

Or run inference:

python -m src.main infer

Training

All training details are covered by PyTorch-Lightning. There are:

Both modules have explicit documentation, see source files for usage details.

Dataset

Since extracting languages from a text is a kind of synthetic task, then there is no exact dataset of that. A possible approach to handle this is to use general multilingual corpses to create a synthetic dataset with multiple languages per one text. Although there is a popular mC4 dataset with large texts in over 100 languages. It is too large for this pet project. Therefore, I used wikiann dataset that also supports over 100 languages including Russian, Ukrainian, Belarusian, Kazakh, Azerbaijani, Armenian, Georgian, Hebrew, English, and German. But this dataset consists of only small sentences for NER classification that make it more unnatural.

Synthetic data

To create a dataset with multiple languages per example, I use the following sampling strategy:

  1. Select number of languages in next example
  2. Select number of sentences for each language
  3. Sample sentences, shuffle them and concatenate into single text

For exact details about sampling algorithm see generate_example method.

This strategy allows training on a large non-repeating corpus. But for proper evaluation during training, we need a deterministic subset of data. For that, we can pre-generate a bunch of texts and then reuse them on each validation.

Model

As a training objective, I selected per-token classification. This automatically allows not only classifying languages in the text, but also specifying their ranges.

The model consists of two parts:

  1. The backbone model that embeds tokens into vectors
  2. Head classifier that predicts classes by embedding vector

As backbone model I selected vanilla BERT. This model already pretrained on large multilingual corpora including non-popular languages. During training on a target task, weights of BERT were frozen to enhance speed.

Head classifier is a simple MLP, see TokenClassifier for details.

Configuration

To handle big various of parameters, I used gin-config. config folder contains all configurations split by modules that used them.

Use --gin-file CLI argument to specify config file and --gin-param to manually overwrite some values. For example, to run debug mode on a small subset with a tiny model for 10 steps use

python -m src.main train --gin-file config/debug.gin --gin-param="train.n_steps = 10"

You can also use jupyter notebook to run training, this is a convenient way to train with Google Colab. See train.ipynb.

Artifacts

All training logs and artifacts are stored on W&B. See voudy/gulag for information about current runs, their losses and metrics. Any of the presented models may be used on inference.

Inference

In inference mode, you may play with the model to see whether it is good or not. This script requires a W&B run path where checkpoint is stored and checkpoint name. After that, you can interact with a model in a loop.

The final model is stored in voudy/gulag/a55dbee8 run. It was trained for 20 000 steps for ~9 hours on Tesla T4.

$ python -m src.main infer --wandb "voudy/gulag/a55dbee8" --ckpt "step_20000.ckpt"
...
Enter text to classify languages (Ctrl-C to exit):
> İrəli! Вперёд! Nach vorne!
az -- İrəli
ru -- Вперёд
de -- Nach vorne
Enter text to classify languages (Ctrl-C to exit):
> Давайте жити дружно
uk -- Давайте жити дружно
> ...

For now, text preprocessing removes all punctuation and digits. It makes the data more robust. But restoring them back is a straightforward technical work that I was lazy to do.

Of course, you can use model from the Jupyter Notebooks, see infer.ipynb

Further work

Next steps may include:

  • Improved dataset with more natural examples, e.g. adopt mC4.
  • Better tokenization to handle rare languages, this should help with problems on the bounds of similar texts.
  • Experiments with another embedders, e.g. mGPT-3 from Sber covers all interesting languages, but requires technical work to adopt for classification task.
Owner
Egor Spirin
DL guy
Egor Spirin
Code for Temporally Abstract Partial Models

Code for Temporally Abstract Partial Models Accompanies the code for the experimental section of the paper: Temporally Abstract Partial Models, Khetar

DeepMind 19 Jul 13, 2022
PyTorch implementation for our paper "Deep Facial Synthesis: A New Challenge"

FSGAN Here is the official PyTorch implementation for our paper "Deep Facial Synthesis: A New Challenge". This project achieve the translation between

Deng-Ping Fan 32 Oct 10, 2022
Annotate with anyone, anywhere.

h h is the web app that serves most of the https://hypothes.is/ website, including the web annotations API at https://hypothes.is/api/. The Hypothesis

Hypothesis 2.6k Jan 08, 2023
TJU Deep Learning & Neural Network

Deep_Learning & Neural_Network_Lab 实验环境 Python 3.9 Anaconda3(官网下载或清华镜像都行) PyTorch 1.10.1(安装代码如下) conda install pytorch torchvision torchaudio cudatool

St3ve Lee 1 Jan 19, 2022
Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Wav2CLIP 🚧 WIP 🚧 Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP 📄 🔗 Ho-Hsiang Wu, Prem Seetharaman

Descript 240 Dec 13, 2022
Using some basic methods to show linkages and transformations of robotic arms

roboticArmVisualizer Python GUI application to create custom linkages and adjust joint angles. In the future, I plan to add 2d inverse kinematics solv

Sandesh Banskota 1 Nov 19, 2021
House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent for Professional Architects

House-GAN++ Code and instructions for our paper: House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent

122 Dec 28, 2022
The codes and models in 'Gaze Estimation using Transformer'.

GazeTR We provide the code of GazeTR-Hybrid in "Gaze Estimation using Transformer". We recommend you to use data processing codes provided in GazeHub.

65 Dec 27, 2022
Tree LSTM implementation in PyTorch

Tree-Structured Long Short-Term Memory Networks This is a PyTorch implementation of Tree-LSTM as described in the paper Improved Semantic Representati

Riddhiman Dasgupta 529 Dec 10, 2022
This is a project based on retinaface face detection, including ghostnet and mobilenetv3

English | 简体中文 RetinaFace in PyTorch Chinese detailed blog:https://zhuanlan.zhihu.com/p/379730820 Face recognition with masks is still robust---------

pogg 59 Dec 21, 2022
Organseg dags - The repository contains the codebase for multi-organ segmentation with directed acyclic graphs (DAGs) in CT.

Organseg dags - The repository contains the codebase for multi-organ segmentation with directed acyclic graphs (DAGs) in CT.

yzf 1 Jun 12, 2022
A series of Jupyter notebooks with Chinese comment that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow.

Hands-on-Machine-Learning 目的 这份笔记旨在帮助中文学习者以一种较快较系统的方式入门机器学习, 是在学习Hands-on Machine Learning with Scikit-Learn and TensorFlow这本书的 时候做的个人笔记: 此项目的可取之处 原书的

Baymax 1.5k Dec 21, 2022
[CVPR 2021 Oral] Variational Relational Point Completion Network

VRCNet: Variational Relational Point Completion Network This repository contains the PyTorch implementation of the paper: Variational Relational Point

PL 121 Dec 12, 2022
This repository contains the code used in the paper "Prompt-Based Multi-Modal Image Segmentation".

Prompt-Based Multi-Modal Image Segmentation This repository contains the code used in the paper "Prompt-Based Multi-Modal Image Segmentation". The sys

Timo Lüddecke 305 Dec 30, 2022
Active Offline Policy Selection With Python

Active Offline Policy Selection This is supporting example code for NeurIPS 2021 paper Active Offline Policy Selection by Ksenia Konyushkova*, Yutian

DeepMind 27 Oct 15, 2022
HistoKT: Cross Knowledge Transfer in Computational Pathology

HistoKT: Cross Knowledge Transfer in Computational Pathology Exciting News! HistoKT has been accepted to ICASSP 2022. HistoKT: Cross Knowledge Transfe

Mahdi S. Hosseini 5 Jan 05, 2023
This repository is an implementation of paper : Improving the Training of Graph Neural Networks with Consistency Regularization

CRGNN Paper : Improving the Training of Graph Neural Networks with Consistency Regularization Environments Implementing environment: GeForce RTX™ 3090

THUDM 28 Dec 09, 2022
Python Fanduel API (2021) - Lineup Automation

Southpaw is a python package that provides access to the Fanduel API. Optimize your DFS experience by programmatically updating your lineups, analyzin

Brandin Canfield 13 Jan 04, 2023
[CVPR 2021] Pytorch implementation of Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs

Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs In this work, we propose a framework HijackGAN, which enables non-linear latent space travers

Hui-Po Wang 46 Sep 05, 2022
Hcpy - Interface with Home Connect appliances in Python

Interface with Home Connect appliances in Python This is a very, very beta inter

Trammell Hudson 116 Dec 27, 2022