Crosslingual Segmental Language Model

This repository contains the code for "Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages" (C.M. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral, 2021). The code here is a modified version of the repository from the original MSLM paper. The mslm package can be used to train and apply Segmental Language Models (SLMs).

In this repository, we additionally make available our preparation of the AmericasNLP 2021 multilingual dataset (see Data/AmericasNLP) and the target K'iche' data (Data/GlobalClassroom).

Paper Results

The results from the accompanying paper are in the Output directory: *.csv files contain statistics from the training run, *.out files contain the model output for the entire corpus, and *.score files contain the segmentation scores of that output.
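As a convenience for browsing these results, the following is a minimal sketch that groups the files under an output directory by the three extensions described above. The helper name group_outputs is hypothetical and not part of the mslm package, and the sketch assumes nothing about the internal layout of Output beyond the file extensions.

```python
from collections import defaultdict
from pathlib import Path

# Extensions described in this README: training statistics, model output, scores.
RESULT_SUFFIXES = {".csv", ".out", ".score"}


def group_outputs(output_dir):
    """Group result files under output_dir by extension (hypothetical helper)."""
    groups = defaultdict(list)
    # rglob walks subdirectories too, since the directory layout is not assumed.
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix in RESULT_SUFFIXES:
            groups[path.suffix].append(path)
    return dict(groups)
```

For example, group_outputs("Output").get(".score", []) would list every score file found, or an empty list if there are none.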

The results from the October 2021 pre-print (referred to here as Experiment Set A) are reproducible on commit 2b89575, which we consider the official commit for that pre-print.

Usage

The top-level scripts for training and experimentation can be found in RunScripts. Almost all functionality is run through the __main__.py script in the mslm package, which can either train or evaluate/use a model. The PyTorch modules for building SLMs can be found in mslm.segmental_lm, modules for the span-masking Transformer are in mslm.segmental_transformer, and modules for sequence lattice-based computations are in mslm.lattice. The main script takes in a configuration object to set most parameters for model training and use (see mslm.mslm_config). For information on the arguments to the main script:

python -m mslm --help
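To give a sense of what lattice-based computations over a sequence involve, here is a generic sketch of the dynamic programs commonly used with segmental models: a forward pass that sums over every segmentation of a sequence, and a Viterbi pass that recovers the single best one. This is an illustration only, not the API of mslm.lattice. The function names, the seg_log_prob callback, and the max_seg_len parameter are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-scores."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_marginal(seq_len, max_seg_len, seg_log_prob):
    """Log of the summed score of all segmentations (forward algorithm).

    seg_log_prob(start, end) is a hypothetical callback returning the
    log-score of the segment covering positions start..end-1.
    """
    # alpha[t] accumulates every segmentation of the first t symbols.
    alpha = [NEG_INF] * (seq_len + 1)
    alpha[0] = 0.0
    for end in range(1, seq_len + 1):
        first_start = max(0, end - max_seg_len)
        alpha[end] = logsumexp(
            [alpha[start] + seg_log_prob(start, end)
             for start in range(first_start, end)]
        )
    return alpha[seq_len]


def best_segmentation(seq_len, max_seg_len, seg_log_prob):
    """Highest-scoring segmentation as a list of (start, end) spans (Viterbi)."""
    best = [NEG_INF] * (seq_len + 1)
    back = [0] * (seq_len + 1)
    best[0] = 0.0
    for end in range(1, seq_len + 1):
        for start in range(max(0, end - max_seg_len), end):
            score = best[start] + seg_log_prob(start, end)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow back-pointers from the end of the sequence to position 0.
    spans, end = [], seq_len
    while end > 0:
        spans.append((back[end], end))
        end = back[end]
    return spans[::-1]
```

In a neural segmental model the segment scores would come from the network and the recursion would typically be batched on tensors; the plain-Python version above only shows the shape of the computation.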

Environment setup

pip install -r requirements.txt

This code requires Python 3.6 or later.

Training

./RunScripts/run_mslm.sh

python -m mslm --input_file <input_file> \
    --model_path <model_path> \
    --mode train \
    --config_file <config_file> \
    --dev_file <dev_file> \
    [--preexisting]
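When launching training from Python rather than from the shell, the same invocation can be assembled as an argument list. The sketch below uses only the flags shown in the command above. The function name build_train_command is hypothetical, and any paths passed to it are placeholders supplied by the caller.

```python
import sys


def build_train_command(input_file, model_path, config_file, dev_file,
                        preexisting=False):
    """Assemble the training invocation as an argv list (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "mslm",
        "--input_file", input_file,
        "--model_path", model_path,
        "--mode", "train",
        "--config_file", config_file,
        "--dev_file", dev_file,
    ]
    # --preexisting is the optional flag shown in brackets in the command.
    if preexisting:
        cmd.append("--preexisting")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems when paths contain spaces.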

Evaluation

./RunScripts/eval_mslm.sh

Evaluation also takes a text file containing all of the words from the training set; see ./RunScripts/eval_mslm.sh for how it is passed.


Owner

C.M. Downey
