Transformer on a Diet

Reference: C Wang, Z Ye, A Zhang, Z Zhang, A Smola. "Transformer on a Diet". arXiv preprint arXiv:2002.06170 (2020).

Installation

$ pip install --pre --upgrade mxnet
$ pip install gluonnlp
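
As a quick sanity check after installation, the sketch below reports which of the two packages cannot be located by the current Python interpreter. It uses only the standard library; the helper name `missing_packages` is made up for this example and is not part of this repository.

```python
import importlib.util


def missing_packages(names=("mxnet", "gluonnlp")):
    """Return the package names that cannot be found by this interpreter.

    Hypothetical helper for illustration; it only checks that each package
    can be located, it does not import it or check its version.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("mxnet and gluonnlp are both available.")
```

If either name is reported as missing, rerun the corresponding pip command in the same environment that will run the training script.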

Results

The results on the PTB dataset, together with the commands to reproduce them, are as follows.

[1] Full (Val PPL 109.19 Test PPL 103.72)

$ cd scripts/language_model/
$ python transformer_language_model.py --model full --data ptb --emsize 320 --nhid 2000 --nlayers 3 --lr 10 --epochs 500 --batch_size 20 --bptt 70 --dropout 0.4 --dropout_h 0.25 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --alpha 0 --beta 0 --lr_update_interval 100 --lr_update_factor 1 --num_heads 16 --scaled --units 320 --use_residual --max_src_length 1000 --warmup_steps 0 --first_window_size 1 --kernel_size 3 --d_base 2

[2] Dilated (Val PPL 115.67 Test PPL 110.92)

$ cd scripts/language_model/
$ python transformer_language_model.py --model dilated --data ptb --emsize 320 --nhid 2000 --nlayers 3 --lr 10 --epochs 500 --batch_size 20 --bptt 70 --dropout 0.4 --dropout_h 0.25 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --alpha 0 --beta 0 --lr_update_interval 100 --lr_update_factor 1 --num_heads 16 --scaled --units 320 --use_residual --max_src_length 1000 --warmup_steps 0 --first_window_size 1 --kernel_size 3 --d_base 2

[3] Dilated-Memory (Val PPL 115.35 Test PPL 110.98)

$ cd scripts/language_model/
$ python transformer_language_model.py --model dilated_mem --data ptb --emsize 320 --nhid 2000 --nlayers 3 --lr 10 --epochs 500 --batch_size 20 --bptt 70 --dropout 0.4 --dropout_h 0.25 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --alpha 0 --beta 0 --lr_update_interval 100 --lr_update_factor 1 --num_heads 16 --scaled --units 320 --use_residual --max_src_length 1000 --warmup_steps 0 --first_window_size 1 --kernel_size 3 --d_base 2

[4] Cascade (Val PPL 109.16 Test PPL 105.27)

$ cd scripts/language_model/
$ python transformer_language_model.py --model cascade --data ptb --emsize 320 --nhid 2000 --nlayers 3 --lr 10 --epochs 500 --batch_size 20 --bptt 70 --dropout 0.4 --dropout_h 0.25 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --alpha 0 --beta 0 --lr_update_interval 100 --lr_update_factor 1 --num_heads 16 --scaled --units 320 --use_residual --max_src_length 1000 --warmup_steps 0 --first_window_size 4 --window_size_multiplier 2 --kernel_size 3 --d_base 2
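
The four commands above differ only in the --model value and, for cascade, in the window flags (--first_window_size 4 plus --window_size_multiplier 2). The sketch below rebuilds each command from the shared flags so that the differences are easier to see. The flag values are copied from the commands above; the helper `build_command` and the constant names are assumptions made for this example and do not exist in the repository.

```python
import shlex

# Flags shared by all four PTB runs, copied from the commands above.
SHARED_FLAGS = (
    "--data ptb --emsize 320 --nhid 2000 --nlayers 3 --lr 10 --epochs 500 "
    "--batch_size 20 --bptt 70 --dropout 0.4 --dropout_h 0.25 --dropout_i 0 "
    "--dropout_e 0 --weight_drop 0 --tied --alpha 0 --beta 0 "
    "--lr_update_interval 100 --lr_update_factor 1 --num_heads 16 --scaled "
    "--units 320 --use_residual --max_src_length 1000 --warmup_steps 0"
)
TAIL_FLAGS = "--kernel_size 3 --d_base 2"

# Window flags per model; only cascade differs from the other three.
WINDOW_FLAGS = {
    "full": "--first_window_size 1",
    "dilated": "--first_window_size 1",
    "dilated_mem": "--first_window_size 1",
    "cascade": "--first_window_size 4 --window_size_multiplier 2",
}


def build_command(model):
    """Return the argument list for one model variant (hypothetical helper)."""
    if model not in WINDOW_FLAGS:
        raise ValueError(f"unknown model: {model!r}")
    args = ["python", "transformer_language_model.py", "--model", model]
    args += shlex.split(SHARED_FLAGS)
    args += shlex.split(WINDOW_FLAGS[model])
    args += shlex.split(TAIL_FLAGS)
    return args


if __name__ == "__main__":
    # Print the commands; run them from scripts/language_model/.
    for name in WINDOW_FLAGS:
        print(" ".join(build_command(name)))
```

The returned list can be passed to subprocess.run with cwd set to scripts/language_model/ if the runs should be launched from Python rather than pasted into a shell.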

Note that the commands to reproduce the results on WikiText-2 will be added soon.

Reference Paper

The BibTeX entry for the reference paper is:

@article{transformerdiet2020,
   title={Transformer on a Diet},
   author={Chenguang Wang and Zihao Ye and Aston Zhang and Zheng Zhang and Alexander J. Smola},
   journal={ArXiv},
   year={2020},
   volume={abs/2002.06170}
}

Code repo for "Transformer on a Diet" paper