SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Last update: Dec 25, 2022

Overview

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

I find that when training a transformer, the embedding matrix moves slowly, so it is difficult for the model to escape its initial noisy embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
(after 1 step, the directions of the embedding vectors have barely moved, because each value changes by only ~LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values and putting another LayerNorm right after it (before all the SA & FFN layers):

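# During weight initialization: give the embedding a tiny uniform init.
if isinstance(module, nn.Embedding):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
# In the block's forward pass: only the first block normalizes the raw embedding.
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb))
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))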
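
To show where the two fragments above could sit in a full model, here is a minimal runnable sketch. The class names `Block` and `TinyModel`, the sizes, and the use of `nn.MultiheadAttention` and a two-layer MLP as the SA and FFN sub-layers are assumptions made for illustration; they are not taken from the original code, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN block; layer 0 additionally normalizes the raw embedding."""
    def __init__(self, n_embd, n_head, layer_id, use_small_emb=True):
        super().__init__()
        self.layer_id = layer_id
        self.use_small_emb = use_small_emb
        if use_small_emb and layer_id == 0:
            self.lnPre = nn.LayerNorm(n_embd)  # LN(SmallInit(Emb))
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # Stand-ins for the SA and FFN sub-layers (assumptions, no causal mask).
        self.att = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        if self.use_small_emb and self.layer_id == 0:
            x = self.lnPre(x)
        h = self.ln1(x)
        x = x + self.att(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x

class TinyModel(nn.Module):
    def __init__(self, vocab_size=100, n_embd=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.ModuleList([Block(n_embd, n_head, i) for i in range(n_layer)])
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return x

model = TinyModel()
out = model(torch.randint(0, 100, (2, 8)))  # (batch, time) -> (batch, time, n_embd)
print(out.shape)
```

Only the first block owns an `lnPre`, so the extra cost is a single LayerNorm for the whole model.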

This improves convergence (especially for BPE models) because the model can quickly escape the tiny initial embedding: a small change after 1 step -> a significant change in direction -> a significant change after LayerNorm.
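
The direction argument can be checked numerically with a rough sketch in plain Python. It assumes that the first optimizer step moves every element by exactly ±LR with a random sign (roughly what an Adam-style first step does), and that the baseline init is uniform in [-0.02, 0.02]; both are simplifying assumptions for illustration, not measurements from the runs above. The function name and the vector size are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direction_kept_after_one_step(init_scale, lr=4e-4, dim=512, seed=0):
    """Cosine similarity of one embedding vector before and after one step."""
    rng = random.Random(seed)
    before = [rng.uniform(-init_scale, init_scale) for _ in range(dim)]
    # Assumption: each element moves by +/- lr with a random sign.
    after = [w + lr * rng.choice((-1.0, 1.0)) for w in before]
    return cosine(before, after)

cos_default = direction_kept_after_one_step(init_scale=0.02)  # assumed baseline init
cos_small = direction_kept_after_one_step(init_scale=1e-4)    # SmallInit(Emb)
print(f"init scale 2e-2: cosine(before, after) = {cos_default:.4f}")
print(f"init scale 1e-4: cosine(before, after) = {cos_small:.4f}")
```

Under these assumptions the baseline vector keeps almost exactly its old direction, while the tiny-init vector points somewhere largely new after a single step. LayerNorm then rescales that new direction, so the layers above see a substantially different input.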

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(the gap between LayerNorm(SmallEmb) and the baseline persists after more training)

Moreover, with SmallInit(Emb) you can train PostLN models directly, without warmup:

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as the lnPre in the above PreLN code
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(note: you should add another LN after the final FFN)
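
As a sketch of where that final LN goes, the snippet below stacks blocks that follow the same order of operations as the code above and applies one more LayerNorm after the last block. The names `PostLNBlock`, `TinyPostLNModel`, and `ln_out`, the sizes, and the `nn.MultiheadAttention` and MLP sub-layers are assumptions for illustration only; causal masking is omitted.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Block with the same order of operations as the snippet above."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # Stand-ins for the SA and FFN sub-layers (assumptions, no causal mask).
        self.att = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = self.ln1(x)  # in the first block this normalizes the tiny embedding
        x = x + self.att(x, x, x, need_weights=False)[0]
        x = self.ln2(x)
        x = x + self.ffn(x)
        return x

class TinyPostLNModel(nn.Module):
    def __init__(self, vocab_size=100, n_embd=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)
        self.blocks = nn.ModuleList([PostLNBlock(n_embd, n_head) for _ in range(n_layer)])
        self.ln_out = nn.LayerNorm(n_embd)  # the extra LN after the final FFN
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))

postln_model = TinyPostLNModel()
logits = postln_model(torch.randint(0, 100, (2, 8)))
print(logits.shape)
```

Without `ln_out`, the output of the last FFN residual would reach the output head unnormalized, which is why the note above asks for one more LN at the end.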

Owner

PENG Bo
