The RWKV Language Model

Last update: Jan 05, 2023

Overview

RWKV-LM

We propose the RWKV language model, with alternating time-mix and channel-mix layers:

$\begin{align*} \text{Time-mix :} && \text{TM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_{u} &&\textbf{W}_{t,u,c} &&\cdot&& \text{softmax}_t(\text{K}_{u,c}) &&\cdot&& \text{V}_{u,c}\\ \text{Channel-mix :} && \text{CM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_d &&\textbf{W}_{c,d} &&\cdot&& \text{gelu}(\text{K}_{t,d}) &&\cdot&& \text{V}_{t,d} \end{align*}$

R, K, and V are generated by linear transforms of the input, and W is a parameter. The idea of RWKV is to decompose attention into R(target) * W(src, target) * K(src). We therefore call R the "receptance", and the sigmoid keeps it in the 0~1 range.
The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103), with the following differences.

(1) We changed the normalization (denominator). For masked language models, we define:

$\text{softmax}_t(\text{K}_{u,c}) = \frac{\exp(\text{K}_{u,c})}{\sum_{v \leq t}\exp(\text{K}_{v,c})}$

(2) We decompose W_{t,u,c} and introduce multi-head W (here h is the corresponding head of c):

$W_{t,u,c}=f_h(t-u)\cdot \alpha_h(u) \cdot \beta_h(t)$

(3) You don't need LayerNorm for Time-mix. In fact, the model converges faster when LayerNorm is removed.

Moreover, we multiply the final output of the Time-mix layer by γ(t). The context size is smaller when t is small, and the α, β, and γ factors compensate for this.
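As a reading aid, the Time-mix formula can be written out index by index in plain Python. This is a minimal sketch and not the repository's implementation: the function names (`make_w`, `time_mix`), the list-based layout, the `head_of` mapping from channel to head, and the restriction of the sum to u ≤ t (causal masking) are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_w(f, alpha, beta, head_of):
    """Return w(t, u, c) = f_h(t - u) * alpha_h(u) * beta_h(t), with h = head_of[c].

    f[h][k], alpha[h][u], beta[h][t] are per-head tables (hypothetical layout).
    """
    def w(t, u, c):
        h = head_of[c]
        return f[h][t - u] * alpha[h][u] * beta[h][t]
    return w


def time_mix(R, K, V, w, gamma):
    """TM[t][c] = gamma[t] * sigmoid(R[t][c]) * sum_{u<=t} w(t,u,c) * softmax_t(K[u][c]) * V[u][c].

    R, K, V are T x C nested lists; the sum over u is assumed to be causal (u <= t).
    """
    T, C = len(R), len(R[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            # softmax_t normalizes over positions v <= t; subtracting the max
            # leaves the ratio unchanged and avoids overflow in exp().
            m = max(K[v][c] for v in range(t + 1))
            denom = sum(math.exp(K[v][c] - m) for v in range(t + 1))
            acc = 0.0
            for u in range(t + 1):
                acc += w(t, u, c) * (math.exp(K[u][c] - m) / denom) * V[u][c]
            out[t][c] = gamma[t] * sigmoid(R[t][c]) * acc
    return out


# Tiny example: 3 positions, 2 channels, 1 head, all factors set to 1.
T, C = 3, 2
ones = [[1.0] * T]
w = make_w(f=ones, alpha=ones, beta=ones, head_of=[0, 0])
R = [[0.0] * C for _ in range(T)]
K = [[0.0] * C for _ in range(T)]
V = [[2.0] * C for _ in range(T)]
out = time_mix(R, K, V, w, gamma=[1.0] * T)
```

With every factor equal to 1 and constant V, the softmax weights sum to 1 at each t, so the output reduces to sigmoid(R) times V. The loops cost O(T²) per channel, which is fine for checking the formula but not for training.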

The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor.
Finally, we add extra time-mixing as in (https://github.com/BlinkDL/minGPT-tuned). You can try reducing the amount of time-mixing in the upper layers of deep models.
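The Channel-mix formula can be spelled out the same way. The sketch below is illustrative only: `channel_mix` and the nested-list shapes (R is T x C, K and V are T x D, W is C x D) are assumptions, and the exact erf-based GELU is used here although the formula does not say which GELU variant is intended.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    # Exact GELU via the Gaussian CDF (assumed variant).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def channel_mix(R, K, V, W):
    """CM[t][c] = sigmoid(R[t][c]) * sum_d W[c][d] * gelu(K[t][d]) * V[t][d]."""
    T, C = len(R), len(R[0])
    D = len(K[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        # GeGLU-style gated hidden vector for position t.
        hidden = [gelu(K[t][d]) * V[t][d] for d in range(D)]
        for c in range(C):
            mixed = sum(W[c][d] * hidden[d] for d in range(D))
            out[t][c] = sigmoid(R[t][c]) * mixed  # the extra R factor
    return out


# One position, one output channel, one hidden channel.
cm = channel_mix(R=[[0.0]], K=[[1.0]], V=[[2.0]], W=[[3.0]])
```

Unlike Time-mix, each position t is processed independently here; the only difference from a plain GeGLU feed-forward block in this sketch is the final multiplication by sigmoid(R).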

We also propose a new sampling method (as in src/utils.py):

(1) Find the max probability p_max after softmax.

(2) Remove all entries whose probability is lower than 0.02 * pow(p_max, 2)

(3) Feel free to tune the factor 0.02 and the exponent 2.
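The steps above can be sketched in a few lines of Python. This is not the code from src/utils.py: the function names, the `scale` and `power` parameter names, and the renormalize-then-sample step after filtering are assumptions added to make the example complete.

```python
import random


def filter_probs(probs, scale=0.02, power=2.0):
    """Zero out entries below scale * p_max**power, then renormalize.

    `probs` is a list of probabilities after softmax. Renormalizing the
    survivors is an assumption; the text only specifies the removal rule.
    """
    p_max = max(probs)                      # step (1)
    threshold = scale * p_max ** power      # step (2)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)                       # p_max always survives, so total > 0
    return [p / total for p in kept]


def sample(probs, rng, scale=0.02, power=2.0):
    """Draw one index from the filtered distribution."""
    filtered = filter_probs(probs, scale, power)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


# Example: with p_max = 0.9 the threshold is 0.02 * 0.81 = 0.0162,
# so the entry 0.01 is removed and the other two are kept.
probs = [0.9, 0.09, 0.01]
filtered = filter_probs(probs)
token = sample(probs, random.Random(0))
```

Because the threshold depends on p_max, the cut is aggressive when the model is confident and permissive when the distribution is flat: with p_max = 0.1 the threshold drops to 0.0002.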

Training loss, RWKV vs MHA+Rotary+GeGLU:

(this is character-level loss with simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip)

Comments

Sequence to Sequence?

Hey @BlinkDL! Awesome project!

I was wondering if you have performed any Seq-2-Seq experiments with it? Any reason for going with a GPT model in the first place as opposed to something like T5 (standard Transformer)? Any direction on what changes would be required to make a standard encoder-decoder architecture with RWKV?

Also, is there any report on in-context-learning/FSL capability of the latest trained model?

opened by SushantDaga 2

v4 model.py vs model_run.py

Hi, thanks for this awesome repo! I'm trying to understand the code and found that the v4 folder contains model.py and model_run.py, which define GPT and RWKV_GPT respectively, and these use different initialization methods. Could you elaborate on when each one should be used? Thanks in advance!

opened by jingweiz 3

RWKV-4 169m/430m in browser with ORT Web / TF.js / tfjs-tflite?

Hi, really exciting project! I'm wondering if you've published the model conversion script that you used to create the js_models files from the .pth model file? It would be awesome to see how the larger and newer models like RWKV-4 169m/430m perform in the browser! I think the inference speed of RWKV opens up many new possibilities for language models on the web.

opened by josephrocca 32

CUDA compilation error with Ctx Length>2000

Hello, I am trying out RWKV with audio modality and when I set T_MAX>>1000, it throws this error:

Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
FAILED: timex_cuda.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
ptxas error   : Entry function '_Z15kernel_backwardIfEvPKT_S2_S2_PS0_S3_iii' uses too much shared data (0x30d40 bytes, 0xc000 max)
ptxas error   : Entry function '_Z14kernel_forwardIfEvPKT_S2_PS0_S0_iii' uses too much shared data (0x57e40 bytes, 0xc000 max)
ninja: build stopped: subcommand failed.

GPU: A100, VRAM: 42GB, CUDA 11.6

I am okay if the training takes a bit long. But I need this to work. Don't know any CUDA. Can you suggest some workarounds?

Thanks for the incredible work btw!

opened by ojus1 8
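The numbers in the two ptxas messages of this issue can be decoded with a little arithmetic. The sketch below only reads the log; it assumes, without confirmation from the kernel source, that shared-memory use grows linearly with Tmax at fixed BF and BB, and the helper name `max_tmax` is hypothetical.

```python
SHARED_LIMIT = 0xC000   # limit reported by ptxas: 49152 bytes (48 KiB)
TMAX = 10000            # from -DTmax=10000 in the nvcc command line

# Shared data reported per kernel, copied from the log.
usage = {
    "kernel_forward": 0x57E40,    # 360000 bytes
    "kernel_backward": 0x30D40,   # 200000 bytes
}


def max_tmax(used_bytes, tmax=TMAX, limit=SHARED_LIMIT):
    """Largest Tmax that fits under `limit`, ASSUMING usage is proportional to Tmax."""
    return limit * tmax // used_bytes


for name, used in usage.items():
    per_step = used / TMAX  # bytes of shared data per unit of Tmax
    print(f"{name}: {used} bytes, {per_step:.0f} bytes per Tmax step, "
          f"estimated max Tmax {max_tmax(used)}")
```

Under that linearity assumption, both kernels at Tmax=10000 ask for several times the 48 KiB available, so the build fails before training starts regardless of how much VRAM the GPU has. The estimate is only a rough guide to what these compile flags allow, not a fix.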

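About using the model for classification tasks

Hello author! I am very interested in this work, because I am currently using transformer-based models for classification tasks. In classification, a transformer or RNN usually takes the last element of each channel of the final block as the output and maps it to the classes through a fully connected layer. Do you think RWKV works on a similar principle? Is it still safe to take the last element as the output? I would be very grateful for any advice!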

opened by louisinhit 2
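For concreteness, the pooling scheme described in this question (take the final block's output at the last position, then apply a fully connected layer) looks roughly like the sketch below. It only illustrates the question; it makes no claim about whether this is the recommended way to use RWKV, and `classify_last_step` and its argument layout are hypothetical.

```python
def classify_last_step(hidden, weight, bias):
    """Map the last time step of `hidden` (T x C) to class logits.

    weight is num_classes x C and bias has num_classes entries, i.e. a
    plain fully connected layer applied to hidden[-1].
    """
    last = hidden[-1]  # one value per channel at the final position
    return [
        sum(w * x for w, x in zip(row, last)) + b
        for row, b in zip(weight, bias)
    ]


# Two positions, two channels, three classes.
hidden = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 1.0]
logits = classify_last_step(hidden, weight, bias)
```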

Releases(4.00)

4.00(Dec 6, 2022)

Just a stable release.

2.00(Mar 25, 2022)

Attached model : ctx1024-layer6-emb512 on enwik8 with 1.65 dev perplexity (0.72 BPC)
enwik8-ppl1.65-6064-1024-RWKV-6-512-2022-03-25-21-05-13.zip(95.42 MB)

0.02(Aug 25, 2021)

v0.02 with RWKV, MHA_shift, MHA_rotary, MHA_pro, and time-shift mixing.

0.01(Aug 13, 2021)

First release with RWKV, MHA_rotary, MHA_pro, and time-shift mixing.

Owner

PENG Bo

http://zhihu.com/people/bopengbopeng

GitHub Repository

AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022

hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a Python module that performs a variety of text decoding and encoding functions. It also provides various functions for encrypting and decrypting text using various ciphers.

5 Jul 17, 2022

All the code I wrote for Overwatch-related projects that I still own the rights to.

overwatch_shit.zip This is (eventually) going to contain all the software I wrote during my five-year imprisonment stay playing Overwatch. I'll be add

2 Dec 31, 2021

Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with built-in vectorization a

191 Dec 26, 2022

Phomber is an information gathering tool, written in python3, that reverse-searches phone numbers and gets their details.

An information gathering tool that reverse-searches phone numbers and gets their details! What is phomber? Phomber is one of the best tools available fo

121 Dec 27, 2022

Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

2.1k Jan 01, 2023

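CCF BDCI BERT system tuning competition baseline (PyTorch version)

CCF BDCI BERT system tuning competition baseline (PyTorch version). This version is implemented with the PyTorch backend of huggingface. Because this implementation uses Oneflow's dataloader to read the data, Oneflow must also be installed. For data reading in other frameworks, see OneflowDataloade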

9 Oct 13, 2022

A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

1 Oct 15, 2021

Chinese named entity recognition (bert/roberta/macbert/bert_wwm with Keras)

2 Jul 05, 2022

This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

230 Nov 22, 2022

Using BERT-based models for toxic span detection

SemEval 2021 Task 5: Toxic Spans Detection: Task: Link to SemEval-2021: Task 5 Toxic Span Detection is https://competitions.codalab.org/competitions/2

1 Jan 04, 2022

An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022

Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

2 Oct 10, 2022

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

94 Dec 30, 2022

Active learning for text classification in Python

Active Learning allows you to efficiently label training data in a small-data scenario.

375 Dec 28, 2022

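Code that uses the pyupbit library to trade Bitcoin automatically on upbit. Detailed lecture videos are available on the 조코딩 YouTube channel.

Python Bitcoin investment automation lecture code by the 조코딩 YouTube channel. Code that uses the pyupbit library to trade Bitcoin automatically on the upbit exchange. File layout: test.py : balance lookup (lecture 1) backtest.py : backtesting code (lecture 2) bestK.p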

186 Dec 29, 2022

Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

23 Sep 05, 2022

Unofficial PyTorch implementation of Google AI's VoiceFilter system

VoiceFilter Note from Seung-won (2020.10.25) Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-sour

881 Jan 03, 2023

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

79 Dec 27, 2022