Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Last update: Jan 05, 2023

Overview

Memory Efficient Attention Pytorch

Implementation of memory-efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory". In addition, the module handles masking, causal masking, and cross attention.
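
The core idea of the paper can be illustrated without any tensor library. The sketch below is not this package's code: it is a minimal pure-Python illustration for a single query vector, with hypothetical function names (`attention_full`, `attention_chunked`), and it leaves out the usual 1/sqrt(d) scaling, masking, and multiple heads. It shows that keys and values can be consumed one bucket at a time while keeping only a running maximum, a running softmax denominator, and a running weighted sum.

```python
import math

def attention_full(q, keys, values):
    """Reference version: every attention score is held in memory at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_chunked(q, keys, values, k_bucket_size):
    """Same result, but only k_bucket_size scores exist at any one time."""
    dim = len(values[0])
    running_max = -math.inf      # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    running_out = [0.0] * dim    # weighted sum of values, relative to running_max
    for start in range(0, len(keys), k_bucket_size):
        k_chunk = keys[start:start + k_bucket_size]
        v_chunk = values[start:start + k_bucket_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        running_out = [
            o * scale + sum(w * v[d] for w, v in zip(weights, v_chunk))
            for d, o in enumerate(running_out)
        ]
        running_max = new_max
    return [o / running_sum for o in running_out]
```

Both functions return the same vector up to floating-point rounding, whatever bucket size is chosen. Bucketing the queries as well only adds an outer loop over query chunks, since each query is processed independently.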

Install

$ pip install memory-efficient-attention-pytorch

Usage

For an autoregressive language model

import torch
from memory_efficient_attention_pytorch import Attention

attn = Attention(
    dim = 512,
    dim_head = 64,                # dimension per head
    heads = 8,                    # number of attention heads
    causal = True,                # autoregressive or not
    memory_efficient = True,      # whether to use memory efficient attention (can be turned off to test against normal attention)
    q_bucket_size = 1024,         # bucket size along queries dimension
    k_bucket_size = 2048          # bucket size along key / values dimension
).cuda()

x = torch.randn(1, 65536, 512).cuda()
out = attn(x) # (1, 65536, 512)
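
To see why the bucket sizes matter at this sequence length, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: float32 scores, all 8 heads processed together, and only the attention-score matrix counted (projections, outputs, and gradient checkpoints are ignored). The helper name `attention_matrix_bytes` is made up for this illustration, and the real peak memory depends on the implementation.

```python
def attention_matrix_bytes(q_len, k_len, heads=8, batch=1, bytes_per_element=4):
    """Bytes needed to hold one (batch, heads, q_len, k_len) score tensor."""
    return batch * heads * q_len * k_len * bytes_per_element

# Full attention over the whole 65536-token sequence.
full = attention_matrix_bytes(65536, 65536)

# One (q_bucket_size, k_bucket_size) block, as configured above.
bucket = attention_matrix_bytes(1024, 2048)

print(full / 2**30)    # 128.0 (GiB)
print(bucket / 2**20)  # 64.0 (MiB)
```

Under these assumptions the full score matrix would need 128 GiB, while a single bucket pair needs 64 MiB, which is why the example above can run on one GPU only with `memory_efficient = True`.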

Cross attention

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
mask = torch.ones(1, 65536).bool().cuda()

out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512)

Todo

Benchmark and see how much torch jit helps
Look at Triton and Keops and see if either can be a fit

Citations

@misc{rabe2021selfattention,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory}, 
    author  = {Markus N. Rabe and Charles Staats},
    year    = {2021},
    eprint  = {2112.05682},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

[feature request] Combining with flash attention?

There is a new algorithm that optimizes the qkv attention computation: https://github.com/HazyResearch/flash-attention (paper: https://arxiv.org/abs/2205.14135). Maybe you can look into integrating it with this.

opened by Vbansal21 15
i did this, we could build on top

Hi there!

It seems I already did some of the code: https://github.com/CHARM-Tx/linear_mem_attention_pytorch. Could we build on top of this? I talked to https://github.com/Chillee about an experimental functionality from functorch (https://github.com/pytorch/functorch) that would allow for increased speed (mainly I want to match JAX performance, but it's just difficult with PyTorch's imperative style).

I would love to collaborate on this if you want!

opened by hypnopump 5
Added dropout support to memory efficient variant

Hey Phil,

I have been using this repository for a project and I wanted to add dropout for completeness. I checked consistency with the perceiver-ar implementation. I hope this is helpful.

-Matt

opened by usryokousha 2
Making this work with relative position bias from XTransformers

Is there a way to make this work with RelativePositionBias? Currently this produces an attention bias of size $BHN^2$, where B is the batch size, H is the number of heads and N is the input size. Can this be chunked and computed per chunk?

opened by pfeatherstone 5
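
On the chunking question above, one possible direction is sketched below. This is not the x-transformers RelativePositionBias and not code from this repository. It assumes a simplified single-head bias table indexed by the clamped relative distance `j - i`, and the names `bias_block`, `table`, and `max_distance` are hypothetical. Because such a bias depends only on the relative distance, the block for any (query chunk, key chunk) pair can be generated on demand instead of materializing the full N × N matrix.

```python
def bias_block(table, q_start, q_end, k_start, k_end, max_distance):
    """Bias for query rows [q_start, q_end) and key columns [k_start, k_end).

    `table` holds 2 * max_distance + 1 values, one per clamped relative
    distance in [-max_distance, max_distance].
    """
    block = []
    for i in range(q_start, q_end):
        row = []
        for j in range(k_start, k_end):
            # Clamp the relative distance, then shift it to a valid index.
            rel = max(-max_distance, min(max_distance, j - i))
            row.append(table[rel + max_distance])
        block.append(row)
    return block

# Example: a 4 x 4 block taken from the middle of a longer sequence.
table = [0.1 * d for d in range(-3, 4)]   # 7 entries for max_distance = 3
block = bias_block(table, q_start=8, q_end=12, k_start=4, k_end=8, max_distance=3)
```

With this approach the bias held at any time is one block per head of size q_bucket_size × k_bucket_size, added to the scores of that bucket pair before the softmax accumulation.
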
save_for_backward can only save variables, but argument 5 is of type bool

Hi,

Thank you for your incredible work. I was trying to test your method specifically for cross-attention, but I get the error "save_for_backward can only save variables, but argument 5 is of type bool". I am not sure what I am doing wrong. I tried your own examples too but got the same error.

Can you please help me out?

Code:

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

(# out = sm_mod(inp1)) did this to avoid being a header
x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
(# mask = torch.ones(1, 65536).bool().cuda()) did this to avoid being a heading
out = cross_attn(x

ERROR:

  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/stars/user/abali/Phd_work/ISBI2023/X3D-Multigrid/CrossAttn_X3d_v2.py", line 872, in <module>
    out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512) print(out)
  File "/home/abali/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 215, in forward
    out = attn_fn(q, k, v, mask = mask, attn_bias = attn_bias, causal = self.causal, q_bucket_size = q_bucket_size, k_bucket_size = k_bucket_size)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 127, in memory_efficient_attention
    exp_weight_chunk, weighted_value_chunk, weight_max_chunk = summarize_qkv_fn(
  File "/home/abali/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
TypeError: save_for_backward can only save variables, but argument 5 is of type bool

opened by aliabid2243 1
Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/35559a05572f9d4eb982a8e2e399b40a2d61b85c/memory_efficient_attention_pytorch/memory_efficient_attention.py#L95

Should this be:

summarize_qkv_fn = summarize_qkv_chunk if needs_backwards else checkpointed_summarize_qkv_chunk

instead of:

summarize_qkv_fn = checkpointed_summarize_qkv_chunk if needs_backwards else summarize_qkv_chunk

opened by vrobot 0

Releases (0.1.1)

0.1.1 (Dec 30, 2022)
0.1.0 (Dec 30, 2022)
0.0.27 (Nov 1, 2022)
0.0.26 (Jul 23, 2022)
0.0.25 (Jul 23, 2022)
0.0.24 (Jul 23, 2022)
0.0.23 (Jul 23, 2022)
0.0.22 (Jul 23, 2022)
0.0.21 (Jul 23, 2022)
0.0.20 (Jul 23, 2022)
0.0.19 (Jul 23, 2022)
0.0.18 (Jul 23, 2022)
0.0.17 (Mar 22, 2022)
0.0.16 (Mar 21, 2022)
0.0.15 (Mar 13, 2022)
0.0.14 (Mar 4, 2022)
0.0.12 (Mar 4, 2022)
0.0.11 (Mar 4, 2022)
0.0.10 (Mar 4, 2022)
0.0.9 (Mar 4, 2022)
0.0.8 (Mar 4, 2022)
0.0.7 (Mar 4, 2022)
0.0.6 (Mar 4, 2022)
0.0.5 (Mar 4, 2022)
0.0.4 (Mar 4, 2022)
0.0.2 (Mar 4, 2022)
0.0.1 (Mar 3, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

