Memcpy kernel for flash attention #29

suquark · 2023-04-06T08:52:22Z

Memcpy kernel for flash attention

num_tokens: 64, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.008 ms
[Throughput] gather_cached_kv: 156.479 GB/s
num_tokens: 128, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.011 ms
[Throughput] gather_cached_kv: 216.171 GB/s
num_tokens: 256, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.032 ms
[Throughput] gather_cached_kv: 152.631 GB/s
num_tokens: 512, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.057 ms
[Throughput] gather_cached_kv: 172.325 GB/s
num_tokens: 1024, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.104 ms
[Throughput] gather_cached_kv: 187.537 GB/s
num_tokens: 2048, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.204 ms
[Throughput] gather_cached_kv: 191.603 GB/s

The performance is pretty good (the theoretical peak memory bandwidth of an A100-40GB is about 1.6 TB/s), considering that the memory layout is not ideal.
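
The throughput figures above can be cross-checked against the latencies. The following is a minimal sketch, not the benchmark's own code. It assumes (this is inferred from the numbers, not stated in the PR) that the benchmark counts the key tensor and the value tensor once each, and that "GB" in the log means 2**30 bytes. The helper names `kv_bytes` and `implied_latency_ms` are hypothetical.

```python
# Optimized-kernel results from the log: num_tokens -> (latency in ms, throughput in GB/s).
REPORTED = {
    64: (0.008, 156.479),
    128: (0.011, 216.171),
    256: (0.032, 152.631),
    512: (0.057, 172.325),
    1024: (0.104, 187.537),
    2048: (0.204, 191.603),
}

NUM_HEADS = 40
HEAD_SIZE = 128
BYTES_PER_ELEMENT = 2  # torch.float16
GIB = 2 ** 30          # assumption: "GB" in the log means 2**30 bytes


def kv_bytes(num_tokens):
    # Assumption: one key tensor plus one value tensor, each counted once.
    return 2 * num_tokens * NUM_HEADS * HEAD_SIZE * BYTES_PER_ELEMENT


def implied_latency_ms(num_tokens, throughput_gb_s):
    # Latency that the reported throughput implies, in milliseconds.
    return kv_bytes(num_tokens) / (throughput_gb_s * GIB) * 1e3


for n, (latency_ms, throughput) in REPORTED.items():
    print(n, latency_ms, round(implied_latency_ms(n, throughput), 3))
```

Under these assumptions every implied latency rounds to the reported one. With 10**9 bytes per GB the 2048-token row would imply about 0.219 ms instead of the reported 0.204 ms, which is why 2**30 is assumed here.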

Results for the unoptimized kernel:

num_tokens: 64, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.010 ms
[Throughput] gather_cached_kv: 125.891 GB/s
num_tokens: 128, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.015 ms
[Throughput] gather_cached_kv: 160.678 GB/s
num_tokens: 256, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.032 ms
[Throughput] gather_cached_kv: 150.732 GB/s
num_tokens: 512, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.060 ms
[Throughput] gather_cached_kv: 162.482 GB/s
num_tokens: 1024, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.108 ms
[Throughput] gather_cached_kv: 180.763 GB/s
num_tokens: 2048, num_heads: 40, head_size: 128, block_size: 8, num_blocks: 1024, dtype: torch.float16
[Latency] gather_cached_kv: 0.206 ms
[Throughput] gather_cached_kv: 189.757 GB/s

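The optimized kernel works much better for small numbers of tokens: throughput is roughly 24% higher at 64 tokens and 35% higher at 128 tokens. From 256 tokens up, the two kernels are within about 6% of each other.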
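
The per-size gain can be recomputed from the two result listings. This is a small sketch that uses only the throughput values reported in this PR; the function name `throughput_gain_percent` is hypothetical.

```python
# Throughput in GB/s as reported in the two listings, keyed by num_tokens.
OPTIMIZED = {
    64: 156.479, 128: 216.171, 256: 152.631,
    512: 172.325, 1024: 187.537, 2048: 191.603,
}
UNOPTIMIZED = {
    64: 125.891, 128: 160.678, 256: 150.732,
    512: 162.482, 1024: 180.763, 2048: 189.757,
}


def throughput_gain_percent(num_tokens):
    # Relative throughput improvement of the optimized kernel, in percent.
    return (OPTIMIZED[num_tokens] / UNOPTIMIZED[num_tokens] - 1.0) * 100.0


for n in OPTIMIZED:
    print(f"num_tokens={n}: {throughput_gain_percent(n):+.1f}%")
```

The gain is concentrated at 64 and 128 tokens. Note that these are single reported measurements, so small differences at the larger sizes may be within run-to-run noise.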

suquark · 2023-04-06T08:56:52Z

Implementation is done. It still needs testing (I will do that on Thursday).

The memory-saving strategy is orthogonal to this kernel, so I would not include it in this PR.

WoosukKwon · 2023-04-09T23:42:38Z

Hey @suquark thanks for the PR! I have a quick question: have you also measured the performance diff between the two kernels before and after the optimization?

suquark · 2023-04-11T00:45:02Z

See the PR comment for the performance comparison of the optimized kernel.

WoosukKwon

LGTM.

suquark changed the title ~~Memcpy for flashattn~~ Memcpy kernel for flashattn Apr 6, 2023

suquark changed the title ~~Memcpy kernel for flashattn~~ Memcpy kernel for flash attention Apr 6, 2023

suquark requested a review from WoosukKwon April 6, 2023 08:57

suquark force-pushed the memcpy4flashattn branch from 678bb06 to 07e9891 Compare April 8, 2023 20:40

optimize

e21845e

optimize with shared memory
better number of threads
update test
temp disable test
update

suquark force-pushed the memcpy4flashattn branch from 07e9891 to e21845e Compare April 8, 2023 20:45

suquark added 15 commits April 9, 2023 01:49

update

82fc4f4

update test

6057f9f

update

075b48a

update API

a615750

update

3f3991d

update

288240c

cleanup

127027d

add benchmark

096c04b

update

b8ac649

fix

278c4b0

fix

0e09cc8

optimization

72d7053

revert changes to benchmarks

c3a2e87

update

c5725d4

rename and assert

70b51aa

suquark closed this Apr 11, 2023

WoosukKwon reopened this Apr 11, 2023

WoosukKwon approved these changes Apr 11, 2023

suquark merged commit e3cec88 into main Apr 11, 2023

suquark deleted the memcpy4flashattn branch April 11, 2023 01:22

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Memcpy kernel for flash attention (vllm-project#29)

b4f0ef4

* optimize
* add benchmark
* add assert
* add test

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Apr 17, 2024

Merge pull request vllm-project#29 from ilya-lavrenov/update-optimum

f73cfd2

Update optimum-intel

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024

Use fork for worker multiprocessing method (vllm-project#29)

9499dce

It's faster

Signed-off-by: Nick Hill <[email protected]>

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024

Add high-level profiler (vllm-project#29)

7f7500b

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request Jun 12, 2024

Merge pull request vllm-project#29 from ROCm/charlifu/fp8_wo_upstream

ed31c00

Adding fp8 gemm computation

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Jun 21, 2024

Merge pull request vllm-project#29 from dtrifiro/sync-release-with-ibm

e392b03

sync release with IBM/release

ZHJ19970917 mentioned this pull request Jul 14, 2024

[Bug]: When using qwen-32b-chat-awq with multi-threaded access, errors occur after approximately several hundred visits.”vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.“ #6421

Closed

bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jul 31, 2024

Merge pull request vllm-project#29 from intel-sandbox/fix_linear_prep…

8b766f8

…ack_acc_bf16 fix linear init impacts on generation

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025

[Docs] Add official doc index (vllm-project#29)

51eadc6

Add official doc index. Move the release content to the right place.

Signed-off-by: wangxiyuan <[email protected]>

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Closed

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

simon-mo mentioned this pull request May 22, 2025

[BugFix] Re-enable Blackwell #18563

Closed

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open
