[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend #21549
base: main
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request refactors the KV cache layout from (2, num_blocks, ...) to (num_blocks, 2, ...) across various attention backends. The changes in flash_attn.py and flex_attention.py are consistent and correctly update the unbind operation to match the new layout. However, in triton_attn.py, the conditional logic to support both old and new layouts introduces a critical bug. When VLLM_V1_USE_PREFILL_DECODE_ATTENTION is enabled, get_kv_cache_shape returns a 5D tensor for the old layout path, which is incompatible with PagedAttention.split_kv_cache, which expects a 3D tensor. This will lead to a runtime error. A fix is proposed to return the correctly shaped 3D tensor for this path.
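For reference, a minimal standalone PyTorch sketch (sizes and variable names are purely illustrative, not taken from the vLLM code) of why the unbind dimension has to track the layout change, and of a side effect worth keeping in mind:

```python
import torch

# Arbitrary illustrative sizes; not taken from the PR or from vLLM defaults.
num_blocks, block_size, num_kv_heads, head_size = 8, 16, 4, 64

# Old layout: (2, num_blocks, block_size, num_kv_heads, head_size)
kv_old = torch.zeros(2, num_blocks, block_size, num_kv_heads, head_size)
key_cache_old, value_cache_old = kv_old.unbind(0)  # split along dim 0

# New layout: (num_blocks, 2, block_size, num_kv_heads, head_size)
kv_new = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_size)
key_cache_new, value_cache_new = kv_new.unbind(1)  # split along dim 1 instead

# The per-tensor shapes are identical either way...
assert key_cache_old.shape == key_cache_new.shape == (
    num_blocks, block_size, num_kv_heads, head_size)

# ...but under the new layout the unbound K/V tensors are strided views
# (K and V are interleaved block by block), not contiguous buffers.
print(key_cache_old.is_contiguous())  # True
print(key_cache_new.is_contiguous())  # False
```

Backends or kernels that assume the unbound K and V caches are contiguous may therefore need extra handling under the new layout.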
Branch updated from 51736a5 to 7811b1c (compare)
Signed-off-by: Thomas Parnell <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
I have debugged a few more issues with failing tests. I expect this next build to pass all CI tests.
We weren't so lucky with …
@NickLucche That error seems to be fixed on main; trying again now.
I don't understand why. Update: it appears to be running into CUDA OOM in the CI (which I think runs on L4 GPUs), which explains why the tests are passing locally for me. It does suggest a real problem though; this PR shouldn't change the memory usage. Will debug.
I tried running the same tests on an L4 GPU locally and they pass, so something strange is going on here.
We might have better luck this time. These tests have been a real nuisance lately. Anyway, to recap the situation: we're still waiting for #20189 to land to avoid breaking the llmd integration.
Yeah. I don't know what is up with the tests at the moment. I just wanted to see if this change works in principle. Let's make sure #20189 lands before we merge this one.
Some of these distributed errors look legit; will investigate.
I've gone through the (many) failing CI checks, but all of them look like things that have either been fixed on main in the last day or so, or are unrelated.
Signed-off-by: Thomas Parnell <[email protected]>
Have there been any performance checks for this change?
This pull request has merge conflicts that must be resolved before it can be merged.
Essential Elements of an Effective PR Description Checklist
- … supported_models.md and examples for a new model.

Purpose
This PR changes the layout of the FlashAttention backend so it matches the "FlashInfer layout" (num_blocks, 2, ...) rather than (2, num_blocks, ...). This makes memory management for hybrid models significantly easier (see the sketch below).

I tried to make the changes to the FlexAttention backend too but ran into some problems; perhaps we could address that separately. cc @LucasWilkinson
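To illustrate the memory-management point with a standalone PyTorch sketch (made-up sizes, not code from this PR): under the (num_blocks, 2, ...) layout, each block's K and V sit next to each other, so every block is one contiguous chunk of the pool, whereas under (2, num_blocks, ...) a block's K and V are separated by the entire K region.

```python
import torch

# Made-up sizes purely for illustration.
num_blocks, block_size, num_kv_heads, head_size = 8, 16, 4, 64

# New (FlashInfer-style) layout: block-major, K and V interleaved per block.
pool_new = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_size)
block = pool_new[3]                 # K and V for block 3 together
print(block.is_contiguous())        # True: each block is one contiguous chunk

# Old layout: K/V-major, so a block's K and V live in distant halves of the pool.
pool_old = torch.zeros(2, num_blocks, block_size, num_kv_heads, head_size)
block_old = pool_old[:, 3]          # K and V for block 3
print(block_old.is_contiguous())    # False: the two pieces are far apart
```

Having each block be a self-contained contiguous unit is presumably what simplifies carving a shared pool into differently shaped caches for hybrid models, per the motivation above.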
Triton backend changes are already addressed by #21197
I have tried to change the nixl logic accordingly, but could definitely use your eyes on it @NickLucche.
Test Plan
Test Result
(Optional) Documentation Update