[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend #21549
base: main
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request refactors the KV cache layout from (2, num_blocks, ...) to (num_blocks, 2, ...) across various attention backends. The changes in flash_attn.py and flex_attention.py are consistent and correctly update the unbind operation to match the new layout. However, in triton_attn.py, the conditional logic to support both old and new layouts introduces a critical bug. When VLLM_V1_USE_PREFILL_DECODE_ATTENTION is enabled, get_kv_cache_shape returns a 5D tensor for the old layout path, which is incompatible with PagedAttention.split_kv_cache, which expects a 3D tensor. This will lead to a runtime error. A fix is proposed to return the correctly shaped 3D tensor for this path.
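For reference, a minimal standalone PyTorch sketch (sizes and variable names are purely illustrative, not taken from the vLLM code) of why the unbind dimension has to track the layout change, and of a side effect worth keeping in mind:

```python
import torch

# Arbitrary illustrative sizes; not taken from the PR or from vLLM defaults.
num_blocks, block_size, num_kv_heads, head_size = 8, 16, 4, 64

# Old layout: (2, num_blocks, block_size, num_kv_heads, head_size)
kv_old = torch.zeros(2, num_blocks, block_size, num_kv_heads, head_size)
key_cache_old, value_cache_old = kv_old.unbind(0)  # split along dim 0

# New layout: (num_blocks, 2, block_size, num_kv_heads, head_size)
kv_new = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_size)
key_cache_new, value_cache_new = kv_new.unbind(1)  # split along dim 1 instead

# The per-tensor shapes are identical either way...
assert key_cache_old.shape == key_cache_new.shape == (
    num_blocks, block_size, num_kv_heads, head_size)

# ...but under the new layout the unbound K/V tensors are strided views
# (K and V are interleaved block by block), not contiguous buffers.
print(key_cache_old.is_contiguous())  # True
print(key_cache_new.is_contiguous())  # False
```

Backends or kernels that assume the unbound K and V caches are contiguous may therefore need extra handling under the new layout.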
Branch updated from 51736a5 to 7811b1c (compare)
Signed-off-by: Thomas Parnell <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
I have debugged a few more issues with failing tests. I expect this next build to pass all CI tests.
We weren't so lucky with …
@NickLucche That error seems to be fixed on main; trying again now.
I don't understand why. Update: it appears to be running into CUDA OOM in the CI (which I think runs on L4 GPUs), which explains why the tests are passing locally for me. It does suggest a real problem though; this PR shouldn't change the memory usage. Will debug.
I tried running the same tests on an L4 GPU locally and they pass, so something strange is going on here.
We might have better luck this time. These tests have been a real nuisance lately. Anyway, to recap the situation: we're still waiting for #20189 to land to avoid breaking the llmd integration.
Yeah. I don't know what is up with the tests at the moment. I just wanted to see if this change works in principle. Let's make sure #20189 lands before we merge this one.
Some of these distributed errors look legit; will investigate.
I've gone through the (many) failing CI checks, but all of them look like things that have either been fixed on main in the last day or so, or are unrelated.
Signed-off-by: Thomas Parnell <[email protected]>
Have there been any performance checks for this change?
This pull request has merge conflicts that must be resolved before it can be merged.
Essential Elements of an Effective PR Description Checklist
- … supported_models.md and examples for a new model.

Purpose
This PR changes the layout of the FlashAttention backend so it matches the "FlashInfer layout" (num_blocks, 2, ...) rather than (2, num_blocks, ...). This makes memory management for hybrid models significantly easier (see the sketch below).

I tried to make the changes to the FlexAttention backend too but ran into some problems; perhaps we could address that separately. cc @LucasWilkinson
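To illustrate the memory-management point with a standalone PyTorch sketch (made-up sizes, not code from this PR): under the (num_blocks, 2, ...) layout, each block's K and V sit next to each other, so every block is one contiguous chunk of the pool, whereas under (2, num_blocks, ...) a block's K and V are separated by the entire K region.

```python
import torch

# Made-up sizes purely for illustration.
num_blocks, block_size, num_kv_heads, head_size = 8, 16, 4, 64

# New (FlashInfer-style) layout: block-major, K and V interleaved per block.
pool_new = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_size)
block = pool_new[3]                 # K and V for block 3 together
print(block.is_contiguous())        # True: each block is one contiguous chunk

# Old layout: K/V-major, so a block's K and V live in distant halves of the pool.
pool_old = torch.zeros(2, num_blocks, block_size, num_kv_heads, head_size)
block_old = pool_old[:, 3]          # K and V for block 3
print(block_old.is_contiguous())    # False: the two pieces are far apart
```

Having each block be a self-contained contiguous unit is presumably what simplifies carving a shared pool into differently shaped caches for hybrid models, per the motivation above.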
Triton backend changes are already addressed by #21197
I have tried to change the nixl logic accordingly, but could definitely use your eyes on it @NickLucche.
Test Plan
Test Result
(Optional) Documentation Update