[Bugfix][Nixl] Fix full prefix cache hit bug #18632
Conversation
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: [email protected] <[email protected]>
@njhill - can you let me know if this works okay with multi-connector?
    # If remote_blocks and num_external_tokens = 0, we have
    # a full prefix cache hit on the D worker. We need to call
    # send_notif in _read_blocks to free the memory on the P.
    local_block_ids = (blocks.get_unhashed_block_ids()
                       if num_external_tokens > 0 else [])

    # Get unhashed blocks to pull from remote.
    self._reqs_need_recv[request.request_id] = (
-       request, blocks.get_unhashed_block_ids())
+       request, local_block_ids)
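For readers following along, the comment in the hunk above refers to the notification the decode (D) worker sends so the prefill (P) worker can free the blocks it reserved. The following is a minimal, self-contained sketch of that idea only; it is not the actual vLLM `_read_blocks` implementation, and `FakePrefillWorker`, `on_notif`, and `read_blocks` are hypothetical stand-ins used to show why the notification must go out even when there are zero blocks to pull.

```python
# Hypothetical, simplified sketch of the decode-side read path; not the real
# vLLM NixlConnector code. The actual block transfer is stubbed out to focus
# on the notification logic.
from dataclasses import dataclass, field


@dataclass
class FakePrefillWorker:
    """Hypothetical stand-in for the P worker's per-request block bookkeeping."""
    # Blocks the prefill (P) worker keeps alive per request until notified.
    held_blocks: dict[str, list[int]] = field(default_factory=dict)

    def on_notif(self, request_id: str) -> None:
        # This is the effect that send_notif ultimately has on the P side:
        # the blocks reserved for the request are released.
        self.held_blocks.pop(request_id, None)


def read_blocks(request_id: str, local_block_ids: list[int],
                prefill: FakePrefillWorker) -> None:
    """Pull remote blocks for a request, then notify the P worker.

    When local_block_ids is empty (full prefix cache hit on the D worker),
    there is nothing to transfer, but the notification still has to be sent
    so the P worker frees the blocks it reserved for this request.
    """
    if local_block_ids:
        pass  # the real connector would issue the NIXL transfer here
    prefill.on_notif(request_id)  # always notify, even with zero blocks


prefill = FakePrefillWorker(held_blocks={"req-1": [3, 4, 5]})
read_blocks("req-1", [], prefill)          # full prefix cache hit: nothing to pull
assert "req-1" not in prefill.held_blocks  # P-side blocks were still released
```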
@robertgshaw2-redhat I'm still not sure that this part or the change to always call `update_state_after_alloc` is needed. I'd already added logic for this case in `get_num_new_matched_tokens` above:
vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py, lines 215 to 222 in f203673:
    # NOTE: if count is 0 here, we have less than block_size
    # tokens to pull after subtracting the local prefix cache hit.
    # The remote only sends fully computed blocks, so there is
    # nothing to transfer but we still need to notify the
    # prefill worker so that the remote blocks are freed.
    if all(p in params for p in ("remote_engine_id", "remote_host",
                                 "remote_port")):
        self._reqs_need_recv[request.request_id] = (request, [])
I can see that the other two fixes below in `build_connector_meta` and `_read_blocks` are of course needed though.

If you think it's better to have this logic in this method then we can remove it from the other one. But again, I feel it's logically clearer to not call `update_state_after_alloc` if 0 was returned from `get_num_new_matched_tokens`.
I think that `get_num_new_matched_tokens` should be a pure function. Adding a side effect to it is surprising given the name of the method, and the fact that we will have different behavior depending on whether or not the request is able to be scheduled. This issue is actually causing a bug right now.

- If `allocate_slots` returns None, the request will remain in the waiting queue. This will cause us to add the request to `reqs_need_recv` more than once, and as a result we will call `read_blocks` twice, which will do a double free on the P worker side. Similarly, this will happen if the request is preempted (it will get re-added to waiting). This is because we are not properly updating the request to have `do_remote_prefill=False` when it is added to `reqs_need_recv` from the `get_num_new_matched_tokens` function.

This is all just evidence that putting a side effect into this function is not a good idea. The `update_state_after_alloc` is where we should handle everything related to `reqs_need_recv`, so we have a single place where all the logic is handled.
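A toy reproduction of the retry problem described in the bullet above (the scheduling loop, `try_schedule`, and `allocation_succeeds` are simplified, hypothetical stand-ins; none of this is vLLM scheduler code):

```python
# Toy model of the scheduling retry described above; not vLLM scheduler code.
# The point: if get_num_new_matched_tokens records the request as needing a
# recv (a side effect), a request whose allocation fails and which stays in
# the waiting queue gets recorded again on the next scheduling attempt,
# leading to a duplicate read/notify (double free on the P worker).
reqs_need_recv: dict[str, list[int]] = {}
recv_registrations: dict[str, int] = {}


def get_num_new_matched_tokens_with_side_effect(request_id: str) -> int:
    # Side-effecting variant: registers the request every time it is called.
    reqs_need_recv[request_id] = []
    recv_registrations[request_id] = recv_registrations.get(request_id, 0) + 1
    return 0


def try_schedule(request_id: str, allocation_succeeds: bool) -> bool:
    get_num_new_matched_tokens_with_side_effect(request_id)
    if not allocation_succeeds:
        return False  # allocate_slots returned None; request stays in waiting
    # update_state_after_alloc would run here on success.
    return True


try_schedule("req-1", allocation_succeeds=False)  # first attempt fails
try_schedule("req-1", allocation_succeeds=True)   # retried on a later step
# Registered twice -> _read_blocks would notify (and free) twice on the P side.
assert recv_registrations["req-1"] == 2
```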
I removed those lines from `get_num_new_matched_tokens`.
@robertgshaw2-redhat that makes sense, I agree about the pure function thing. I did also notice the fact that this could result in a double free on the P worker side in the case that it can't be scheduled, which isn't ideal (though I think would probably be harmless).

But to me, thinking from the pov of a generic connector interface, it still feels a bit odd given the connector isn't offering any tokens. I guess we should very clearly document the semantics and expectations for the interface.

A related quirk is that in the async load case, I think currently `update_state_after_alloc` will be called twice for a request (a second time once the request moves out of `WAITING_FOR_REMOTE_KVS`).
Signed-off-by: [email protected] <[email protected]>
    if count > 0:
        return count, True

    # NOTE: if count is 0 here, we have less than block_size
This is now handled in `update_state_after_alloc`.
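A condensed sketch of the resulting split in responsibilities (simplified signatures and token math; this is illustrative only, not the exact upstream `NixlConnectorScheduler` code):

```python
class SchedulerSideConnector:
    """Simplified sketch of the scheduler-side connector after this change."""

    def __init__(self) -> None:
        self._reqs_need_recv: dict[str, tuple[object, list[int]]] = {}

    def get_num_new_matched_tokens(self, request, num_computed_tokens: int):
        # Pure function: only reports how many tokens the remote can provide
        # (the real computation is more involved; this is illustrative only).
        count = max(len(request.prompt_token_ids) - num_computed_tokens, 0)
        if count > 0:
            return count, True  # tokens available, load them asynchronously
        return 0, False

    def update_state_after_alloc(self, request, blocks,
                                 num_external_tokens: int) -> None:
        # All _reqs_need_recv bookkeeping happens here, including the
        # zero-token case (full prefix cache hit on the D worker), so the
        # request is registered exactly once per successful allocation.
        local_block_ids = (blocks.get_unhashed_block_ids()
                           if num_external_tokens > 0 else [])
        self._reqs_need_recv[request.request_id] = (request, local_block_ids)
```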
@robertgshaw2-redhat changes will be needed for multi-connector too; I've pushed them to a branch, feel free to pull into this PR: njhill@4150a41
LGTM, with the multi-connector changes
…ix-cache-hit Signed-off-by: Nick Hill <[email protected]>
- Call `get_num_new_matched_tokens` for every connector
- Call `update_state_after_alloc` for every connector, but with no blocks/tokens for all but the "chosen" connector (the first one to return non-zero tokens from `get_num_new_matched_tokens`).

Signed-off-by: Nick Hill <[email protected]>
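To make the commit message above concrete, here is a hedged sketch of that delegation pattern; the class, the `_chosen` mapping, and `_EmptyBlocks` are simplified stand-ins, not the actual `MultiConnector` implementation.

```python
class _EmptyBlocks:
    """Placeholder passed to non-chosen connectors: no blocks to pull."""

    def get_unhashed_block_ids(self) -> list[int]:
        return []


EMPTY_BLOCKS = _EmptyBlocks()


class MultiConnectorSketch:
    def __init__(self, connectors: list) -> None:
        self._connectors = connectors
        # Which sub-connector "won" each request (first non-zero offer).
        self._chosen: dict[str, object] = {}

    def get_num_new_matched_tokens(self, request, num_computed_tokens: int):
        chosen_count, chosen_async = 0, False
        # Call every connector so each one sees the request...
        for c in self._connectors:
            count, load_async = c.get_num_new_matched_tokens(
                request, num_computed_tokens)
            # ...but only the first one to offer tokens is the "chosen" one.
            if count > 0 and request.request_id not in self._chosen:
                self._chosen[request.request_id] = c
                chosen_count, chosen_async = count, load_async
        return chosen_count, chosen_async

    def update_state_after_alloc(self, request, blocks,
                                 num_external_tokens: int) -> None:
        chosen = self._chosen.get(request.request_id)
        for c in self._connectors:
            if c is chosen:
                c.update_state_after_alloc(request, blocks,
                                           num_external_tokens)
            else:
                # Non-chosen connectors still get the callback, but with no
                # blocks/tokens, so they can do any cleanup they need (e.g.
                # the Nixl connector's send_notif to free P-side blocks).
                c.update_state_after_alloc(request, EMPTY_BLOCKS, 0)
```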
Signed-off-by: Juncheng Gu <[email protected]> Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
I just have one more thing to fix up in the multi-connector test now that the semantics have changed.
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…ix-cache-hit # Conflicts: # vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
/gemini review
Code Review
This pull request effectively addresses a memory leak in the Nixl connector related to full prefix cache hits and also fixes an issue with how cache hits were handled. The core of the fix involves ensuring `update_state_after_alloc` is consistently called across all relevant connectors, even when no external tokens are loaded, which allows for proper notification and resource cleanup on the P worker side. The changes in `MultiConnector` and `NixlConnectorScheduler` are key to this. Test cases have been updated appropriately to reflect these changes and the enhanced logging.

Overall, the changes look good and directly target the described issues. I have one point for clarification regarding a behavioral change in `MultiConnector.get_num_new_matched_tokens`.
Summary of Findings
- Memory Leak Fix in Nixl Connector: The primary goal of this PR, fixing a memory leak on the Nixl P-worker during full prefix cache hits on the D-worker, appears to be successfully addressed. The core changes ensure that `update_state_after_alloc` is called for all relevant connector components, allowing for proper notifications and resource cleanup.
- Behavioral Change in `MultiConnector.get_num_new_matched_tokens`: The `get_num_new_matched_tokens` method in `MultiConnector` now iterates through all sub-connectors, calling the method on each, even if a match was found earlier. Clarification on the necessity and impact of this change would be beneficial.
- Test Coverage and Logging: The tests in `test_multi_connector.py` have been updated to reflect the new logic and include more detailed event logging, which is good for verifying the fix and aiding future debugging.
Merge Readiness
The pull request seems to address the reported memory leak effectively. The changes are logical and the tests have been updated accordingly. There is one point regarding a behavioral change in `MultiConnector.get_num_new_matched_tokens` that would benefit from clarification. Assuming this behavior is intended and understood, the PR appears to be in good shape for merging after addressing or clarifying that point. As an AI, I am not authorized to approve pull requests; this assessment is based on the code review.
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…ix-cache-hit Signed-off-by: Nick Hill <[email protected]> # Conflicts: # vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>
SUMMARY: on a full prefix cache hit on the D worker, we never send the `send_notif` since we skip calling `update_state_after_alloc`, so the corresponding blocks are leaked on the P worker.