[Bug] Fix Long Context OOM Issue #25290
Conversation
Code Review
This pull request addresses a critical Out-Of-Memory (OOM) error for long-context inference with Multi-Layer Attention (MLA) by reducing the chunked_prefill_workspace_size. While the fix is correct in principle, I've identified a potential issue where the change could lead to an AssertionError with certain configurations, causing a crash. I've provided a suggestion to make the logic more robust and prevent this failure. Overall, a good fix for the OOM problem.
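For illustration only, here is a minimal sketch of the kind of guard such a review comment might be aiming at, assuming the assertion requires the workspace to cover at least max_num_seqs * block_size tokens; the function name, signature, and the exact lower bound are assumptions, not the reviewer's literal suggestion or vLLM's actual code:

```python
# Hypothetical sketch: cap the chunked-prefill workspace at 64k tokens, but never
# let it drop below the lower bound the downstream assertion is assumed to require
# (workspace >= max_num_seqs * block_size). Names are illustrative, not vLLM's.
def compute_chunked_prefill_workspace_size(
    max_model_len: int,
    max_num_seqs: int,
    block_size: int,
    cap_tokens: int = 64 * 1024,  # reduced from 128 * 1024 by this PR
) -> int:
    # Room for a few full-length requests, capped to limit memory use.
    workspace = min(8 * max_model_len, cap_tokens)
    # Guard: never go below what one scheduled batch may need, otherwise the
    # assumed assertion would fire for large max_num_seqs * block_size.
    workspace = max(workspace, max_num_seqs * block_size)
    return workspace


if __name__ == "__main__":
    # Example: with max_num_seqs=1024 and block_size=128, the raw cap (65536)
    # is below 1024 * 128 = 131072, so the guard keeps the workspace at 131072.
    print(compute_chunked_prefill_workspace_size(163_840, 1024, 128))
```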
LGTM; thanks!
On a DP=16 prefill B200 deepseek v3.1 config, where I should be able to handle 9 full-length context requests per DP rank, I'm now hitting the assertion at https://github.com/vllm-project/vllm/blame/273690a50ac2a5fa79fa7acc5077e49aa1af427e/vllm/v1/attention/backends/mla/common.py#L485.
With the reduction from 128k to 64k, I expected after this change to be able to start this config with a long deepseek v3 context, but instead it exits immediately. I also can't start with 65536 max tokens (not sure why).
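For context, assuming the assertion referenced above compares the workspace size against max_num_seqs * block_size (an assumption, not a quote of the code), the failure mode can be shown with simple arithmetic; the scheduler settings below are hypothetical, not the reporter's actual config:

```python
# Hypothetical settings for illustration; not the reporter's actual config.
workspace_size = 64 * 1024            # token cap after this PR
max_num_seqs = 1024                   # assumed max concurrent sequences
block_size = 128                      # assumed KV-cache block size

required = max_num_seqs * block_size  # 131072 tokens the batch may need
# The assumed check is `workspace_size >= required`; here it is False,
# so an AssertionError would fire at startup with such a config.
print(workspace_size >= required)     # False
```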
Purpose
Context from @smarterclayton
The main reason is that we allocated too much memory for MLA chunk padding; this PR fixes the issue.
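A rough back-of-the-envelope illustration of why the cap matters, using assumed MLA dimensions (192 QK head dim, 128 heads, 2-byte activations) that are not taken from this PR:

```python
# Assumed shapes for illustration only: 192 QK head dim, 128 heads, bf16 (2 bytes).
BYTES_PER_ELEM = 2
QK_HEAD_DIM, NUM_HEADS = 192, 128

for cap_tokens in (128 * 1024, 64 * 1024):
    # Size of an up-projected context chunk buffer at this token cap.
    size_bytes = cap_tokens * QK_HEAD_DIM * NUM_HEADS * BYTES_PER_ELEM
    print(f"{cap_tokens} tokens -> ~{size_bytes / 2**30:.1f} GiB")
# 131072 tokens -> ~6.0 GiB, 65536 tokens -> ~3.0 GiB: halving the workspace
# cap halves this padding-related allocation.
```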
Note: as the code comment already said, we should assign 64 * 1024 instead of 128 * 1024 here as well, so this PR also fixes the inconsistency between the comment and the code. **An OOM is still expected if the context length grows even further on limited GPU memory; in that case, consider adding tp or reducing --gpu-memory-utilization from 0.9 to a smaller value.**
Test
Now it is fixed.