[Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check #25046
Conversation
Signed-off-by: jiang1.li <[email protected]>
Code Review

This pull request introduces fixes for CPU-only execution. It correctly stubs out `torch.cuda.Stream` to prevent crashes in the CPU model runner and disables the unsupported Dual-Batch Overlap (DBO) feature on CPU. It also attempts to fix an issue with `torch.compile` by relaxing a contiguity check in the OneDNN matrix multiplication kernel. However, this change in the OneDNN kernel is likely to cause memory corruption, as it allows a non-contiguous output tensor without passing its memory layout (strides) to the underlying implementation. This is a critical issue that must be addressed.
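For context, here is a minimal sketch of the kind of CPU-side stub the summary describes; the `_CpuStreamStub` class and the module-patching approach are illustrative assumptions, not the PR's actual implementation:

```python
import torch

# A minimal sketch (not the PR's actual code): a no-op stand-in so code
# paths that construct or synchronize streams do not crash on a build
# without CUDA. The class name and patching strategy are assumptions.
class _CpuStreamStub:
    def __init__(self, *args, **kwargs):
        pass

    def synchronize(self):
        pass  # nothing to synchronize on CPU

    def wait_stream(self, stream):
        pass  # ordering is trivial with a single CPU "stream"

    def query(self):
        return True  # all "submitted" work is already complete


if not torch.cuda.is_available():
    # Replace torch.cuda.Stream only when CUDA is genuinely absent.
    torch.cuda.Stream = _CpuStreamStub
```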
The check under review in the OneDNN matmul kernel:

```diff
  TORCH_CHECK(a.dim() == 2);
  TORCH_CHECK(a.stride(-1) == 1);
- TORCH_CHECK(c.is_contiguous());
+ TORCH_CHECK(c.stride(-1) == 1);
```
Relaxing the check from `c.is_contiguous()` to `c.stride(-1) == 1` without providing the full tensor strides to the underlying OneDNN kernel is dangerous and will likely lead to memory corruption.

When `c` is not contiguous (e.g., it is a view of a larger tensor, which can be the case with `torch.compile`'s tensor reuse), its rows are not packed together in memory. The `MatMulPrimitiveHandler` receives `c.data_ptr()` but does not appear to receive the stride for `c`'s first dimension (unlike for tensor `a`, where `a.stride(0)` is passed via `exec_args`).

Without the stride information, the kernel will write output rows assuming a contiguous layout, overwriting memory that does not belong to `c`. This can cause silent data corruption and difficult-to-debug crashes.
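To illustrate why the relaxed check is insufficient, here is a small self-contained demonstration that a tensor can satisfy `stride(-1) == 1` while still not being contiguous:

```python
import torch

# A tensor can pass the relaxed check (innermost stride == 1) while
# still being non-contiguous: here, consecutive rows of `c` are 16
# elements apart in the underlying buffer, not 8.
buf = torch.arange(32, dtype=torch.float32).reshape(4, 8)
c = buf[::2]                  # every-other-row view, shape (2, 8)
assert c.stride(-1) == 1      # the relaxed TORCH_CHECK would pass
assert not c.is_contiguous()  # but the original check would not
print(c.stride())             # (16, 1)
```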
To fix this correctly, you must either:

- Pass the strides of `c` to `MatMulPrimitiveHandler` and ensure the OneDNN primitive is configured to use them. This would likely involve adding `c.stride(0)` to `MatMulPrimitiveHandler::ExecArgs`.
- If modifying the handler is not feasible, enforce contiguity instead: rather than relaxing the check, create a temporary contiguous tensor for the output and copy it back to `c` if `c` was not originally contiguous (see the sketch below).

Given the potential for silent memory corruption, this is a critical issue.
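A minimal sketch of the second option, with a hypothetical `kernel` callable standing in for the OneDNN entry point (assumed to write its output as if it were packed):

```python
import torch

def mm_with_contiguous_output(a, b, c, kernel):
    """Run `kernel(a, b, out)` safely when `kernel` assumes a packed output.

    `kernel` is a hypothetical stand-in for the OneDNN entry point.
    """
    # Allocate a packed scratch buffer only when `c` is strided.
    out = c if c.is_contiguous() else torch.empty(c.shape, dtype=c.dtype)
    kernel(a, b, out)
    if out is not c:
        c.copy_(out)  # scatter the packed result back into the strided view
    return c

# Example usage, with torch.mm standing in for the OneDNN kernel:
a = torch.randn(2, 3)
b = torch.randn(3, 8)
c = torch.empty(4, 8)[::2]  # non-contiguous output view
mm_with_contiguous_output(a, b, c, lambda a, b, out: torch.mm(a, b, out=out))
```

The trade-off is an extra allocation and copy on the non-contiguous path, but it preserves correctness without touching the handler's ABI.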
Purpose

- Stub `torch.cuda.Stream` to avoid breaking the CPU backend

Test Plan

CI tests

Test Result

Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.