[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner #10218
Signed-off-by: Isotr0py <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Can you take a look at why the CI still passed before, and fix it so that failing tests actually fail the CI?
This is because the CPU CI is configured to soft-fail.
The CI didn't even soft-fail here: https://buildkite.com/vllm/ci-aws/builds/11064#01931a18-e48d-485d-b357-f5f995bc474f
Perhaps make
I added an intentionally failing test to the CPU test pipeline. Let's see if the CI catches it after adding
Added the ready label to trigger the Intel CPU tests.
Seems that adding
Classification model tests are failing now: https://buildkite.com/vllm/ci-aws/builds/11116#01932014-a8a8-4bbf-91a0-6ba08aa7cde8
It looks like the vLLM output is wrong.
Hmm, this is odd, because I can't reproduce it with
Update: I was able to reproduce this once after several runs.
It seems that the test failures only occur after several runs:
Does the failure occur randomly?
Yes, and it seems that the Intel CPU test is now passing in some newer PRs: https://buildkite.com/vllm/ci-aws/builds/11137#0193231d-132d-40a9-9bfe-dfa5a1f05da0
This is odd... can you try setting
Setting
Maybe there's something wrong with the softmax? It's really strange that the output is exactly 0 and 1...
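For reference, here is a minimal sketch in plain Python (not the vLLM code path) of how a correct softmax can still return exactly 0 and 1. If one logit is vastly larger than the other, the smaller exponential underflows to zero. The logit values are made up for illustration and are not taken from the failing run.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Ordinary logits give probabilities strictly between 0 and 1.
normal = softmax([0.3, -0.2])

# Hypothetical logits where the first entry is enormous: exp(-10000.2)
# underflows to 0.0, so the result saturates to exactly 1 and 0.
saturated = softmax([1.0e4, -0.2])

print(normal)
print(saturated)
```

So an output of exactly 0 and 1 is consistent with a softmax that works as intended but receives a badly scaled input, which would point at the layer producing the logits rather than at the softmax itself.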
I suspect this is an issue with the test itself, because only this test fails randomly, and it seems that running
Perhaps it's because of the runner order? We run hf_runner before vllm_runner in this test.
Let's try swapping the order.
It seems that a suspicious overflow occurs in the score layer when the test fails. Here is the logits tensor from a failing run:
The overflow occurred only in the first column, while the last column is normal. Note that the
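A per-column check like the one described above can be sketched in plain Python as follows. The helper name `suspicious_columns`, the `limit` threshold, and the example values are all hypothetical. The original tensor is not reproduced here.

```python
import math


def suspicious_columns(logits, limit=1.0e4):
    """Return indices of columns holding a non-finite or very large value.

    `limit` is an arbitrary illustrative threshold, not a vLLM setting.
    """
    n_cols = len(logits[0])
    flagged = []
    for col in range(n_cols):
        values = [row[col] for row in logits]
        if any(not math.isfinite(v) or abs(v) > limit for v in values):
            flagged.append(col)
    return flagged


# Hypothetical logits: the first column has overflowed, the last is normal.
example = [
    [3.4e38, 0.12],
    [float("inf"), -0.40],
    [-2.9e38, 0.05],
]

flagged = suspicious_columns(example)
print(flagged)
```

Running a check like this on the score layer's output in both passing and failing runs would show whether the overflow is confined to one output column.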