[Hotfix][Pixtral] Fix multiple images bugs #8415
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀
Would be great if you could add a test to avoid similar regressions!
Run `pytest tests/models/test_mistral.py`.
"""
import uuid
Add many more aggressive tests
The tests fail to pass when I run them locally.
Hmm I see, what device do you run them on? Tested on H100 - for me they are passing.
Guess it's going to be quite difficult to get exactly the same results across different devices, given that flash attention is not that deterministic an operation - any tips on how to deal with this in the tests?
I'm running the test on a single L40. (I can't fit ...)
Ok yes, that doesn't surprise me. From your test failure it seems like the first test case passes but then the longer / more complicated tests fail. We could add device-dependent expected values? Not sure. Wdyt?
I think the output is still reasonable. Since the goal of this PR is to avoid crashing the model rather than having perfect output, we can reduce the number of tokens to match for now (at least until we have an HF version to test against). For the test to be able to run in CI, we may need to split the model via tensor parallel. This is more complicated so I wouldn't enforce that in this PR.
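For reference, a minimal sketch of what splitting the model via tensor parallel could look like; the model name and the choice of two GPUs are assumptions for illustration, not what the PR prescribes:

```python
from vllm import LLM

# Hypothetical setup: shard the model across two GPUs so the test can run
# on smaller devices; tensor_parallel_size=2 is an assumed value.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    tensor_parallel_size=2,
)
```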
I'll comment out the more extreme tests and for now will just run them locally when needed.
Wrapped the more extreme tests in a ...
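For illustration, one way such heavier cases could be gated so they only run on large-memory GPUs; the helper name, threshold, and test name below are assumptions, not necessarily what the PR uses:

```python
import pytest
import torch


def _total_gpu_memory_gib() -> float:
    """Return total memory of GPU 0 in GiB, or 0 if no GPU is visible."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


# Hypothetical gate: skip the heavier multi-image cases on GPUs below ~70 GiB.
@pytest.mark.skipif(_total_gpu_memory_gib() < 70,
                    reason="multi-image Pixtral tests need a large GPU")
def test_pixtral_many_images():
    ...
```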
Imo that's too device-specific. If you're able to extract the logprobs information, it would be better to check against the golden output via ...
As for the OOM issue, we can address that via TP in another PR.
Hmm, what is the golden output for you then? Think vLLM should represent the official implementation here.
You can use your H100 output as the golden one. Checking the logprobs is less strict, so the test should still pass on other devices.
Do you check that logprobs are within a range? If the output is different for temperature=0.0, the logprobs also quite certainly won't match, no? In my experience it's quite difficult to get outputs to exactly match across different devices except for small inputs. I can try to extract the logprobs, but they also won't match between L40 and H100 I'm afraid.
We just check that for each token outputted by vLLM, the selected token is within the top-k logprobs of the reference (golden) output. You can use the ...
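For context, a minimal sketch of that style of check; the function name and data layout are illustrative, not the actual vLLM test utility:

```python
from typing import Dict, List, Tuple

# One (sampled_token_id, {token_id: logprob}) pair per generated position.
TokenLogprobs = List[Tuple[int, Dict[int, float]]]


def assert_tokens_within_topk(golden: TokenLogprobs,
                              candidate: TokenLogprobs) -> None:
    """Pass as long as each token picked by one run appears in the other
    run's top-k logprobs at the same position, rather than requiring an
    exact token-for-token match."""
    for pos, ((gold_tok, gold_top), (cand_tok, cand_top)) in enumerate(
            zip(golden, candidate)):
        assert cand_tok in gold_top, (
            f"pos {pos}: candidate token {cand_tok} not in golden top-k "
            f"{list(gold_top)}")
        assert gold_tok in cand_top, (
            f"pos {pos}: golden token {gold_tok} not in candidate top-k "
            f"{list(cand_top)}")
```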
Gotcha!
Actually sorry, even this won't work because errors accumulate, the context changes, and the output ends up very different. E.g. for the second example I can get:

Gold output:
...

and L40:
...

Here the first differing word is ... Thoughts?
Hmm, from my understanding ...
…n/vllm into fix_init_pixtral
I've run the new test on H100 and it all passed for me too, so I'm giving this a green light. As for the refactor work on the test so that it can run on the L40 machines on our CI, let's do that in a later PR given our timeline for the patch release.
Thanks for fixing!
Just added the logprobs tests as explained by @DarkLight1337 - think it indeed makes more sense! Would be great if it also passes on an L40. Thanks for the reviews!
Signed-off-by: Alvant <[email protected]>
Signed-off-by: Amit Garg <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
FIX #8382
FIX #8411
This PR fixes a couple of bugs that arise from chunked prefill and previously incorrect image processing.
It makes sure that all images are pre-processed correctly and adds a set of more aggressive tests.
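For context, a rough sketch of the multiple-image scenario this PR is meant to keep from crashing; the model name, image URLs, and per-prompt image limit are placeholders following the public Pixtral examples, not values taken from this PR:

```python
from vllm import LLM, SamplingParams

# Placeholder setup: allow several images in a single prompt.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    limit_mm_per_prompt={"image": 4},
)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the differences between these images."},
        {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/b.png"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```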