
Conversation

patrickvonplaten
Collaborator

@patrickvonplaten patrickvonplaten commented Sep 12, 2024


FIX #8382
FIX #8411

This PR fixes a couple of bugs that arise from using chunked prefill and from previously incorrect image processing.

This PR makes sure that all images are pre-processed correctly and adds a bunch of aggressive tests.
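
For context, below is a minimal sketch of the multi-image chat path that the new tests exercise. The image URLs, prompt text, and per-prompt image limit are illustrative placeholders, not values taken from this PR.

```python
# Minimal sketch of a multi-image Pixtral request via the offline LLM.chat API.
# Image URLs, prompt text, and the per-prompt image limit are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 4},  # allow multiple images in one prompt
)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe each image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/mountains.jpg"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```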


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337
Member

Would be great if you could add a test to avoid similar regressions!

Collaborator Author


Add many more aggressive tests

@patrickvonplaten patrickvonplaten changed the title Fix Pixtral init [Pixtral] Fix multiple images bugs Sep 12, 2024
@patrickvonplaten patrickvonplaten changed the title [Pixtral] Fix multiple images bugs [Hotfix][Pixtral] Fix multiple images bugs Sep 12, 2024
@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

The tests fail when I run them locally.

```
_______________________ test_chat[bfloat16-8192-mistralai/Pixtral-12B-2409] _______________________

vllm_runner = <class 'tests.conftest.VllmRunner'>, max_model_len = 8192, model = 'mistralai/Pixtral-12B-2409', dtype = 'bfloat16'

    @pytest.mark.parametrize("model", MODELS)
    @pytest.mark.parametrize("max_model_len", [8192, 65536])
    @pytest.mark.parametrize("dtype", ["bfloat16"])
    def test_chat(
        vllm_runner,
        max_model_len: int,
        model: str,
        dtype: str,
    ) -> None:
    
        with vllm_runner(model,
                         dtype=dtype,
                         tokenizer_mode="mistral",
                         enable_chunked_prefill=False,
                         max_model_len=max_model_len,
                         limit_mm_per_prompt=LIMIT_MM_PER_PROMPT) as vllm_model:
            results = []
            for msg in MSGS:
                outputs = vllm_model.model.chat(msg,
                                                sampling_params=SAMPLING_PARAMS)
    
                results.append(outputs[0].outputs[0].text)
    
>           assert results == EXPECTED
E           AssertionError: assert ['The image s... green park.'] == ['The image s... green park.']
E             
E             At index 1 diff: '1. A black dog with a curious expression sits on a wooden floor.\n2. A vast mountain range stretches across the horizon under a cloudy sky.' != '1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range with rugged peaks stretches under a cloudy sky.'
E             Use -v to get more diff

tests/models/test_pixtral.py:114: AssertionError
```

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

@patrickvonplaten
Collaborator Author

patrickvonplaten commented Sep 12, 2024

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

I guess it's going to be quite difficult to get exactly the same results here across different devices, given that flash attention is not a very deterministic operation - any tips on how to deal with this in the tests?

@DarkLight1337
Member

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

Ok, that doesn't surprise me. From your test failure it looks like the first test case passes but the longer / more complicated test fails. We could add device-dependent expected values, roughly as sketched below? Not sure. Wdyt?
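
This approach was not ultimately adopted, but for illustration a device-dependent lookup could look roughly like the sketch below; the constants and helper name are hypothetical, not part of this PR.

```python
# Hypothetical sketch of device-dependent expected values; the constants and
# helper name are made up for illustration and are not part of this PR.
import pytest
import torch

EXPECTED_H100 = ["...reference strings recorded on H100..."]
EXPECTED_L40 = ["...reference strings recorded on L40..."]

def expected_for_current_device():
    device_name = torch.cuda.get_device_name(0)
    if "H100" in device_name:
        return EXPECTED_H100
    if "L40" in device_name:
        return EXPECTED_L40
    pytest.skip(f"No reference outputs recorded for {device_name}")
```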

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

I think the output is still reasonable. Since the goal of this PR is to avoid crashing the model rather than producing perfect output, we can reduce the number of tokens to match for now (at least until we have a HF version to test against).

For the test to be able to run in CI, we may need to split the model via tensor parallelism. This is more complicated, so I wouldn't enforce that in this PR.
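
As a rough sketch of what that could look like (not part of this PR), the runner would be given a tensor-parallel size so the weights are sharded across the CI GPUs; whether the runners actually have two GPUs available is an assumption here.

```python
# Hedged sketch only: shard the model across two GPUs so the test fits on
# smaller cards. tensor_parallel_size is vLLM's standard argument for this.
with vllm_runner(model,
                 dtype=dtype,
                 tokenizer_mode="mistral",
                 tensor_parallel_size=2,
                 max_model_len=max_model_len,
                 limit_mm_per_prompt=LIMIT_MM_PER_PROMPT) as vllm_model:
    outputs = vllm_model.model.chat(MSGS[0], sampling_params=SAMPLING_PARAMS)
```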

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

> I think the output is still reasonable. Since the goal of this PR is to avoid crashing the model rather than producing perfect output, we can reduce the number of tokens to match for now (at least until we have a HF version to test against).

I'll comment out the more extreme tests and for now just run them locally when needed.

@patrickvonplaten
Collaborator Author

Wrapped the more extreme tests in an is_h100 wrapper - does that work?
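
Roughly, such a guard could be implemented as a skip marker keyed on the GPU name; the helper below is a sketch and may differ from what the PR actually does.

```python
# Sketch of an "is_h100" guard; the helper name and exact check are
# assumptions and may differ from the PR's actual implementation.
import pytest
import torch

def is_h100() -> bool:
    return torch.cuda.is_available() and "H100" in torch.cuda.get_device_name(0)

@pytest.mark.skipif(not is_h100(),
                    reason="Reference outputs were recorded on an H100")
def test_chat_large_context(vllm_runner):
    ...  # the heavier multi-image / long-context cases would go here
```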

@DarkLight1337
Member

Imo that's too device-specific. If you're able to extract the logprobs information, it would be better to check against the golden output via check_logprobs_close.

@DarkLight1337
Member

As for the OOM issue, we can address that via TP in another PR.

@patrickvonplaten
Collaborator Author

> Imo that's too device-specific. If you're able to extract the logprobs information, it would be better to check against the golden output via check_logprobs_close.

Hmm, what is the golden output for you then? I think vLLM should represent the official implementation here.

@DarkLight1337
Member

You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

@patrickvonplaten
Collaborator Author

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.
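
For reference, usage looks roughly like the sketch below; the generate_greedy_logprobs helper, the argument names, and the EXPECTED_LOGPROBS fixture are recalled from vLLM's test utilities and may differ slightly from what is actually in tests/.

```python
# Hedged sketch of the suggested logprobs comparison. The helper names and
# argument names may be slightly off; EXPECTED_LOGPROBS is a hypothetical
# fixture of reference logprobs recorded on the H100.
from tests.models.utils import check_logprobs_close

outputs = vllm_model.generate_greedy_logprobs(prompts, max_tokens=128, num_logprobs=5)

check_logprobs_close(
    outputs_0_lst=EXPECTED_LOGPROBS,  # golden output recorded on H100
    outputs_1_lst=outputs,            # output from the device under test
    name_0="h100_reference",
    name_1="current_device",
)
```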

@patrickvonplaten
Collaborator Author

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

> We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.

Gotcha!

@patrickvonplaten
Collaborator Author

patrickvonplaten commented Sep 12, 2024

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

> We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.

> Gotcha!

Actually, sorry, even this won't work, because errors accumulate, the context changes, and the output ends up very different. E.g. for the second example I can get:

Golden output:

"1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range with rugged peaks stretches under a cloudy sky.",

and

L40:

"1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range stretches across the horizon under a cloudy sky."

Here the first differing word is "with" vs. "stretches". "stretches" will be within the top-k logprobs of "with", but the following words will not be. Currently, each of the two tests runs a simple case on every device where the results match; only the more difficult cases are restricted to H100, so two tests still run on an L40.

Thoughts?

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

Hmm, from my understanding, check_logprobs_close compares the logprobs only at the first mismatch and then exits early, so the test passes if those logprobs are consistent enough. The remaining tokens are skipped and thus should not fail the test.
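
In other words, the comparison stops at the first divergence. A minimal illustration of that behaviour (not vLLM's actual code) would be:

```python
# Minimal illustration of the described behaviour, not vLLM's actual code:
# walk the sequences token by token, and at the first position where they
# differ only require the test token to be in the reference's top-k logprobs.
def check_close(ref_ids, ref_topk_logprobs, test_ids):
    for pos, (ref_id, test_id) in enumerate(zip(ref_ids, test_ids)):
        if ref_id == test_id:
            continue
        # First divergence: the test token must still be a plausible choice.
        assert test_id in ref_topk_logprobs[pos], (
            f"token {test_id} at position {pos} not in reference top-k")
        # Later tokens are conditioned on a different prefix, so skip them.
        break
```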

@simon-mo simon-mo mentioned this pull request Sep 12, 2024
Member

@ywang96 ywang96 left a comment


I've run the new test on H100 and it all passed for me too, so I'm giving this a green light. As for the refactor work on the test so that it can run on the L40 machines on our CI, let's do that in a later PR given our timeline for the patch release.

Thanks for fixing!

@patrickvonplaten
Collaborator Author

Just added the logprobs tests as explained by @DarkLight1337 - I think it indeed makes more sense! It would be great if it also passes on an L40. Thanks for the reviews!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 12, 2024
@ywang96 ywang96 merged commit d31174a into vllm-project:main Sep 12, 2024
60 checks passed
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: Pixtral inference not working correctly with LLMEngine/AsyncEngine
[Bug]: Pixtral fails when limit_mm_per_prompt not set
