
Conversation

patrickvonplaten
Collaborator

@patrickvonplaten patrickvonplaten commented Sep 12, 2024


FIX #8382
FIX #8411

This PR fixes a couple of bugs that arise from using chunked prefill and from previously incorrect image processing.

This PR makes sure that all images are pre-processed correctly and adds a bunch of aggressive tests.
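
For context, below is a minimal sketch of the multi-image chat path that the new tests exercise. The image URLs, prompt text, and per-prompt image limit are illustrative placeholders, not values taken from this PR.

```python
# Minimal sketch of a multi-image Pixtral request via the offline LLM.chat API.
# Image URLs, prompt text, and the per-prompt image limit are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 4},  # allow multiple images in one prompt
)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe each image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/mountains.jpg"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```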


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337
Member

Would be great if you could add a test to avoid similar regressions!

Collaborator Author


Add many more aggressive tests

@patrickvonplaten patrickvonplaten changed the title Fix Pixtral init [Pixtral] Fix multiple images bugs Sep 12, 2024
@patrickvonplaten patrickvonplaten changed the title [Pixtral] Fix multiple images bugs [Hotfix][Pixtral] Fix multiple images bugs Sep 12, 2024
@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

The tests fail when I run them locally.

```
_______________________ test_chat[bfloat16-8192-mistralai/Pixtral-12B-2409] _______________________

vllm_runner = <class 'tests.conftest.VllmRunner'>, max_model_len = 8192, model = 'mistralai/Pixtral-12B-2409', dtype = 'bfloat16'

    @pytest.mark.parametrize("model", MODELS)
    @pytest.mark.parametrize("max_model_len", [8192, 65536])
    @pytest.mark.parametrize("dtype", ["bfloat16"])
    def test_chat(
        vllm_runner,
        max_model_len: int,
        model: str,
        dtype: str,
    ) -> None:
    
        with vllm_runner(model,
                         dtype=dtype,
                         tokenizer_mode="mistral",
                         enable_chunked_prefill=False,
                         max_model_len=max_model_len,
                         limit_mm_per_prompt=LIMIT_MM_PER_PROMPT) as vllm_model:
            results = []
            for msg in MSGS:
                outputs = vllm_model.model.chat(msg,
                                                sampling_params=SAMPLING_PARAMS)
    
                results.append(outputs[0].outputs[0].text)
    
>           assert results == EXPECTED
E           AssertionError: assert ['The image s... green park.'] == ['The image s... green park.']
E             
E             At index 1 diff: '1. A black dog with a curious expression sits on a wooden floor.\n2. A vast mountain range stretches across the horizon under a cloudy sky.' != '1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range with rugged peaks stretches under a cloudy sky.'
E             Use -v to get more diff

tests/models/test_pixtral.py:114: AssertionError
```

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

@patrickvonplaten
Collaborator Author

patrickvonplaten commented Sep 12, 2024

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

I guess it's going to be quite difficult to get exactly the same results here across different devices, given that flash attention is not a very deterministic operation - any tips on how to deal with this in the tests?

@DarkLight1337
Member

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

Ok, that doesn't surprise me. From your test failure it looks like the first test case passes but the longer / more complicated test fails. We could add device-dependent expected values, roughly as sketched below? Not sure. Wdyt?
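
This approach was not ultimately adopted, but for illustration a device-dependent lookup could look roughly like the sketch below; the constants and helper name are hypothetical, not part of this PR.

```python
# Hypothetical sketch of device-dependent expected values; the constants and
# helper name are made up for illustration and are not part of this PR.
import pytest
import torch

EXPECTED_H100 = ["...reference strings recorded on H100..."]
EXPECTED_L40 = ["...reference strings recorded on L40..."]

def expected_for_current_device():
    device_name = torch.cuda.get_device_name(0)
    if "H100" in device_name:
        return EXPECTED_H100
    if "L40" in device_name:
        return EXPECTED_L40
    pytest.skip(f"No reference outputs recorded for {device_name}")
```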

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

I think the output is still reasonable. Since the goal of this PR is to avoid crashing the model rather than producing perfect output, we can reduce the number of tokens to match for now (at least until we have a HF version to test against).

For the test to be able to run in CI, we may need to split the model via tensor parallelism. This is more complicated, so I wouldn't enforce that in this PR.
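
As a rough sketch of what that could look like (not part of this PR), the runner would be given a tensor-parallel size so the weights are sharded across the CI GPUs; whether the runners actually have two GPUs available is an assumption here.

```python
# Hedged sketch only: shard the model across two GPUs so the test fits on
# smaller cards. tensor_parallel_size is vLLM's standard argument for this.
with vllm_runner(model,
                 dtype=dtype,
                 tokenizer_mode="mistral",
                 tensor_parallel_size=2,
                 max_model_len=max_model_len,
                 limit_mm_per_prompt=LIMIT_MM_PER_PROMPT) as vllm_model:
    outputs = vllm_model.model.chat(MSGS[0], sampling_params=SAMPLING_PARAMS)
```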

@patrickvonplaten
Collaborator Author

> The tests fail when I run them locally.

> Hmm, I see. What device do you run them on? I tested on an H100, and for me they pass.

> I'm running the test on a single L40. (I can't fit max_model_len=65536, so I can only run the smaller tests.)

> I think the output is still reasonable. Since the goal of this PR is to avoid crashing the model rather than producing perfect output, we can reduce the number of tokens to match for now (at least until we have a HF version to test against).

I'll comment out the more extreme tests and for now just run them locally when needed.

@patrickvonplaten
Collaborator Author

Wrapped the more extreme tests in an is_h100 wrapper - does that work?
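
Roughly, such a guard could be implemented as a skip marker keyed on the GPU name; the helper below is a sketch and may differ from what the PR actually does.

```python
# Sketch of an "is_h100" guard; the helper name and exact check are
# assumptions and may differ from the PR's actual implementation.
import pytest
import torch

def is_h100() -> bool:
    return torch.cuda.is_available() and "H100" in torch.cuda.get_device_name(0)

@pytest.mark.skipif(not is_h100(),
                    reason="Reference outputs were recorded on an H100")
def test_chat_large_context(vllm_runner):
    ...  # the heavier multi-image / long-context cases would go here
```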

@DarkLight1337
Member

Imo that's too device-specific. If you're able to extract the logprobs information, it would be better to check against the golden output via check_logprobs_close.

@DarkLight1337
Member

As for the OOM issue, we can address that via TP in another PR.

@patrickvonplaten
Collaborator Author

> Imo that's too device-specific. If you're able to extract the logprobs information, it would be better to check against the golden output via check_logprobs_close.

Hmm, what is the golden output for you then? I think vLLM should represent the official implementation here.

@DarkLight1337
Member

You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

@patrickvonplaten
Collaborator Author

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.
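
For reference, usage looks roughly like the sketch below; the generate_greedy_logprobs helper, the argument names, and the EXPECTED_LOGPROBS fixture are recalled from vLLM's test utilities and may differ slightly from what is actually in tests/.

```python
# Hedged sketch of the suggested logprobs comparison. The helper names and
# argument names may be slightly off; EXPECTED_LOGPROBS is a hypothetical
# fixture of reference logprobs recorded on the H100.
from tests.models.utils import check_logprobs_close

outputs = vllm_model.generate_greedy_logprobs(prompts, max_tokens=128, num_logprobs=5)

check_logprobs_close(
    outputs_0_lst=EXPECTED_LOGPROBS,  # golden output recorded on H100
    outputs_1_lst=outputs,            # output from the device under test
    name_0="h100_reference",
    name_1="current_device",
)
```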

@patrickvonplaten
Collaborator Author

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

> We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.

Gotcha!

@patrickvonplaten
Collaborator Author

patrickvonplaten commented Sep 12, 2024

> You can use your H100 output as the golden one. Checking the logprobs is less strict so the test should still pass on other devices.

> Do you check that the logprobs are within a range? If the output differs at temperature=0.0, the logprobs almost certainly won't match either, no? In my experience it's quite difficult to get outputs to match exactly across different devices, except for small inputs. I can try to extract the logprobs, but I'm afraid they also won't match between L40 and H100.

> We just check that each token output by vLLM is within the top-k logprobs of the reference (golden) output. You can use the check_logprobs_close utility function for this.

> Gotcha!

Actually, sorry, even this won't work, because errors accumulate, the context changes, and the output ends up very different. E.g. for the second example I can get:

Golden output:

"1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range with rugged peaks stretches under a cloudy sky.",

and

L40:

"1. A black dog with floppy ears sits attentively on a wooden surface.\n2. A vast mountain range stretches across the horizon under a cloudy sky."

Here the first differing word is "with" vs. "stretches". "stretches" will be within the top-k logprobs of "with", but the following words will not be. Currently, each of the two tests runs a simple case on every device where the results match; only the more difficult cases are restricted to H100, so two tests still run on an L40.

Thoughts?

@DarkLight1337
Member

DarkLight1337 commented Sep 12, 2024

Hmm, from my understanding, check_logprobs_close compares the logprobs only at the first mismatch and then exits early, so the test passes if those logprobs are consistent enough. The remaining tokens are skipped and thus should not fail the test.
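
In other words, the comparison stops at the first divergence. A minimal illustration of that behaviour (not vLLM's actual code) would be:

```python
# Minimal illustration of the described behaviour, not vLLM's actual code:
# walk the sequences token by token, and at the first position where they
# differ only require the test token to be in the reference's top-k logprobs.
def check_close(ref_ids, ref_topk_logprobs, test_ids):
    for pos, (ref_id, test_id) in enumerate(zip(ref_ids, test_ids)):
        if ref_id == test_id:
            continue
        # First divergence: the test token must still be a plausible choice.
        assert test_id in ref_topk_logprobs[pos], (
            f"token {test_id} at position {pos} not in reference top-k")
        # Later tokens are conditioned on a different prefix, so skip them.
        break
```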

@simon-mo simon-mo mentioned this pull request Sep 12, 2024
Member

@ywang96 ywang96 left a comment


I've run the new test on H100 and it all passed for me too, so I'm giving this a green light. As for the refactor work on the test so that it can run on the L40 machines on our CI, let's do that in a later PR given our timeline for the patch release.

Thanks for fixing!

@patrickvonplaten
Collaborator Author

Just added the logprobs tests as explained by @DarkLight1337 - I think it indeed makes more sense! It would be great if it also passes on an L40. Thanks for the reviews!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 12, 2024
@ywang96 ywang96 merged commit d31174a into vllm-project:main Sep 12, 2024
60 checks passed
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: Pixtral inference not working correctly with LLMEngine/AsyncEngine
[Bug]: Pixtral fails when limit_mm_per_prompt not set
