[CI Failure]: Distributed Tests (2 GPUs) - Mllama TP=2 results divergence and deadlock issue #22559

@Isotr0py

Name of failing test

models/multimodal/generation/test_mllama.py test_models_distributed

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The Mllama TP test broke (https://buildkite.com/vllm/ci/builds/26344#01988898-4fbc-4d24-bd78-7325a1b6d9e3), with the HF and vLLM outputs diverging from the first generated token:

[2025-08-08T10:26:49Z] Traceback (most recent call last):
[2025-08-08T10:26:49Z]   File "/vllm-workspace/tests/utils.py", line 742, in wrapper
[2025-08-08T10:26:49Z]     f(*args, **kwargs)
[2025-08-08T10:26:49Z]   File "/vllm-workspace/tests/models/multimodal/generation/test_mllama.py", line 415, in test_models_distributed
[2025-08-08T10:26:49Z]     run_test(
[2025-08-08T10:26:49Z]   File "/vllm-workspace/tests/models/multimodal/generation/test_mllama.py", line 174, in run_test
[2025-08-08T10:26:49Z]     _run_test(
[2025-08-08T10:26:49Z]   File "/vllm-workspace/tests/models/multimodal/generation/test_mllama.py", line 245, in _run_test
[2025-08-08T10:26:49Z]     check_logprobs_close(
[2025-08-08T10:26:49Z]   File "/vllm-workspace/tests/models/utils.py", line 228, in check_logprobs_close
[2025-08-08T10:26:49Z]     assert output_id_0 in logprobs_elem_1, fail_msg
[2025-08-08T10:26:49Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-08-08T10:26:49Z] AssertionError: Test2:
[2025-08-08T10:26:49Z] Matched tokens:	[]
[2025-08-08T10:26:49Z] hf:	' blanketed in pink and green as the cherry blossoms reach their peak. The cherry blossoms are in full bloom, and the sky is a brilliant blue. The tower is a tall, slender structure that rises high above the surrounding buildings. The tower is made of glass and steel, and it is surrounded by a series'	{10321: -2.783482551574707, 304: -3.095982551574707, 11495: -3.470982551574707, 14545: -3.470982551574707, 13989: -3.533482551574707}
[2025-08-08T10:26:49Z] vllm:	" in full bloom, with cherry blossoms framing the Tokyo Tower in the background. The vibrant pink flowers are a beautiful sight to behold, and the tower's white and gray color scheme provides a striking contrast. The image captures the beauty of spring in Tokyo, with the cherry blossoms adding a touch of elegance to the scene"	{304: Logprob(logprob=-2.2309975624084473, rank=1, decoded_token=' in'), 264: Logprob(logprob=-2.8559975624084473, rank=2, decoded_token=' a'), 84273: Logprob(logprob=-3.4809975624084473, rank=3, decoded_token=' adorned'), 14545: Logprob(logprob=-3.4809975624084473, rank=4, decoded_token=' blo'), 2539: Logprob(logprob=-3.4809975624084473, rank=5, decoded_token=' full')}
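
For anyone reading the log cold, here is a rough sketch of why the assertion fires. It is a simplified stand-in for the cross-check, not the actual code in tests/models/utils.py. The function name `first_divergence_ok` is hypothetical, and I am assuming HF's first token is its top-ranked id (10321) and vLLM's is 304 (' in'). The top-5 id sets are copied from the log above.

```python
# Simplified sketch of a top-k cross-check at the first diverging token.
# NOT the real check_logprobs_close; the function name is hypothetical.

def first_divergence_ok(ids_0, topk_0, ids_1, topk_1):
    """Return True if, at the first position where the two token sequences
    differ, each side's token is among the other side's top-k candidates.
    Sequences that never differ also pass."""
    for pos, (tok_0, tok_1) in enumerate(zip(ids_0, ids_1)):
        if tok_0 == tok_1:
            continue  # still matching, keep scanning
        # First mismatch: both tokens must be plausible for the other runner.
        return tok_0 in topk_1[pos] and tok_1 in topk_0[pos]
    return True

# Top-5 candidate ids at position 0, copied from the log above.
hf_top5 = {10321, 304, 11495, 14545, 13989}
vllm_top5 = {304, 264, 84273, 14545, 2539}

# Assumption: HF emitted 10321 (its highest logprob), vLLM emitted 304.
result = first_divergence_ok([10321], [hf_top5], [304], [vllm_top5])
print(result)  # False: 10321 is not among vLLM's top-5 candidates
```

Under those assumptions vLLM's token (304) is still in HF's top-5, but HF's token is missing from vLLM's top-5, and "Matched tokens: []" says this already happens at the very first generated token.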

After ccdae73, the distributed tests also started hanging on a deadlock after running one test: https://buildkite.com/vllm/ci/builds/26350#019888be-96c2-4522-b50a-1bae59462b91

📝 History of failing test

CI history: https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/5bb02efa-9caf-8d44-80c3-d3b0789808da?period=7days

CC List.

Also cc @njhill about the deadlock issue.
