Description
Your current environment
The output of `python collect_env.py`
How would you like to use vllm
I have tried InternVL and MiniCPM by sending requests with multiple multimodal inputs, but both fail to respond and return a bad-request error. I have done some research and noticed that some VLMs, such as Phi-3-vision, already support such inputs (#5820). Is this feature still under construction, or did I miss anything?
Online inference example:
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server
# (assumption: the server is running locally on the default port).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="xxx",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
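For context, one likely cause of the bad-request error is vLLM's per-prompt multimodal limit, which defaults to one image per prompt. Below is a minimal offline sketch, not a confirmed fix; the model name (OpenGVLab/InternVL2-8B), the local file names, and the <image> placeholder format are assumptions, and the exact option syntax may vary by vLLM version.

from PIL import Image
from vllm import LLM

# Hypothetical offline sketch: raise vLLM's default cap of one image per prompt
# via limit_mm_per_prompt so a two-image request is accepted.
llm = LLM(
    model="OpenGVLab/InternVL2-8B",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},
)

# Assumption: one <image> placeholder per image in the prompt.
prompt = (
    "<image>\n<image>\n"
    "What are in these images? Is there any difference between them?"
)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        # Pass a list of PIL images, one per placeholder.
        "image": [Image.open("image_1.jpg"), Image.open("image_2.jpg")],
    },
})
print(outputs[0].outputs[0].text)

For the online example above, the OpenAI-compatible server would presumably also need to be launched with a higher image limit (e.g. passing --limit-mm-per-prompt image=2; flag syntax may differ across versions).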
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.