Using VLMs
==========

vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.

Engine Arguments
----------------

    print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

Online OpenAI Vision API Compatible Inference
---------------------------------------------

You can serve vision language models with vLLM's HTTP server, which is compatible with the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

.. note::
    Currently, vLLM supports only a **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be
    added in the future.

Below is an example of how to launch the same ``llava-hf/llava-1.5-7b-hf`` model with the vLLM API server.

.. important::
    Since the OpenAI Vision API is based on the `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template
    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the
    HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
        --image-input-type pixel_values \
        --image-token-id 32000 \
        --image-input-shape 1,3,336,336 \
        --image-feature-size 576 \
        --chat-template template_llava.jinja
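
One quick way to check that the server came up correctly is to list the models it serves. The snippet below is a minimal sketch that assumes the server launched above is running locally on the default port ``8000`` and that the ``openai`` Python client is installed:

.. code-block:: python

    from openai import OpenAI

    # Point the client at the local vLLM server; the API key is only a placeholder.
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

    # List the models served by this endpoint; the model passed to --model should appear here.
    for model in client.models.list():
        print(model.id)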

To consume the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI

    # Set the OpenAI API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Send a single text prompt together with one image URL.
    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response)
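
If you only need the generated answer rather than the full response object, you can read it from the first choice. This short follow-up is a sketch that assumes the ``chat_response`` object from the example above:

.. code-block:: python

    # Extract only the generated text from the first (and, by default, only) choice.
    print(chat_response.choices[0].message.content)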

.. note::

    By default, the timeout for fetching images through an HTTP URL is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: shell

        export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
    There is no need to format the prompt with the image token ``<image>`` when serving VLMs with the API server, since the prompt is
    processed automatically by the server.
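
To make the last point concrete, below is a sketch of the same chat request sent as a plain HTTP call with the ``requests`` library, assuming the server launched above is running; note that the text content contains no ``<image>`` tokens, since the server handles the prompt formatting:

.. code-block:: python

    import requests

    # The message text is just the question; the server inserts the image
    # placeholder tokens and applies the chat template on its own.
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "llava-hf/llava-1.5-7b-hf",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        },
                    },
                ],
            }],
        },
    )
    # Print only the generated answer from the OpenAI-format response.
    print(response.json()["choices"][0]["message"]["content"])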