[V1] Prefix caching for multimodal language models #11187
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 3e212d9 to 615ca86.
@comaniac Thanks for this great work! Overall the code looks clean and I have left some comments. PTAL!
This pull request has merge conflicts that must be resolved before it can be merged.
This looks really great! I mainly chimed in for some nits.
The only main question I have is on the `generate_block_hash_extra_keys` routine, which I feel could be made easier to reason about. But I might be overlooking some constraints that drove its current implementation.
`export VLLM_USE_V1=1`
If I run without this, it runs OK.
@comaniac I will modify the code so you don't get an error without the mm cache preprocessor. Will do it on your PR and send you the patch.
Signed-off-by: Cody Yu <[email protected]>
Force-pushed from 615ca86 to bddb2f0.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Force-pushed from 73d0dd3 to 9ea575f.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
All comments should have been addressed. PTAL @ywang96 @alexm-neuralmagic @rickyyx. Highlights:
Note: CI failure is unrelated.
LGTM! @comaniac thanks for making prefix caching work for VLMs! Just some nits
Signed-off-by: Cody Yu <[email protected]>
LGTM! I've shared some benchmark results on Slack.
The negative impact of APC is minimal even at a 0% hit rate, so I think this PR is good to go!
Signed-off-by: Cody Yu <[email protected]>
I found that it's tricky to configure a different default value of prefix caching for MM models, because we don't know which model will be served when creating the engine config from the CLI. So now I enable prefix caching by default for all models in v1. We should mention in the blog post/announcement that if users encounter any errors with MM in v1, disabling prefix caching is one of the things they could try as a workaround.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Going to rename this PR to |
Signed-off-by: Cody Yu <[email protected]>
This adds prefix caching support for some VLM models. This has only been ad hoc tested with Qwen2.5VL. This should (hopefully) work with other VLM models like InternVL / Pixtral / Llama Vision with some additional effort. That will be done in follow-ups.

The implementation of prefix caching is inspired by vLLM's implementation (vllm-project/vllm#11187). However, we diverge slightly in our implementation. Say `<vision_token_id>` is 98, `<vision_start_token_id>` is 97, and `<vision_end_token_id>` is 99. We ignore the `<vision_start_token_id>` and `<vision_end_token_id>` tokens. Consider the following context:

```
                                       |<-- img0 --->|                       |<--- img1 -->|
                    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
tokens = np.array([51, 52, 53, 54, 97, 98, 98, 98, 98, 99, 55, 56, 57, 58, 97, 98, 98, 98, 98, 99, 59, 60, 61, 62])
```

Notice that this context has two images.
- img0: [5,9)
- img1: [14,19)

Recipe:
- First hash the pixel values for the two images. Say `hash(img_1) => 27712389489` and `hash(img_2) => -90834898922`.
- Replace the first `<vision_token_id>` of each image in the token array with the hash for that image.
- Chunk the token array and compute hashes as usual. E.g.:

```
                                       |<------ img0 ------->|                                      |<-------- img1 ------->|
                    0   1   2   3   4             5   6   7   8   9  10  11  12  13  14              15  16  17  18  19  20  21  22  23
tokens = np.array([51, 52, 53, 54, 97, 27712389489, 98, 98, 98, 99, 55, 56, 57, 58, 97, -90834898922, 98, 98, 98, 99, 59, 60, 61, 62])
```

Say PageSize == 4:

```
hash1 = hash(None , [ 11 12 13 14 ])             => hasha
hash2 = hash(hasha, [ 27712389489 999 999 999 ]) => hashb
hash3 = hash(hashb, [ 999 1000 15 16 ])          => hashc
...
```

Notice that this allows us to get prefix cache hits in some prefix of the prompt, even if the request supplies a different image later in the prompt. Also, this design trivially supports multiple images.

WARNING: This implementation has some limitations, namely that we will rerun vision encoding for ALL images in the prompt even if we get a cache hit on all tokens of some images. For example, say a request has img0 and img1 and we get a cache hit on img0. In an ideal implementation, we would run vision encoding just for img1. However, in this implementation, we run vision encoding on both img0 and img1. We then do a gather to get the subset of embeddings we need, followed by a scatter on that subset of embeddings to update the text embeddings at the correct indices. If we need to encode at least 1 image, we encode all N; if we need to encode 0 images, we encode 0 images. As such, this implementation only speeds up CE of the text model and does not speed up vision encoding unless we get a cache hit on all images in the prompt. The reason we do this is that it is hard to do the right thing here, since we precompute the vision encoder inputs in the tokenizer. We need to fix our code to stop adding a million things to `extra_model_args`; this is extremely hard to work with.

We end up needing to perform a masked scatter op to update the relevant rows of text embeddings corresponding to image tokens. The image embeddings that are unneeded due to prefix cache hits are masked, while the rest are scattered. This masked scatter is implemented via a gather -> scatter.

The correctness bug I was hunting down ended up being due to a mis-preparation of the decoder_position_ids inputs. The "Recompute this value on the fly." branch of the if statement led me to believe I could just run that code to recompute the position ids on a prefix cache hit.
However, that is not the case, as that branch only handles the case where `next_tokens` has no image, like during TG. While hunting down that bug, I also noticed that we are likely not handling preemptions appropriately in the existing code. This may cause correctness issues after a preemption.

```
[M] ubuntu@workspace-bez-a100-57bf475f6-qmpjn:~/syncmodular$ br //:max --config disable-mypy -- serve --model Qwen/Qwen2.5-VL-32B-Instruct --device-memory-utilization 0.85 --no-enable-chunked-prefill --trust-remote-code --devices gpu:0,1
INFO: Invocation ID: 8090289f-32d9-463f-8e54-448fdfe21711
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8090289f-32d9-463f-8e54-448fdfe21711
INFO: Analyzed target //:max (0 packages loaded, 2 targets and 1 aspect configured).
INFO: Found 1 target...
Target //SDK/lib/API/python/max/entrypoints:pipelines up-to-date: bazel-bin/SDK/lib/API/python/max/entrypoints/pipelines
Aspect //bazel/internal:mypy.bzl%mypy_aspect of //:max up-to-date (nothing to build)
INFO: Elapsed time: 1.569s, Critical Path: 0.14s
INFO: 4 processes: 1 action cache hit, 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8090289f-32d9-463f-8e54-448fdfe21711
19:43:32.585 INFO: Metrics initialized.
19:43:40.215 INFO:
19:43:40.215 INFO: ============================================================
19:43:40.215 INFO: Pipeline Configuration (use --pretty-print-config to print full config)
19:43:40.215 INFO: ============================================================
19:43:40.215 INFO: model : Qwen/Qwen2.5-VL-32B-Instruct
19:43:40.215 INFO: architecture : Qwen2_5_VLForConditionalGeneration
19:43:40.215 INFO: pipeline : TextGenerationPipeline
19:43:40.215 INFO: devices : gpu[0], gpu[1]
19:43:40.215 INFO: max_batch_size : 512
19:43:40.215 INFO: max_seq_len : 128000
19:43:40.215 INFO: cache_memory : 64.09 GiB
19:43:40.215 INFO:
19:43:41.393 INFO: Starting server...
Starting ASGI metrics server on port 8001
19:43:46.070 INFO: Metrics initialized.
19:43:46.662 INFO: Starting download of model: Qwen/Qwen2.5-VL-32B-Instruct
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 72.36it/s]
19:43:46.919 INFO: Finished download of model: Qwen/Qwen2.5-VL-32B-Instruct in 0.256844 seconds.
19:44:06.058 INFO: Building and compiling vision model...
19:45:39.941 INFO: Building and compiling vision model took 93.882900 seconds
19:45:39.941 INFO: Building and compiling language model...
19:47:20.563 INFO: Building and compiling language model took 100.621737 seconds
19:47:27.918 INFO: ********************************************************************************
🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
********************************************************************************
19:47:48.497 INFO: Executed CE batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 2022/8192 toks | Prompt Tput: 1.9K tok/s, Generation Tput: 0.9 tok/s | Batch creation: 285.01us, Execution: 1.08s | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:48.740 INFO: Executed TG batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.1 tok/s, Generation Tput: 41.1 tok/s | Batch creation: 66.45us, Execution: 243.34ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:49.051 INFO: Executed TG batch with 1 reqs | Terminated: 1 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 3.2 tok/s, Generation Tput: 29.0 tok/s | Batch creation: 38.93us, Execution: 310.24ms | KVCache usage: 0.0% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:49.656 INFO: Executed CE batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 102/8192 toks | Prompt Tput: 265.2 tok/s, Generation Tput: 2.6 tok/s | Batch creation: 199.94us, Execution: 384.66ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 95.0% | All Preemptions: 0 reqs
19:47:49.896 INFO: Executed TG batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.2 tok/s, Generation Tput: 41.8 tok/s | Batch creation: 41.60us, Execution: 239.10ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:50.111 INFO: Executed TG batch with 1 reqs | Terminated: 1 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.6 tok/s, Generation Tput: 41.8 tok/s | Batch creation: 35.55us, Execution: 215.12ms | KVCache usage: 0.0% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
```

The relevant log is:

```
Cache hit rate: 95.0%
```

```
[M] ubuntu@workspace-bez-a100-57bf475f6-qmpjn:~/syncmodular$ br //open-source/max/benchmark:benchmark_serving --config disable-mypy -- --backend modular --model Qwen/Qwen2.5-VL-32B-Instruct --endpoint /v1/chat/completions --host localhost --port 8000 --dataset-name random --random-input-len 40 --random-output-len 15 --random-image-size 512,512 --random-image-count=6 --random-coefficient-of-variation 0.1,0.6 --num-prompts 1
INFO: Invocation ID: 8ca467d1-76dc-4239-89da-e18de78cae4f
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8ca467d1-76dc-4239-89da-e18de78cae4f
INFO: Analyzed target //open-source/max/benchmark:benchmark_serving (0 packages loaded, 4 targets and 1 aspect configured).
INFO: Found 1 target...
Target //open-source/max/benchmark:benchmark_serving up-to-date: bazel-bin/open-source/max/benchmark/benchmark_serving
Aspect //bazel/internal:mypy.bzl%mypy_aspect of //open-source/max/benchmark:benchmark_serving up-to-date (nothing to build)
INFO: Elapsed time: 0.900s, Critical Path: 0.09s
INFO: 5 processes: 1 action cache hit, 5 internal.
INFO: Build completed successfully, 5 total actions
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8ca467d1-76dc-4239-89da-e18de78cae4f
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
19:47:46.597 INFO: benchmark_serving: Namespace(arxiv_summarization_input_len=15000, backend='modular', base_url=None, batch_job_image_dir=None, burstiness=1.0, chat_warmup_delay_ms=0.0, collect_cpu_stats=True, collect_gpu_stats=False, collect_server_stats=False, config_file=None, dataset_mode=<DatasetMode.HUGGINGFACE: 'huggingface'>, dataset_name='random', dataset_path=None, delay_between_chat_turns=None, disable_tqdm=False, endpoint='/v1/chat/completions', host='localhost', ignore_first_turn_stats=False, lora=None, lora_output_dir='/tmp/loras', lora_paths=[], lora_rank=16, lora_request_ratio=0.5, lora_server_path='/tmp/loras', lora_target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'], max_benchmark_duration_s=None, max_concurrency=None, max_concurrent_lora_ops=1, max_num_loras=10, max_output_len=None, metadata=[], model='Qwen/Qwen2.5-VL-32B-Instruct', model_max_length=None, num_chat_sessions=None, num_loras=0, num_prompts=1, obfuscated_conversations_average_output_len=175, obfuscated_conversations_coefficient_of_variation=0.1, obfuscated_conversations_shuffle=False, output_lengths=None, port=8000, print_inputs_and_outputs=False, random_coefficient_of_variation='0.1,0.6', random_distribution_type='normal', random_first_turn_ratio=1.0, random_image_count=6, random_image_size='512,512', random_input_len=40, random_max_num_unique_sys_prompt=1, random_num_turns=1, random_output_len=15, random_sys_prompt_ratio=0.0, record_output_lengths=None, request_rate='inf', result_dir=None, result_filename=None, save_result=False, seed=0, server_args='', skip_first_n_requests=0, skip_test_prompt=False, sonnet_input_len=550, sonnet_prefix_len=200, temperature=0.0, tokenizer=None, top_k=None, top_p=1.0, trust_remote_code=False)
19:47:46.597 INFO: benchmark_serving: getting tokenizer. api url: http://localhost:8000/v1/chat/completions
19:47:47.116 INFO: benchmark_serving: sampling requests
19:47:47.116 INFO: benchmark_shared.datasets.random: Random samples in normal distribution
19:47:47.169 INFO: benchmark_serving: starting benchmark run
19:47:47.170 INFO: benchmark_serving: Starting initial single prompt test run...
19:47:49.065 INFO: benchmark_serving: Initial test run completed. Starting main benchmark run...
19:47:49.065 INFO: benchmark_serving: Input request rate: inf
19:47:49.065 INFO: benchmark_serving: Burstiness factor: 1.0 (Poisson process)
19:47:49.065 INFO: benchmark_serving: Maximum request concurrency: None
100%|███████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.02s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.02
Total input tokens: 1583
Total generated tokens: 19
Total nonempty serving response chunks: 19
Input request rate (req/s): inf
Request throughput (req/s): 0.97580
------------Client Experience Metrics-------------
Max Concurrency: 1
Mean input token throughput (tok/s): 2820.11
Std input token throughput (tok/s): 0.00
Median input token throughput (tok/s): 2820.11
P90 input token throughput (tok/s): 2820.11
P95 input token throughput (tok/s): 2820.11
P99 input token throughput (tok/s): 2820.11
Mean output token throughput (tok/s): 39.24
Std output token throughput (tok/s): 0.00
Median output token throughput (tok/s): 39.24
P90 output token throughput (tok/s): 39.24
P95 output token throughput (tok/s): 39.24
P99 output token throughput (tok/s): 39.24
---------------Time to First Token----------------
Mean TTFT (ms): 561.33
Std TTFT (ms): 0.00
Median TTFT (ms): 561.33
P90 TTFT (ms): 561.33
P95 TTFT (ms): 561.33
P99 TTFT (ms): 561.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.48
Std TPOT (ms): 0.00
Median TPOT (ms): 25.48
P90 TPOT (ms): 25.48
P95 TPOT (ms): 25.48
P99 TPOT (ms): 25.48
---------------Inter-token Latency----------------
Mean ITL (ms): 25.36
Std ITL (ms): 71.50
Median ITL (ms): 0.11
P90 ITL (ms): 65.15
P95 ITL (ms): 219.78
P99 ITL (ms): 234.45
-------------Per-Request E2E Latency--------------
Mean Request Latency (ms): 1019.98
Std Request Latency (ms): 0.00
Median Request Latency (ms): 1019.98
P90 Request Latency (ms): 1019.98
P95 Request Latency (ms): 1019.98
P99 Request Latency (ms): 1019.98
-------------------Token Stats--------------------
Max input tokens: 1583
Max output tokens: 19
Max total tokens: 1602
--------------------CPU Stats---------------------
CPU User Utilization (%): 202.88
CPU System Utilization (%): 5.85
==================================================
19:47:50.126 INFO: benchmark_serving: finished benchmark run: Success.
```

MODULAR_ORIG_COMMIT_REV_ID: 0ca0f2ec8724d830dcd06af48862e2b981b3ddc5
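The hashing recipe described in the comment above can be sketched roughly in Python. This is only a toy illustration using the example token IDs; the helper name `hash_pages` and the constants are made up and this is not the actual MAX implementation:

```python
import numpy as np

VISION_TOKEN_ID = 98  # `<vision_token_id>` from the example above
PAGE_SIZE = 4

def hash_pages(tokens: np.ndarray, image_hashes: list[int],
               image_starts: list[int]) -> list[int]:
    """Replace the first placeholder token of each image with that image's
    pixel hash, then chain-hash the token array page by page."""
    tokens = tokens.astype(np.int64).copy()
    for img_hash, start in zip(image_hashes, image_starts):
        # `start` is the index of the image's first <vision_token_id>.
        assert tokens[start] == VISION_TOKEN_ID
        tokens[start] = img_hash
    page_hashes: list[int] = []
    prev = None
    for i in range(0, len(tokens) - len(tokens) % PAGE_SIZE, PAGE_SIZE):
        # Chain the previous page hash so a page only matches when the
        # entire prefix (including earlier images) matches.
        prev = hash((prev, tuple(tokens[i:i + PAGE_SIZE].tolist())))
        page_hashes.append(prev)
    return page_hashes

tokens = np.array([51, 52, 53, 54, 97, 98, 98, 98, 98, 99,
                   55, 56, 57, 58, 97, 98, 98, 98, 98, 99,
                   59, 60, 61, 62])
print(hash_pages(tokens, [27712389489, -90834898922], [5, 15]))
```

A request that reuses the same prefix and the same first image reproduces the same leading page hashes, which is what yields prefix cache hits even when a later image differs.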
This PR enables prefix caching for VLMs. Specifically, we enhanced the KV block hash to support extra keys with the image hash and offset.
Block Hash Format
Taking a series of 3 blocks as an example:

```
T0, T1, P00, P01 | P02, P03, P04, T2 | T3, P10, P11, P12
```

where `Ti` is the i-th text token and `Pxy` is the y-th placeholder token of the x-th image, so this prompt has 2 images (P0 and P1). Assuming the image hashes of P0 and P1 are `aaa` and `bbb`, respectively, and `mm_positions=[(offset=2, length=5), (offset=9, length=3)]`, each of the 3 block hashes additionally carries, as extra keys, the hash and offset of the image whose placeholder tokens fall into that block (a sketch is given below).

A more straightforward approach would be to embed the image hash and offset directly into the token sequence before hashing. We don't adopt this approach because it needs to traverse all input tokens and replace placeholder tokens with the tuple.
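As a rough illustration of how these extra keys can be folded into the chained block hashes, here is a minimal sketch; the names (`BlockHash`, `hash_request_blocks`) and the exact key layout are assumptions for illustration, not the actual vLLM code:

```python
from typing import NamedTuple, Optional

BLOCK_SIZE = 4

class BlockHash(NamedTuple):
    value: int
    # (image_hash, mm_position_offset) for every image whose placeholder
    # tokens overlap this block; empty for text-only blocks.
    extra_keys: tuple

def hash_block_tokens(prev: Optional[BlockHash], tokens: tuple,
                      extra_keys: tuple) -> BlockHash:
    # Chain the previous block hash so a block only matches when its whole
    # prefix (text and images seen so far) matches as well.
    prev_value = prev.value if prev is not None else None
    return BlockHash(hash((prev_value, tokens, extra_keys)), extra_keys)

def hash_request_blocks(token_ids, mm_hashes, mm_positions):
    """Hash full blocks, attaching image hashes/offsets as extra keys."""
    hashes, prev = [], None
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        extra = tuple(
            (img_hash, offset)
            for img_hash, (offset, length) in zip(mm_hashes, mm_positions)
            if offset < end and offset + length > start  # image overlaps block
        )
        prev = hash_block_tokens(prev, tuple(token_ids[start:end]), extra)
        hashes.append(prev)
    return hashes

# The 3-block example above: T0 T1 P00 P01 | P02 P03 P04 T2 | T3 P10 P11 P12,
# with image hashes "aaa"/"bbb" and mm_positions [(2, 5), (9, 3)].
tokens = [1, 2, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0]
print(hash_request_blocks(tokens, ["aaa", "bbb"], [(2, 5), (9, 3)]))
```

With this layout, two requests share the first block only if both the leading text tokens and the first image hash match, while a different image later in the prompt still leaves the earlier blocks cacheable.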
Performance Optimization
To reduce the overhead of computing the extra keys for each block, this PR adds an optimization that caches the computed hash values in `Request`, so that the block hashes for a request are guaranteed to be computed only once.
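A minimal sketch of that memoization; the field and property names are illustrative (not the real `Request` class), and it reuses the hypothetical `hash_request_blocks` helper from the sketch above:

```python
class Request:
    """Toy request object that memoizes its KV block hashes."""

    def __init__(self, token_ids, mm_hashes, mm_positions):
        self.token_ids = token_ids
        self.mm_hashes = mm_hashes
        self.mm_positions = mm_positions
        self._block_hashes = None  # computed lazily, at most once

    @property
    def block_hashes(self):
        # The scheduler / KV cache manager may ask for these repeatedly;
        # compute the (potentially expensive) extra keys only on first use.
        if self._block_hashes is None:
            self._block_hashes = hash_request_blocks(
                self.token_ids, self.mm_hashes, self.mm_positions)
        return self._block_hashes
```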
Benchmark
We benchmarked the throughput using Llava-1.6-Mistral-7B with 500 prompts on an L40S GPU. The image hit rate is set to 30%, meaning that we have 500*0.7=350 unique images and 500-350=150 redundant requests. We put the redundant requests together to achieve the best cache locality, to better illustrate the effectiveness of prefix caching. The benchmark script is https://gist.github.com/comaniac/ea26df17fdffa533cf53d53b8455bc31
Note: Prefix caching for VLMs is now enabled by default, but it requires the image hashes from the mm cache preprocessor, so the following command (prefix caching enabled without the mm cache preprocessor) will result in an error. @alexm-neuralmagic please let me know what's the best practice for this.
cc @alexm-neuralmagic @ywang96 @rickyyx