[V1] Prefix caching for multimodal language models #11187
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 3e212d9 to 615ca86.
@comaniac Thanks for this great work! Overall the code looks clean and I have left some comments. PTAL!
This pull request has merge conflicts that must be resolved before it can be merged.
This looks really great! I mainly chimed in for some nits.
The only main question I have is on the `generate_block_hash_extra_keys` routine, which I feel could be made easier to reason about. But I might be overlooking some constraints that drove its current implementation.
`export VLLM_USE_V1=1`
If I run without this, it runs OK.
@comaniac I will modify the code so you don't get an error without the mm cache preprocessor. Will do it on your PR and send you the patch.
Signed-off-by: Cody Yu <[email protected]>
Force-pushed from 615ca86 to bddb2f0.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Force-pushed from 73d0dd3 to 9ea575f.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
All comments should have been addressed. PTAL @ywang96 @alexm-neuralmagic @rickyyx. Highlights:
Note: CI failure is unrelated.
LGTM! @comaniac thanks for making prefix caching work for VLMs! Just some nits
Signed-off-by: Cody Yu <[email protected]>
LGTM! I've shared some benchmark results on Slack.
The negative impact of APC is minimal even at a 0% hit rate, so I think this PR is good to go!
Signed-off-by: Cody Yu <[email protected]>
I found that it's tricky to configure a different default value of prefix caching for MM models, because we don't know which model will be served when creating the engine config from the CLI. So now I enable prefix caching by default for all models in v1. We should mention in the blog post/announcement that if users encounter any errors with MM in v1, disabling prefix caching is one of the things they could try as a workaround.
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Going to rename this PR to |
Signed-off-by: Cody Yu <[email protected]>
This adds prefix caching support for some VLM models. This has only been ad hoc tested with Qwen2.5VL. This should (hopefully) work with other VLM models like InternVL / Pixtral / Llama Vision with some additional effort. That will be done in follow-ups.

The implementation of prefix caching is inspired by vLLM's implementation (vllm-project/vllm#11187). However, we diverge slightly in our implementation. Say `<vision_token_id>` is 98, `<vision_start_token_id>` is 97, and `<vision_end_token_id>` is 99. We ignore the `<vision_start_token_id>` and `<vision_end_token_id>` tokens. Consider the following context:

```
                                       |<-- img0 --->|                       |<--- img1 -->|
                    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
tokens = np.array([51, 52, 53, 54, 97, 98, 98, 98, 98, 99, 55, 56, 57, 58, 97, 98, 98, 98, 98, 99, 59, 60, 61, 62])
```

Notice that this context has two images.
- img0: [5,9)
- img1: [14,19)

Recipe:
- First hash the pixel values for the two images. Say `hash(img_1) => 27712389489` and `hash(img_2) => -90834898922`.
- Replace the first `<vision_token_id>` of each image in the token array with the hash for that image.
- Chunk the token array and compute hashes as usual. E.g.:

```
                                       |<------ img0 ------->|                                      |<-------- img1 ------->|
                    0   1   2   3   4             5   6   7   8   9  10  11  12  13  14              15  16  17  18  19  20  21  22  23
tokens = np.array([51, 52, 53, 54, 97, 27712389489, 98, 98, 98, 99, 55, 56, 57, 58, 97, -90834898922, 98, 98, 98, 99, 59, 60, 61, 62])
```

Say PageSize == 4:

```
hash1 = hash(None , [ 11 12 13 14 ])             => hasha
hash2 = hash(hasha, [ 27712389489 999 999 999 ]) => hashb
hash3 = hash(hashb, [ 999 1000 15 16 ])          => hashc
...
```

Notice that this allows us to get prefix cache hits in some prefix of the prompt, even if the request supplies a different image later in the prompt. Also, this design trivially supports multiple images.

WARNING: This implementation has some limitations, namely that we will rerun vision encoding for ALL images in the prompt even if we get a cache hit on all tokens of some images. For example, say a request has img0 and img1 and we get a cache hit on img0. In an ideal implementation, we would run vision encoding just for img1. However, in this implementation, we run vision encoding on both img0 and img1. We then do a gather to get the subset of embeddings we need, followed by a scatter on that subset of embeddings to update the text embeddings at the correct indices. If we need to encode at least 1 image, we encode all N; if we need to encode 0 images, we encode 0 images. As such, this implementation only speeds up CE of the text model and does not speed up vision encoding unless we get a cache hit on all images in the prompt. The reason we do this is that it is hard to do the right thing here, since we precompute the vision encoder inputs in the tokenizer. We need to fix our code to stop adding a million things to `extra_model_args`; this is extremely hard to work with.

We end up needing to perform a masked scatter op to update the relevant rows of text embeddings corresponding to image tokens. The image embeddings that are unneeded due to prefix cache hits are masked, while the rest are scattered. This masked scatter is implemented via a gather -> scatter.

The correctness bug I was hunting down ended up being due to a mis-preparation of the decoder_position_ids inputs. The "Recompute this value on the fly." branch of the if statement led me to believe I could just run that code to recompute the position ids on a prefix cache hit.
However, that is not the case, as that branch only handles the case where `next_tokens` has no image, like during TG. While hunting down that bug, I also noticed that we are likely not handling preemptions appropriately in the existing code. This may cause correctness issues after a preemption.

```
[M] ubuntu@workspace-bez-a100-57bf475f6-qmpjn:~/syncmodular$ br //:max --config disable-mypy -- serve --model Qwen/Qwen2.5-VL-32B-Instruct --device-memory-utilization 0.85 --no-enable-chunked-prefill --trust-remote-code --devices gpu:0,1
INFO: Invocation ID: 8090289f-32d9-463f-8e54-448fdfe21711
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8090289f-32d9-463f-8e54-448fdfe21711
INFO: Analyzed target //:max (0 packages loaded, 2 targets and 1 aspect configured).
INFO: Found 1 target...
Target //SDK/lib/API/python/max/entrypoints:pipelines up-to-date: bazel-bin/SDK/lib/API/python/max/entrypoints/pipelines
Aspect //bazel/internal:mypy.bzl%mypy_aspect of //:max up-to-date (nothing to build)
INFO: Elapsed time: 1.569s, Critical Path: 0.14s
INFO: 4 processes: 1 action cache hit, 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8090289f-32d9-463f-8e54-448fdfe21711
19:43:32.585 INFO: Metrics initialized.
19:43:40.215 INFO:
19:43:40.215 INFO: ============================================================
19:43:40.215 INFO: Pipeline Configuration (use --pretty-print-config to print full config)
19:43:40.215 INFO: ============================================================
19:43:40.215 INFO: model : Qwen/Qwen2.5-VL-32B-Instruct
19:43:40.215 INFO: architecture : Qwen2_5_VLForConditionalGeneration
19:43:40.215 INFO: pipeline : TextGenerationPipeline
19:43:40.215 INFO: devices : gpu[0], gpu[1]
19:43:40.215 INFO: max_batch_size : 512
19:43:40.215 INFO: max_seq_len : 128000
19:43:40.215 INFO: cache_memory : 64.09 GiB
19:43:40.215 INFO:
19:43:41.393 INFO: Starting server...
Starting ASGI metrics server on port 8001
19:43:46.070 INFO: Metrics initialized.
19:43:46.662 INFO: Starting download of model: Qwen/Qwen2.5-VL-32B-Instruct
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 72.36it/s]
19:43:46.919 INFO: Finished download of model: Qwen/Qwen2.5-VL-32B-Instruct in 0.256844 seconds.
19:44:06.058 INFO: Building and compiling vision model...
19:45:39.941 INFO: Building and compiling vision model took 93.882900 seconds
19:45:39.941 INFO: Building and compiling language model...
19:47:20.563 INFO: Building and compiling language model took 100.621737 seconds
19:47:27.918 INFO: ********************************************************************************
🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
********************************************************************************
19:47:48.497 INFO: Executed CE batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 2022/8192 toks | Prompt Tput: 1.9K tok/s, Generation Tput: 0.9 tok/s | Batch creation: 285.01us, Execution: 1.08s | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:48.740 INFO: Executed TG batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.1 tok/s, Generation Tput: 41.1 tok/s | Batch creation: 66.45us, Execution: 243.34ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:49.051 INFO: Executed TG batch with 1 reqs | Terminated: 1 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 3.2 tok/s, Generation Tput: 29.0 tok/s | Batch creation: 38.93us, Execution: 310.24ms | KVCache usage: 0.0% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:49.656 INFO: Executed CE batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 102/8192 toks | Prompt Tput: 265.2 tok/s, Generation Tput: 2.6 tok/s | Batch creation: 199.94us, Execution: 384.66ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 95.0% | All Preemptions: 0 reqs
19:47:49.896 INFO: Executed TG batch with 1 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.2 tok/s, Generation Tput: 41.8 tok/s | Batch creation: 41.60us, Execution: 239.10ms | KVCache usage: 0.8% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
19:47:50.111 INFO: Executed TG batch with 1 reqs | Terminated: 1 reqs, Pending: 0 reqs | Input Tokens: 1/INF toks | Prompt Tput: 4.6 tok/s, Generation Tput: 41.8 tok/s | Batch creation: 35.55us, Execution: 215.12ms | KVCache usage: 0.0% of 2050 blocks | Cache hit rate: 0.0% | All Preemptions: 0 reqs
```

The relevant log is:

```
Cache hit rate: 95.0%
```

```
[M] ubuntu@workspace-bez-a100-57bf475f6-qmpjn:~/syncmodular$ br //open-source/max/benchmark:benchmark_serving --config disable-mypy -- --backend modular --model Qwen/Qwen2.5-VL-32B-Instruct --endpoint /v1/chat/completions --host localhost --port 8000 --dataset-name random --random-input-len 40 --random-output-len 15 --random-image-size 512,512 --random-image-count=6 --random-coefficient-of-variation 0.1,0.6 --num-prompts 1
INFO: Invocation ID: 8ca467d1-76dc-4239-89da-e18de78cae4f
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8ca467d1-76dc-4239-89da-e18de78cae4f
INFO: Analyzed target //open-source/max/benchmark:benchmark_serving (0 packages loaded, 4 targets and 1 aspect configured).
INFO: Found 1 target...
Target //open-source/max/benchmark:benchmark_serving up-to-date: bazel-bin/open-source/max/benchmark/benchmark_serving
Aspect //bazel/internal:mypy.bzl%mypy_aspect of //open-source/max/benchmark:benchmark_serving up-to-date (nothing to build)
INFO: Elapsed time: 0.900s, Critical Path: 0.09s
INFO: 5 processes: 1 action cache hit, 5 internal.
INFO: Build completed successfully, 5 total actions
INFO: Streaming build results to: https://modular.buildbuddy.io/invocation/8ca467d1-76dc-4239-89da-e18de78cae4f
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
19:47:46.597 INFO: benchmark_serving: Namespace(arxiv_summarization_input_len=15000, backend='modular', base_url=None, batch_job_image_dir=None, burstiness=1.0, chat_warmup_delay_ms=0.0, collect_cpu_stats=True, collect_gpu_stats=False, collect_server_stats=False, config_file=None, dataset_mode=<DatasetMode.HUGGINGFACE: 'huggingface'>, dataset_name='random', dataset_path=None, delay_between_chat_turns=None, disable_tqdm=False, endpoint='/v1/chat/completions', host='localhost', ignore_first_turn_stats=False, lora=None, lora_output_dir='/tmp/loras', lora_paths=[], lora_rank=16, lora_request_ratio=0.5, lora_server_path='/tmp/loras', lora_target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'], max_benchmark_duration_s=None, max_concurrency=None, max_concurrent_lora_ops=1, max_num_loras=10, max_output_len=None, metadata=[], model='Qwen/Qwen2.5-VL-32B-Instruct', model_max_length=None, num_chat_sessions=None, num_loras=0, num_prompts=1, obfuscated_conversations_average_output_len=175, obfuscated_conversations_coefficient_of_variation=0.1, obfuscated_conversations_shuffle=False, output_lengths=None, port=8000, print_inputs_and_outputs=False, random_coefficient_of_variation='0.1,0.6', random_distribution_type='normal', random_first_turn_ratio=1.0, random_image_count=6, random_image_size='512,512', random_input_len=40, random_max_num_unique_sys_prompt=1, random_num_turns=1, random_output_len=15, random_sys_prompt_ratio=0.0, record_output_lengths=None, request_rate='inf', result_dir=None, result_filename=None, save_result=False, seed=0, server_args='', skip_first_n_requests=0, skip_test_prompt=False, sonnet_input_len=550, sonnet_prefix_len=200, temperature=0.0, tokenizer=None, top_k=None, top_p=1.0, trust_remote_code=False)
19:47:46.597 INFO: benchmark_serving: getting tokenizer. api url: http://localhost:8000/v1/chat/completions
19:47:47.116 INFO: benchmark_serving: sampling requests
19:47:47.116 INFO: benchmark_shared.datasets.random: Random samples in normal distribution
19:47:47.169 INFO: benchmark_serving: starting benchmark run
19:47:47.170 INFO: benchmark_serving: Starting initial single prompt test run...
19:47:49.065 INFO: benchmark_serving: Initial test run completed. Starting main benchmark run...
19:47:49.065 INFO: benchmark_serving: Input request rate: inf
19:47:49.065 INFO: benchmark_serving: Burstiness factor: 1.0 (Poisson process)
19:47:49.065 INFO: benchmark_serving: Maximum request concurrency: None
100%|███████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.02s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.02
Total input tokens: 1583
Total generated tokens: 19
Total nonempty serving response chunks: 19
Input request rate (req/s): inf
Request throughput (req/s): 0.97580
------------Client Experience Metrics-------------
Max Concurrency: 1
Mean input token throughput (tok/s): 2820.11
Std input token throughput (tok/s): 0.00
Median input token throughput (tok/s): 2820.11
P90 input token throughput (tok/s): 2820.11
P95 input token throughput (tok/s): 2820.11
P99 input token throughput (tok/s): 2820.11
Mean output token throughput (tok/s): 39.24
Std output token throughput (tok/s): 0.00
Median output token throughput (tok/s): 39.24
P90 output token throughput (tok/s): 39.24
P95 output token throughput (tok/s): 39.24
P99 output token throughput (tok/s): 39.24
---------------Time to First Token----------------
Mean TTFT (ms): 561.33
Std TTFT (ms): 0.00
Median TTFT (ms): 561.33
P90 TTFT (ms): 561.33
P95 TTFT (ms): 561.33
P99 TTFT (ms): 561.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.48
Std TPOT (ms): 0.00
Median TPOT (ms): 25.48
P90 TPOT (ms): 25.48
P95 TPOT (ms): 25.48
P99 TPOT (ms): 25.48
---------------Inter-token Latency----------------
Mean ITL (ms): 25.36
Std ITL (ms): 71.50
Median ITL (ms): 0.11
P90 ITL (ms): 65.15
P95 ITL (ms): 219.78
P99 ITL (ms): 234.45
-------------Per-Request E2E Latency--------------
Mean Request Latency (ms): 1019.98
Std Request Latency (ms): 0.00
Median Request Latency (ms): 1019.98
P90 Request Latency (ms): 1019.98
P95 Request Latency (ms): 1019.98
P99 Request Latency (ms): 1019.98
-------------------Token Stats--------------------
Max input tokens: 1583
Max output tokens: 19
Max total tokens: 1602
--------------------CPU Stats---------------------
CPU User Utilization (%): 202.88
CPU System Utilization (%): 5.85
==================================================
19:47:50.126 INFO: benchmark_serving: finished benchmark run: Success.
```

MODULAR_ORIG_COMMIT_REV_ID: 0ca0f2ec8724d830dcd06af48862e2b981b3ddc5
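The hashing recipe described in the comment above can be sketched roughly in Python. This is only a toy illustration using the example token IDs; the helper name `hash_pages` and the constants are made up and this is not the actual MAX implementation:

```python
import numpy as np

VISION_TOKEN_ID = 98  # `<vision_token_id>` from the example above
PAGE_SIZE = 4

def hash_pages(tokens: np.ndarray, image_hashes: list[int],
               image_starts: list[int]) -> list[int]:
    """Replace the first placeholder token of each image with that image's
    pixel hash, then chain-hash the token array page by page."""
    tokens = tokens.astype(np.int64).copy()
    for img_hash, start in zip(image_hashes, image_starts):
        # `start` is the index of the image's first <vision_token_id>.
        assert tokens[start] == VISION_TOKEN_ID
        tokens[start] = img_hash
    page_hashes: list[int] = []
    prev = None
    for i in range(0, len(tokens) - len(tokens) % PAGE_SIZE, PAGE_SIZE):
        # Chain the previous page hash so a page only matches when the
        # entire prefix (including earlier images) matches.
        prev = hash((prev, tuple(tokens[i:i + PAGE_SIZE].tolist())))
        page_hashes.append(prev)
    return page_hashes

tokens = np.array([51, 52, 53, 54, 97, 98, 98, 98, 98, 99,
                   55, 56, 57, 58, 97, 98, 98, 98, 98, 99,
                   59, 60, 61, 62])
print(hash_pages(tokens, [27712389489, -90834898922], [5, 15]))
```

A request that reuses the same prefix and the same first image reproduces the same leading page hashes, which is what yields prefix cache hits even when a later image differs.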
This PR enables prefix caching for VLMs. Specifically, we enhanced the KV block hash to support extra keys with the image hash and offset.
Block Hash Format
Taking a series of 3 blocks as an example:

```
T0, T1, P00, P01 | P02, P03, P04, T2 | T3, P10, P11, P12
```

where `Ti` is the i-th text token and `Pxy` is the y-th placeholder token of the x-th image, so this prompt has 2 images (P0 and P1). Assuming the image hashes of P0 and P1 are `aaa` and `bbb`, respectively, and `mm_positions=[(offset=2, length=5), (offset=9, length=3)]`, each of the 3 block hashes additionally carries, as extra keys, the hash and offset of the image whose placeholder tokens fall into that block (a sketch is given below).

A more straightforward approach would be to embed the image hash and offset directly into the token sequence before hashing. We don't adopt this approach because it needs to traverse all input tokens and replace placeholder tokens with the tuple.
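As a rough illustration of how these extra keys can be folded into the chained block hashes, here is a minimal sketch; the names (`BlockHash`, `hash_request_blocks`) and the exact key layout are assumptions for illustration, not the actual vLLM code:

```python
from typing import NamedTuple, Optional

BLOCK_SIZE = 4

class BlockHash(NamedTuple):
    value: int
    # (image_hash, mm_position_offset) for every image whose placeholder
    # tokens overlap this block; empty for text-only blocks.
    extra_keys: tuple

def hash_block_tokens(prev: Optional[BlockHash], tokens: tuple,
                      extra_keys: tuple) -> BlockHash:
    # Chain the previous block hash so a block only matches when its whole
    # prefix (text and images seen so far) matches as well.
    prev_value = prev.value if prev is not None else None
    return BlockHash(hash((prev_value, tokens, extra_keys)), extra_keys)

def hash_request_blocks(token_ids, mm_hashes, mm_positions):
    """Hash full blocks, attaching image hashes/offsets as extra keys."""
    hashes, prev = [], None
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        extra = tuple(
            (img_hash, offset)
            for img_hash, (offset, length) in zip(mm_hashes, mm_positions)
            if offset < end and offset + length > start  # image overlaps block
        )
        prev = hash_block_tokens(prev, tuple(token_ids[start:end]), extra)
        hashes.append(prev)
    return hashes

# The 3-block example above: T0 T1 P00 P01 | P02 P03 P04 T2 | T3 P10 P11 P12,
# with image hashes "aaa"/"bbb" and mm_positions [(2, 5), (9, 3)].
tokens = [1, 2, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0]
print(hash_request_blocks(tokens, ["aaa", "bbb"], [(2, 5), (9, 3)]))
```

With this layout, two requests share the first block only if both the leading text tokens and the first image hash match, while a different image later in the prompt still leaves the earlier blocks cacheable.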
Performance Optimization
To reduce the overhead of computing the extra keys for each block, this PR adds an optimization that caches the computed hash values in `Request`, so that the block hashes for a request are guaranteed to be computed only once.
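A minimal sketch of that memoization; the field and property names are illustrative (not the real `Request` class), and it reuses the hypothetical `hash_request_blocks` helper from the sketch above:

```python
class Request:
    """Toy request object that memoizes its KV block hashes."""

    def __init__(self, token_ids, mm_hashes, mm_positions):
        self.token_ids = token_ids
        self.mm_hashes = mm_hashes
        self.mm_positions = mm_positions
        self._block_hashes = None  # computed lazily, at most once

    @property
    def block_hashes(self):
        # The scheduler / KV cache manager may ask for these repeatedly;
        # compute the (potentially expensive) extra keys only on first use.
        if self._block_hashes is None:
            self._block_hashes = hash_request_blocks(
                self.token_ids, self.mm_hashes, self.mm_positions)
        return self._block_hashes
```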
Benchmark
We benchmarked the throughput using Llava-1.6-Mistral-7B with 500 prompts on an L40S GPU. The image hit rate is set to 30%, meaning that we have 500*0.7=350 unique images and 500-350=150 redundant requests. We put the redundant requests together to achieve the best cache locality, to better illustrate the effectiveness of prefix caching. The benchmark script is https://gist.github.com/comaniac/ea26df17fdffa533cf53d53b8455bc31
Note: Prefix caching for VLMs is now enabled by default, but it requires the image hashes from the mm cache preprocessor, so the following command (prefix caching enabled without the mm cache preprocessor) will result in an error. @alexm-neuralmagic please let me know what's the best practice for this.
cc @alexm-neuralmagic @ywang96 @rickyyx