
feat: support runai streamer for vllm #423

Open · wants to merge 1 commit into base: main

Conversation

@cr7258 (Contributor) commented May 19, 2025

What this PR does / why we need it

Add a new config runai-streamer to the vLLM BackendRuntime to allow loading models with the Run:ai Model Streamer, which improves model loading times. Currently, only vLLM supports the Run:ai Model Streamer.
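For reference, the new entry in the vLLM BackendRuntime looks roughly like the sketch below. The field names and resource values are illustrative only (they mirror the shape of the diff later in this thread, not the exact BackendRuntime schema); the essential piece is switching vLLM to --load-format runai_streamer.

```yaml
# Illustrative sketch only: not the exact BackendRuntime schema.
# A named config entry that starts vLLM with the Run:ai Model Streamer load format.
- name: runai-streamer
  args:
    - --load-format
    - runai_streamer          # stream safetensors through the Run:ai Model Streamer
  resources:                  # values copied from the neighbouring entries; tune as needed
    limits:
      cpu: 8
      memory: 16Gi
```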

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s

kl qwen2-0--5b-0  
Defaulted container "model-runner" out of: model-runner, model-loader (init)
INFO 05-19 02:00:18 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:18 api_server.py:912] vLLM API server version 0.7.3
INFO 05-19 02:00:18 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2-0--5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 05-19 02:00:18 api_server.py:209] Started engine process with PID 22
INFO 05-19 02:00:22 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:24 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', speculative_config=None, tokenizer='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2-0--5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 05-19 02:00:29 cuda.py:229] Using Flash Attention backend.
INFO 05-19 02:00:30 model_runner.py:1110] Starting to load model /workspace/models/models--Qwen--Qwen2-0.5B-Instruct...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 942.3 MiB for file: model.safetensors
Read throughput is 9.41 GB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s
INFO 05-19 02:00:30 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 05-19 02:00:31 worker.py:267] Memory profiling takes 0.88 seconds
INFO 05-19 02:00:31 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 05-19 02:00:31 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 17.54GiB.
INFO 05-19 02:00:31 executor_base.py:111] # cuda blocks: 95795, # CPU blocks: 21845
INFO 05-19 02:00:31 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 46.77x
INFO 05-19 02:00:36 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:12<00:00,  2.78it/s]
INFO 05-19 02:00:49 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.15 GiB
INFO 05-19 02:00:49 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 18.59 seconds
INFO 05-19 02:00:50 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 05-19 02:00:50 launcher.py:23] Available routes are:
INFO 05-19 02:00:50 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /health, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /tokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /detokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/models, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /version, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /pooling, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:46952 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:46958 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35464 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35468 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35482 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:51796 - "GET /health HTTP/1.1" 200 OK

Which issue(s) this PR fixes

Fixes #352

Special notes for your reviewer

Does this PR introduce a user-facing change?

support runai streamer for vllm

@InftyAI-Agent added labels needs-triage, needs-priority, and do-not-merge/needs-kind on May 19, 2025
@InftyAI-Agent requested a review from kerthcet on May 19, 2025 09:11
@cr7258 (Contributor, Author) commented May 19, 2025

/kind feature

@InftyAI-Agent added the feature label and removed do-not-merge/needs-kind on May 19, 2025
@kerthcet (Member) commented:

What we hope to achieve here is generally two things:

  • can we bump this into our model loader component and make it inference-agnostic
  • can we support GPUs other than NVIDIA

Both need experiments; sorry I didn't explain this clearly here. The original comment: #352 (comment)

The configuration is already open to users, so I don't think we need to do anything there.

@cr7258 (Contributor, Author) commented May 20, 2025

can we bump this into our model loader component and make it inference-agnostic

As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally.

Run:ai Model Streamer can speed up model loading by concurrently loading already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think there is anything to do in the model loader to support the Run:ai Model Streamer.

Additionally, Run:ai Model Streamer is not inference-agnostic — it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)
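For context, the log above shows the engine started with load_format='runai_streamer'. Below is a hedged sketch of the engine-side flags involved; the concurrency and memory_limit values are illustrative tuning examples based on vLLM's Run:ai Model Streamer options, not values set by this PR.

```yaml
# Sketch of the vLLM arguments involved (illustrative values):
args:
  - --load-format
  - runai_streamer                 # read safetensors via the Run:ai Model Streamer
  - --model-loader-extra-config    # optional streamer tuning: reader concurrency and CPU buffer size (bytes)
  - '{"concurrency": 16, "memory_limit": 5368709120}'
```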

@kerthcet (Member) commented:

I thought about this a bit, and I think you're right: there is nothing for us to do here. The original idea was to explore whether we could load the models into the GPU and send the GPU allocation address to the inference engine. However, it seems no engine supports this today or in the foreseeable future.

But one thing we should be careful about here is that we still load the models to disk rather than through the CPU buffer into GPU memory. So I suggest we add an annotation to the Playground / Inference Service; then in orchestration, once we detect that the Inference Service has the annotation, we will not construct the initContainer and will not render the ModelPath in the arguments, so the inference engine will handle all the loading logic.

Would you like to refactor the PR based on this? @cr7258
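A hypothetical illustration of the suggestion above; the annotation key is invented for clarity and is not defined by this PR or by the project.

```yaml
# Hypothetical sketch: if a Playground / Inference Service carries this annotation,
# orchestration would skip the model-loader initContainer and would not render the
# ModelPath argument, leaving all loading to the inference engine.
metadata:
  annotations:
    llmaz.io/skip-model-loader: "true"   # invented key name, for illustration only
```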

@@ -77,6 +77,26 @@ spec:
       limits:
         cpu: 8
         memory: 16Gi
+  - name: runai-streamer
Review comment from a Member:

It can be part of the example but I wouldn't like to make it part of the default template.

@cr7258 (Contributor, Author) commented May 26, 2025

@kerthcet Ok, I'll refactor the PR this week.
