
feat: support runai streamer for vllm #423

Open · wants to merge 1 commit into base: main

Conversation

@cr7258 (Contributor) commented May 19, 2025

What this PR does / why we need it

Add a new config runai-streamer to the vLLM BackendRuntime to allow loading models with the Run:ai Model Streamer, which improves model loading times. Currently, only vLLM supports the Run:ai Model Streamer.
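For reference, the new entry in the vLLM BackendRuntime looks roughly like the sketch below. The field names and resource values are illustrative only (they mirror the shape of the diff later in this thread, not the exact BackendRuntime schema); the essential piece is switching vLLM to --load-format runai_streamer.

```yaml
# Illustrative sketch only: not the exact BackendRuntime schema.
# A named config entry that starts vLLM with the Run:ai Model Streamer load format.
- name: runai-streamer
  args:
    - --load-format
    - runai_streamer          # stream safetensors through the Run:ai Model Streamer
  resources:                  # values copied from the neighbouring entries; tune as needed
    limits:
      cpu: 8
      memory: 16Gi
```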

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s

kl qwen2-0--5b-0  
Defaulted container "model-runner" out of: model-runner, model-loader (init)
INFO 05-19 02:00:18 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:18 api_server.py:912] vLLM API server version 0.7.3
INFO 05-19 02:00:18 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2-0--5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 05-19 02:00:18 api_server.py:209] Started engine process with PID 22
INFO 05-19 02:00:22 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:24 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', speculative_config=None, tokenizer='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2-0--5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 05-19 02:00:29 cuda.py:229] Using Flash Attention backend.
INFO 05-19 02:00:30 model_runner.py:1110] Starting to load model /workspace/models/models--Qwen--Qwen2-0.5B-Instruct...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 942.3 MiB for file: model.safetensors
Read throughput is 9.41 GB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s
INFO 05-19 02:00:30 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 05-19 02:00:31 worker.py:267] Memory profiling takes 0.88 seconds
INFO 05-19 02:00:31 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 05-19 02:00:31 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 17.54GiB.
INFO 05-19 02:00:31 executor_base.py:111] # cuda blocks: 95795, # CPU blocks: 21845
INFO 05-19 02:00:31 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 46.77x
INFO 05-19 02:00:36 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:12<00:00,  2.78it/s]
INFO 05-19 02:00:49 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.15 GiB
INFO 05-19 02:00:49 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 18.59 seconds
INFO 05-19 02:00:50 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 05-19 02:00:50 launcher.py:23] Available routes are:
INFO 05-19 02:00:50 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /health, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /tokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /detokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/models, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /version, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /pooling, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:46952 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:46958 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35464 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35468 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35482 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:51796 - "GET /health HTTP/1.1" 200 OK

Which issue(s) this PR fixes

Fixes #352

Special notes for your reviewer

Does this PR introduce a user-facing change?

support runai streamer for vllm

@InftyAI-Agent added labels needs-triage, needs-priority, and do-not-merge/needs-kind on May 19, 2025
@InftyAI-Agent requested a review from kerthcet on May 19, 2025 09:11
@cr7258 (Contributor, Author) commented May 19, 2025

/kind feature

@InftyAI-Agent added the feature label and removed do-not-merge/needs-kind on May 19, 2025
@kerthcet (Member) commented:

What we hope to achieve here is generally two things:

  • can we bump this into our model loader component and make it inference-agnostic
  • can we support GPUs other than NVIDIA

Both need experiments; sorry I didn't explain this clearly here. The original comment: #352 (comment)

The configuration is already open to users, so I don't think we need to do anything there.

@cr7258 (Contributor, Author) commented May 20, 2025

can we bump this into our model loader component and make it inference-agnostic

As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally.

Run:ai Model Streamer can speed up model loading by concurrently loading already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think there is anything to do in the model loader to support the Run:ai Model Streamer.

Additionally, Run:ai Model Streamer is not inference-agnostic — it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)
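For context, the log above shows the engine started with load_format='runai_streamer'. Below is a hedged sketch of the engine-side flags involved; the concurrency and memory_limit values are illustrative tuning examples based on vLLM's Run:ai Model Streamer options, not values set by this PR.

```yaml
# Sketch of the vLLM arguments involved (illustrative values):
args:
  - --load-format
  - runai_streamer                 # read safetensors via the Run:ai Model Streamer
  - --model-loader-extra-config    # optional streamer tuning: reader concurrency and CPU buffer size (bytes)
  - '{"concurrency": 16, "memory_limit": 5368709120}'
```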

@kerthcet (Member) commented:

I thought about this a bit, and I think you're right: there is nothing for us to do here. The original idea was to explore whether we could load the models into the GPU and send the GPU allocation address to the inference engine. However, it seems no engine supports this today or in the foreseeable future.

But one thing we should be careful about here is that we still load the models to disk rather than through the CPU buffer into GPU memory. So I suggest we add an annotation to the Playground / Inference Service; then in orchestration, once we detect that the Inference Service has the annotation, we will not construct the initContainer and will not render the ModelPath in the arguments, so the inference engine will handle all the loading logic.

Would you like to refactor the PR based on this? @cr7258
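A hypothetical illustration of the suggestion above; the annotation key is invented for clarity and is not defined by this PR or by the project.

```yaml
# Hypothetical sketch: if a Playground / Inference Service carries this annotation,
# orchestration would skip the model-loader initContainer and would not render the
# ModelPath argument, leaving all loading to the inference engine.
metadata:
  annotations:
    llmaz.io/skip-model-loader: "true"   # invented key name, for illustration only
```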

@@ -77,6 +77,26 @@ spec:
       limits:
         cpu: 8
         memory: 16Gi
+  - name: runai-streamer
Review comment from a Member:

It can be part of the example but I wouldn't like to make it part of the default template.

@cr7258 (Contributor, Author) commented May 26, 2025

@kerthcet Ok, I'll refactor the PR this week.
