feat: support runai streamer for vllm #423
base: main
Conversation
/kind feature
What we hope to achieve here is generally two things:
Both need experiments; sorry I didn't explain clearly here. The original comment: #352 (comment). The configuration is already open to users, so I don't think we need to do anything.
As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally. Run:ai Model Streamer can speed up model loading by concurrently loading already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think we need to do anything in the model loader to support Run:ai Model Streamer. Additionally, Run:ai Model Streamer is not engine-agnostic: it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)
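For context, a minimal sketch of how the engine itself handles the loading (assuming vLLM's runai_streamer load format; the image tag and S3 model path are illustrative only):

```yaml
# Minimal sketch, not the template in this PR: a vLLM container that streams
# the model itself instead of relying on a pre-download step.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest         # illustrative image tag
    args:
      - --model
      - s3://example-bucket/llama-3-8b     # illustrative remote model path
      - --load-format
      - runai_streamer                     # vLLM's Run:ai Model Streamer loader
```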
I thought a bit about this, and I think you're right, we can do nothing here. The original idea was to explore whether we can load the models to the GPU and send the GPU allocation address to the inference engine. However, it seems no engine supports this now or in the foreseeable future. But one thing we should be careful about is that we still load the models to disk rather than streaming them through a CPU buffer into GPU memory. So I suggest we add an annotation to the Playground / Inference Service; then in orchestration, once we detect that the Inference Service has the annotation, we won't construct the initContainer and won't render the ModelPath in the arguments, so the inference engine will handle all the loading logic. Would you like to refactor the PR based on this? @cr7258
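A rough sketch of that annotation-based idea (the annotation key below is hypothetical, not an existing API):

```yaml
# Hypothetical annotation (key name is an assumption): when present,
# orchestration would skip the model-loader initContainer and omit ModelPath
# from the rendered arguments, leaving all loading to the engine.
metadata:
  name: llama-runai-streamer
  annotations:
    llmaz.io/skip-model-loader: "true"
```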
@@ -77,6 +77,26 @@ spec:
        limits:
          cpu: 8
          memory: 16Gi
    - name: runai-streamer
It can be part of the example but I wouldn't like to make it part of the default template.
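For reference, a hedged sketch of what such an example entry could look like (the flags assume vLLM's runai_streamer load format, and the resource values simply mirror the surrounding template):

```yaml
# Sketch of a documented example entry, not part of the default template.
- name: runai-streamer
  args:
    - --load-format
    - runai_streamer          # assumes vLLM's Run:ai Model Streamer loader
  resources:
    requests:
      cpu: 8
      memory: 16Gi
    limits:
      cpu: 8
      memory: 16Gi
```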
@kerthcet Ok, I'll refactor the PR this week.
What this PR does / why we need it
Add a new config `runai-streamer` in the vLLM BackendRuntime to allow loading models using Run:ai Model Streamer, improving model loading times. Currently, only vLLM supports Run:ai Model Streamer.

Which issue(s) this PR fixes
Fixes #352
Special notes for your reviewer
Does this PR introduce a user-facing change?