Your current environment
None
How would you like to use vllm
I want to use the OpenAI library to do offline batch inference leveraging Ray (for scaling and scheduling) on top of vLLM.
Context: The plan is to build a FastAPI service that closely mimics OpenAI's batch API and allows processing a larger number of prompts (tens of thousands) within 24 hours. There are a few options for achieving this with vLLM, but each one has some important drawback, and maybe I am missing something:
- There is an existing guide in the docs that uses the `LLM` class with Ray. While the `LLM` class shares the OpenAI sampling parameters, it lacks the important OpenAI prompt templating (see the first sketch after this list).
- The `run_batch.py` entrypoint that was introduced here would be the simplest one, but it does not support Ray out of the box (see the second sketch after this list).
- The third option would be to use the `AsyncLLMEngine` as done here and bundle it with `OpenAIServingChat`, as has been done in `run_batch.py`. But this would entail some (potential) performance degradation from going async, even though that is not really needed for offline batch inference.
- The fourth option could be to use Ray Serve like in this example from Ray's docs. But this would lack the OpenAI batch format and is, again, async.
Maybe this helps other people as well. Would be super grateful for some feedback. 🙂
And thanks a ton for this very nice piece of software and the great community!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.