
[Usage]: Ray + vLLM OpenAI (offline) Batch Inference #8636

@mbuet2ner

Description

Your current environment

None

How would you like to use vllm

I want to use the OpenAI library to do offline batch inference, leveraging Ray (for scaling and scheduling) on top of vLLM.

Context: The plan is to build a FastAPI service that closely mimics OpenAI's batch API and allows processing a larger number of prompts (tens of thousands) within 24h. There are a few options for achieving this with vLLM, but each one has an important drawback (unless I am missing something):

  • There is an existing guide in the docs that uses the LLM class with Ray. While the LLM class shares the OpenAI sampling parameters, it lacks the OpenAI prompt (chat) templating, which matters here (a rough workaround sketch follows right after this list).
  • The run_batch.py entrypoint that was introduced here would be the simplest option, but it does not support Ray out of the box (the second sketch below shows one way to wrap it in a Ray task).
  • The third option would be to use the AsyncLLMEngine as done here and bundle it with OpenAIServingChat, as is done in run_batch.py. But this would entail some (potential) performance degradation from going async, even though that is not really needed for offline batch inference.
  • The fourth option could be to use Ray Serve, like in this example from Ray's docs. But this would lack the OpenAI batch format and is, again, async.
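
For option 1, here is roughly what I have in mind as a workaround for the missing templating. This is only a sketch, not an official vLLM batch API: Ray Data shards the requests across GPU actors, each actor runs an LLM engine, and the model's Hugging Face chat template stands in for the OpenAI prompt templating. The model name, file paths, batch size, and concurrency are placeholders, and I assume a simplified requests.jsonl with a top-level "messages" column rather than the full OpenAI batch envelope.

```python
# Sketch for option 1: Ray Data + vLLM LLM class + HF chat template.
# MODEL, paths, batch_size, and concurrency below are placeholders.
import ray
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder


class ChatPredictor:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL)
        self.llm = LLM(model=MODEL)  # one engine per GPU actor
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):  # batch is a pandas.DataFrame (batch_format="pandas")
        # Render OpenAI-style "messages" lists into prompts via the chat template.
        prompts = [
            self.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            for messages in batch["messages"]
        ]
        outputs = self.llm.generate(prompts, self.params)
        batch["completion"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.read_json("requests.jsonl")  # placeholder input
ds = ds.map_batches(
    ChatPredictor,
    batch_format="pandas",
    batch_size=64,    # placeholder
    concurrency=2,    # number of GPU actors (placeholder)
    num_gpus=1,       # one GPU per actor
)
ds.write_json("results/")
```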

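And for option 2, wrapping run_batch in Ray could look something like the following (again just a sketch): each pre-split shard of the OpenAI batch file goes to a Ray task that shells out to the documented run_batch entrypoint. Shard paths and the model name are placeholders.

```python
# Sketch for option 2: Ray handles scheduling/placement only; each task runs the
# documented vllm.entrypoints.openai.run_batch entrypoint on one shard.
import subprocess
import sys

import ray


@ray.remote(num_gpus=1)
def run_batch_shard(input_path: str, output_path: str, model: str) -> str:
    """Process one OpenAI-batch-format shard on the GPU Ray assigned to this task."""
    subprocess.run(
        [
            sys.executable, "-m", "vllm.entrypoints.openai.run_batch",
            "-i", input_path,
            "-o", output_path,
            "--model", model,
        ],
        check=True,
    )
    return output_path


if __name__ == "__main__":
    ray.init()
    shards = ["shard-0.jsonl", "shard-1.jsonl"]  # pre-split batch files (placeholder)
    results = ray.get([
        run_batch_shard.remote(
            s,
            s.replace(".jsonl", ".results.jsonl"),
            "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
        )
        for s in shards
    ])
    print(results)
```

The obvious downside of this variant is one engine startup per shard, but it keeps the OpenAI batch format end to end.
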
Maybe this helps other people as well. Would be super grateful for some feedback. 🙂
And thanks a ton for this very nice piece of software and the great community!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
