
[RFC]: Isolate OpenAI Server Into Separate Process #6797

@robertgshaw2-redhat

Description


Motivation.

Currently, the OpenAI API server and AsyncLLMEngine share the same asyncio event loop. This means that the API server and the CPU components of the AsyncLLMEngine contend for the same resources. Below, we have a chart of Llama-3-8B running ShareGPT at QPS=10 on an H100.

We can see time is split into three buckets:

  • Light blue → this is overlapped GPU execution time (the call to execute_model)
  • Orange → this is CPU execution time in the LLM engine
  • All else → this is API server time
[Figure: per-request time breakdown for Llama-3-8B, ShareGPT at QPS=10, on an H100]
  • execute_model does not block the asyncio event loop, since it is run via run_in_executor:
    output = await make_async(self.driver_worker.execute_model)(execute_model_req=execute_model_req)
  • So, we believe that some GPU time is already overlapped with the API server (a minimal illustration of this pattern follows below)
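For illustration, here is a minimal sketch of that pattern (a simplified stand-in in the spirit of the make_async helper, not the actual vLLM code): the blocking call runs on a thread-pool executor, so only its completion is awaited and the event loop stays free to serve API requests.

    import asyncio
    import functools
    import time


    def make_async(func):
        """Wrap a blocking function so calls run in the default thread-pool executor."""
        def wrapper(*args, **kwargs):
            loop = asyncio.get_running_loop()
            return loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
        return wrapper


    def execute_model_blocking(step: int) -> str:
        # Stand-in for the GPU forward pass: it blocks its worker thread, not the event loop.
        time.sleep(0.05)
        return f"step {step} done"


    async def engine_step_loop() -> None:
        run_model = make_async(execute_model_blocking)
        for step in range(3):
            # While this await is pending, the event loop can run API-server coroutines.
            print(await run_model(step))


    asyncio.run(engine_step_loop())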

Proposed Change.

Enable better overlapping of the API server and AsyncLLMEngine by splitting them into two processes that communicate over gRPC with protobufs. This is roughly the architecture used by TGI (though TGI has more items in the server than we are proposing here).
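As a rough sketch of the shape of this split (using a multiprocessing.Pipe and plain dicts as a stand-in for the proposed gRPC/protobuf channel; all message fields and names below are hypothetical), the engine runs in its own process and the API server process only serializes requests and streams back results:

    import multiprocessing as mp


    def engine_process(conn) -> None:
        """Stand-in for the AsyncLLMEngine process: receives requests, streams back outputs."""
        while True:
            request = conn.recv()          # in the proposal: a protobuf message over gRPC
            if request is None:            # shutdown sentinel
                break
            for i in range(request["max_tokens"]):
                conn.send({"request_id": request["request_id"], "token_id": i})
            conn.send({"request_id": request["request_id"], "finished": True})


    def api_server() -> None:
        """Stand-in for the OpenAI API server process."""
        parent_conn, child_conn = mp.Pipe()
        engine = mp.Process(target=engine_process, args=(child_conn,))
        engine.start()

        parent_conn.send({"request_id": "req-0", "prompt_token_ids": [1, 2, 3], "max_tokens": 4})
        while True:
            msg = parent_conn.recv()
            if msg.get("finished"):
                break
            print("streamed:", msg)

        parent_conn.send(None)
        engine.join()


    if __name__ == "__main__":
        api_server()

The point of the split is that the HTTP and (de)serialization work in the API server process no longer competes with the engine's CPU work for the same event loop.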

Initial Goal

  • Write a protobuf API that roughly matches the AsyncLLMEngine generate() method, move the API server into a separate process, and measure the speedup
    • This will exclude LogitsProcessors - it's agreed that these don't belong in SamplingParams anyhow. Instead we'll include the corresponding external API parameters, e.g. for guided decoding
    • The response can be a gRPC streaming response, which maps nicely to the stream returned by generate(); cancellations can be mapped to request aborts (see the sketch after this list)
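A minimal asyncio sketch of that mapping (plain Python rather than real gRPC; the FakeEngine, generate_rpc handler, and field names are hypothetical): each item yielded by generate() becomes one streamed response message, and a client cancellation surfaces as CancelledError, which is translated into an engine abort.

    import asyncio


    class FakeEngine:
        """Stand-in for AsyncLLMEngine, exposing generate() and abort()."""

        async def generate(self, prompt: str, request_id: str):
            for i in range(100):
                await asyncio.sleep(0.01)          # pretend decode step
                yield f"{prompt}-token{i}"

        async def abort(self, request_id: str) -> None:
            print(f"abort({request_id}) sent to engine")


    async def generate_rpc(engine: FakeEngine, prompt: str, request_id: str):
        """Server-side streaming handler: one yield per streamed response message."""
        try:
            async for output in engine.generate(prompt, request_id):
                yield output
        except asyncio.CancelledError:
            # The client cancelled the stream -> abort the request in the engine.
            await engine.abort(request_id)
            raise


    async def consume(engine: FakeEngine) -> None:
        async for msg in generate_rpc(engine, "hello", "req-0"):
            print("streamed:", msg)


    async def main() -> None:
        engine = FakeEngine()
        task = asyncio.create_task(consume(engine))
        await asyncio.sleep(0.05)
        task.cancel()                              # simulates a client-side cancellation
        try:
            await task
        except asyncio.CancelledError:
            pass


    asyncio.run(main())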

Follow Ups

Move items that currently run inside the AsyncLLMEngine into the API server process, for better overlap with the GPU:

  • Make the necessary changes within the engine to support a fully token-ids-in/token-ids-out mode, so that it can be called with no tokenization/detokenization at all. Token ids can already be passed in, but there are places where tokenization/detokenization happens unconditionally, such as detokenizing the prompt in order to return it
  • Decouple LogitsProcessor construction from the OpenAI API code into a layer/utility that creates them from the top-level parameters. This can then be called from the gRPC API layer to construct the full SamplingParams
  • Decouple the incremental detokenization logic so that it can also be used in the API server process
  • The API server process can also intercept/process any stop strings and abort the request as needed (see the sketch after this list)
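A rough sketch of the last two items together (toy tokenizer and hypothetical names; a real implementation also has to handle tokens that only decode cleanly once later tokens arrive): the API server process incrementally detokenizes streamed token ids, checks stop strings against the accumulated text, and aborts the request in the engine once a stop string appears.

    from typing import Callable, Dict, Iterable, List

    # Toy vocabulary standing in for a real tokenizer.
    VOCAB: Dict[int, str] = {0: "Hello", 1: ",", 2: " world", 3: "!", 4: " STOP", 5: " more"}


    def decode(token_ids: List[int]) -> str:
        return "".join(VOCAB[t] for t in token_ids)


    class IncrementalDetokenizer:
        """Tracks the text emitted so far and returns only the newly decoded suffix."""

        def __init__(self) -> None:
            self.token_ids: List[int] = []
            self.emitted = ""

        def append(self, token_id: int) -> str:
            self.token_ids.append(token_id)
            full_text = decode(self.token_ids)
            new_text = full_text[len(self.emitted):]
            self.emitted = full_text
            return new_text


    def stream_with_stop_strings(token_stream: Iterable[int],
                                 stop_strings: List[str],
                                 abort: Callable[[], None]) -> str:
        detok = IncrementalDetokenizer()
        text = ""
        for token_id in token_stream:
            text += detok.append(token_id)
            for stop in stop_strings:
                if stop in text:
                    abort()                           # tell the engine to stop generating
                    return text.split(stop, 1)[0]     # truncate the output at the stop string
        return text


    print(stream_with_stop_strings(
        iter([0, 1, 2, 3, 4, 5]),
        stop_strings=[" STOP"],
        abort=lambda: print("abort sent to engine"),
    ))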

Feedback Period.

No response

CC List.

@simon-mo @njhill @dsikka @joerunde

Any Other Things.

No response
