
[RFC]: Isolate OpenAI Server Into Separate Process #6797

@robertgshaw2-redhat

Description


Motivation.

Currently, the OpenAI API server and AsyncLLMEngine share the same asyncio event loop. This means that the API server and the CPU components of the AsyncLLMEngine contend for the same resources. Below, we have a chart of Llama-3-8B running ShareGPT at QPS=10 on an H100.

We can see time is split into three buckets:

  • Light blue → this is overlapped GPU execution time (the call to execute_model)
  • Orange → this is CPU execution time in the LLM engine
  • All else → this is API server time
[Figure: per-request time breakdown for Llama-3-8B, ShareGPT at QPS=10, on an H100]
  • execute_model does not block the asyncio event loop, since it is run via run_in_executor:
    output = await make_async(self.driver_worker.execute_model)(execute_model_req=execute_model_req)
  • So, we believe that some GPU time is already overlapped with the API server (a minimal illustration of this pattern follows below)
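For illustration, here is a minimal sketch of that pattern (a simplified stand-in in the spirit of the make_async helper, not the actual vLLM code): the blocking call runs on a thread-pool executor, so only its completion is awaited and the event loop stays free to serve API requests.

    import asyncio
    import functools
    import time


    def make_async(func):
        """Wrap a blocking function so calls run in the default thread-pool executor."""
        def wrapper(*args, **kwargs):
            loop = asyncio.get_running_loop()
            return loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
        return wrapper


    def execute_model_blocking(step: int) -> str:
        # Stand-in for the GPU forward pass: it blocks its worker thread, not the event loop.
        time.sleep(0.05)
        return f"step {step} done"


    async def engine_step_loop() -> None:
        run_model = make_async(execute_model_blocking)
        for step in range(3):
            # While this await is pending, the event loop can run API-server coroutines.
            print(await run_model(step))


    asyncio.run(engine_step_loop())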

Proposed Change.

Enable better overlapping of the API server and AsyncLLMEngine by splitting them into two processes that communicate over gRPC with protobufs. This is roughly the architecture used by TGI (though TGI has more items in the server than we are proposing here).
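As a rough sketch of the shape of this split (using a multiprocessing.Pipe and plain dicts as a stand-in for the proposed gRPC/protobuf channel; all message fields and names below are hypothetical), the engine runs in its own process and the API server process only serializes requests and streams back results:

    import multiprocessing as mp


    def engine_process(conn) -> None:
        """Stand-in for the AsyncLLMEngine process: receives requests, streams back outputs."""
        while True:
            request = conn.recv()          # in the proposal: a protobuf message over gRPC
            if request is None:            # shutdown sentinel
                break
            for i in range(request["max_tokens"]):
                conn.send({"request_id": request["request_id"], "token_id": i})
            conn.send({"request_id": request["request_id"], "finished": True})


    def api_server() -> None:
        """Stand-in for the OpenAI API server process."""
        parent_conn, child_conn = mp.Pipe()
        engine = mp.Process(target=engine_process, args=(child_conn,))
        engine.start()

        parent_conn.send({"request_id": "req-0", "prompt_token_ids": [1, 2, 3], "max_tokens": 4})
        while True:
            msg = parent_conn.recv()
            if msg.get("finished"):
                break
            print("streamed:", msg)

        parent_conn.send(None)
        engine.join()


    if __name__ == "__main__":
        api_server()

The point of the split is that the HTTP and (de)serialization work in the API server process no longer competes with the engine's CPU work for the same event loop.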

Initial Goal

  • Write a protobuf API that roughly matches the AsyncLLMEngine generate() method, move the API server into a separate process, and measure the speedup
    • This will exclude LogitsProcessors - it's agreed that these don't belong in SamplingParams anyhow. Instead we'll include the corresponding external API parameters, e.g. for guided decoding
    • The response can be a gRPC streaming response, which maps nicely to the stream returned by generate(); cancellations can be mapped to request aborts (see the sketch after this list)
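A minimal asyncio sketch of that mapping (plain Python rather than real gRPC; the FakeEngine, generate_rpc handler, and field names are hypothetical): each item yielded by generate() becomes one streamed response message, and a client cancellation surfaces as CancelledError, which is translated into an engine abort.

    import asyncio


    class FakeEngine:
        """Stand-in for AsyncLLMEngine, exposing generate() and abort()."""

        async def generate(self, prompt: str, request_id: str):
            for i in range(100):
                await asyncio.sleep(0.01)          # pretend decode step
                yield f"{prompt}-token{i}"

        async def abort(self, request_id: str) -> None:
            print(f"abort({request_id}) sent to engine")


    async def generate_rpc(engine: FakeEngine, prompt: str, request_id: str):
        """Server-side streaming handler: one yield per streamed response message."""
        try:
            async for output in engine.generate(prompt, request_id):
                yield output
        except asyncio.CancelledError:
            # The client cancelled the stream -> abort the request in the engine.
            await engine.abort(request_id)
            raise


    async def consume(engine: FakeEngine) -> None:
        async for msg in generate_rpc(engine, "hello", "req-0"):
            print("streamed:", msg)


    async def main() -> None:
        engine = FakeEngine()
        task = asyncio.create_task(consume(engine))
        await asyncio.sleep(0.05)
        task.cancel()                              # simulates a client-side cancellation
        try:
            await task
        except asyncio.CancelledError:
            pass


    asyncio.run(main())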

Follow Ups

Move items that currently run inside the AsyncLLMEngine into the API server process, for better overlap with the GPU:

  • Make the necessary changes within the engine to support a fully token-ids-in/token-ids-out mode, so that it can be called with no tokenization/detokenization at all. Token ids can already be passed in, but there are places where tokenization/detokenization happens unconditionally, such as detokenizing the prompt in order to return it
  • Decouple LogitsProcessor construction from the OpenAI API code into a layer/utility that creates them from the top-level parameters. This can then be called from the gRPC API layer to construct the full SamplingParams
  • Decouple the incremental detokenization logic so that it can also be used in the API server process
  • The API server process can also intercept/process any stop strings and abort the request as needed (see the sketch after this list)
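A rough sketch of the last two items together (toy tokenizer and hypothetical names; a real implementation also has to handle tokens that only decode cleanly once later tokens arrive): the API server process incrementally detokenizes streamed token ids, checks stop strings against the accumulated text, and aborts the request in the engine once a stop string appears.

    from typing import Callable, Dict, Iterable, List

    # Toy vocabulary standing in for a real tokenizer.
    VOCAB: Dict[int, str] = {0: "Hello", 1: ",", 2: " world", 3: "!", 4: " STOP", 5: " more"}


    def decode(token_ids: List[int]) -> str:
        return "".join(VOCAB[t] for t in token_ids)


    class IncrementalDetokenizer:
        """Tracks the text emitted so far and returns only the newly decoded suffix."""

        def __init__(self) -> None:
            self.token_ids: List[int] = []
            self.emitted = ""

        def append(self, token_id: int) -> str:
            self.token_ids.append(token_id)
            full_text = decode(self.token_ids)
            new_text = full_text[len(self.emitted):]
            self.emitted = full_text
            return new_text


    def stream_with_stop_strings(token_stream: Iterable[int],
                                 stop_strings: List[str],
                                 abort: Callable[[], None]) -> str:
        detok = IncrementalDetokenizer()
        text = ""
        for token_id in token_stream:
            text += detok.append(token_id)
            for stop in stop_strings:
                if stop in text:
                    abort()                           # tell the engine to stop generating
                    return text.split(stop, 1)[0]     # truncate the output at the stop string
        return text


    print(stream_with_stop_strings(
        iter([0, 1, 2, 3, 4, 5]),
        stop_strings=[" STOP"],
        abort=lambda: print("abort sent to engine"),
    ))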

Feedback Period.

No response

CC List.

@simon-mo @njhill @dsikka @joerunde

Any Other Things.

No response
