Motivation.
This RFC describes AWS Neuron's plan to integrate NeuronX Distributed (NxD) Inference with the V1 architecture in vLLM.
In collaboration with: @elaineyz @mrinalks @rgrandhiamzn
Proposed Changes.
Change 1: Chunked prefill support — the default mechanism in V1.
We have upgraded the core Neuron interfaces to comply with the V1 engine and enable chunked prefill. No modifications to vLLM core interfaces were needed. This change offers a native implementation of chunked prefill, a required feature on V1. In summary (a minimal worker sketch follows this list):
- Neuron Worker compliant with the V1 Model Loader and Model Runner.
- Chunked Prefill Worker that enables chunked prefill on NxD Inference.
- No modifications to vLLM core interfaces.
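To make the shape of Change 1 concrete, here is an illustrative-only sketch of the worker surface involved; the class and method names are assumptions modeled on vLLM's V1 worker conventions, not the actual NxD Inference implementation:

```python
# Illustrative-only sketch; class/method names are assumptions based
# on vLLM's V1 worker conventions, not the actual NxD implementation.
class NeuronWorker:
    """A Neuron worker that plugs into the V1 engine loop."""

    def __init__(self, vllm_config, local_rank: int, rank: int):
        self.vllm_config = vllm_config
        self.local_rank = local_rank
        self.rank = rank
        self.model_runner = None  # set up in load_model()

    def init_device(self) -> None:
        # Initialize the NeuronCore device for this rank.
        ...

    def load_model(self) -> None:
        # Build a V1-style model runner that compiles and loads the
        # model through NxD Inference.
        ...

    def execute_model(self, scheduler_output):
        # Run one engine step. With chunked prefill enabled, a single
        # step may mix prefill chunks and decode tokens.
        ...
```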
Change 2: Optionally disable chunked prefill — for higher performance on Neuron.
For customers who see higher performance on Neuron without chunked prefill, we offer an option to disable it.
Option 1: New scheduler mode in V1 that disables chunked prefill (RECOMMENDED)
- While we understand that chunked prefill is a defining characteristic of the V1 architecture, we believe that allowing the flexibility to optionally disable it can make vLLM more easily adoptable by more hardware providers and users.
- A simple, non-intrusive way to turn off chunked prefill on top of the V1 scheduler has already been demonstrated by IBM Spyre's implementation. Concretely, we can override the schedule() function to intercept running requests (in practice, decode-phase requests) and check whether there are requests in the waiting queue (in practice, new requests ready to be prefilled). If so, we temporarily purge the running queue and have the base scheduler perform scheduling as usual (on the waiting requests first). After the base scheduler is finished, the override restores the running queue. A sketch of this override follows below.
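A minimal sketch of this override, assuming the V1 scheduler exposes waiting and running queues and a schedule() method; the import path and class name are illustrative, not a final API:

```python
# Minimal sketch of Option 1; import path, class name, and the exact
# queue attributes are assumptions based on the V1 scheduler layout.
from vllm.v1.core.sched.scheduler import Scheduler


class NoChunkedPrefillScheduler(Scheduler):
    """Schedules whole prefills ahead of decodes instead of chunking."""

    def schedule(self):
        if not self.waiting:
            # No new requests: proceed with normal decode scheduling.
            return super().schedule()
        # Hide running (decode-phase) requests so the base scheduler
        # only sees the waiting queue and schedules full, unchunked
        # prefills for the new requests first.
        stashed_running = self.running
        self.running = []
        output = super().schedule()
        # Restore the decode-phase requests; the base scheduler may
        # have moved newly scheduled requests into `running`.
        self.running = self.running + stashed_running
        return output
```

Because the override only stashes and restores state around the base schedule() call, all other V1 scheduling behavior is inherited unchanged.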
Option 2: Override V1 scheduler via a scheduler plugin repository
- We can maintain a lightweight plugin repository that exposes the scheduler extension logic described in Option 1. Note: no additional in-tree code is needed to achieve this behavior.
- To opt in to this implementation, users only need to pip install this repository in addition to vllm, and can continue to invoke vLLM exactly as they normally would (e.g. python -m vllm.entrypoints.openai.api_server).
- Users can still disable the scheduler override after installing the plugin, via an environment variable (see the sketch after this list).
- We plan to house the plugin repository on the AWS Neuron GitHub.
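As a sketch of what the plugin's entry point could look like, assuming vLLM's vllm.general_plugins entry-point group; the package, module, and environment-variable names below are hypothetical:

```python
# Hypothetical plugin module (e.g. neuron_scheduler_plugin/__init__.py).
# The package would declare, in its pyproject.toml:
#   [project.entry-points."vllm.general_plugins"]
#   neuron_scheduler = "neuron_scheduler_plugin:register"
# vLLM discovers and calls every entry point registered in the
# "vllm.general_plugins" group at startup.
import os


def register() -> None:
    """Called by vLLM when the plugin package is installed."""
    # Opt-out switch: keep the plugin installed but disable the
    # override (the variable name here is an assumption).
    if os.environ.get("VLLM_NEURON_DISABLE_SCHEDULER_OVERRIDE", "0") == "1":
        return
    # Point the V1 engine at the custom scheduler from Option 1,
    # e.g. by setting the scheduler class in the engine config.
    ...
```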
Feedback Period.
1 week
CC List.
@aarondou
@aws-navyadhara
@aws-satyajith
@aws-yishanm
@elaineyz
@liangfu
@mrinalks
@rgrandhiamzn
@robertgshaw2-redhat
@simon-mo
@WoosukKwon
@DarkLight1337
Any Other Things.
- The Transformers NeuronX (TNx) framework will be deprecated upon Neuron’s migration to V1.
- Follow-up RFCs:
  - Proof-of-concept native integration in vLLM modeling code.
  - Performant chunked prefill on Neuron.