Motivation.
This RFC describes AWS Neuron's plan to integrate NeuronX Distributed (NxD) Inference with the V1 architecture in vLLM.
In collaboration with: @elaineyz @mrinalks @rgrandhiamzn
Proposed Changes.
Change 1: Chunked prefill support — the default mechanism in V1.
We have upgraded the core Neuron interfaces to comply with the V1 engine and enable chunked prefill. No modifications to vLLM core interfaces were needed. This change offers a native implementation of chunked prefill, a required feature on V1. In summary (a minimal worker sketch follows this list):
- Neuron Worker compliant with the V1 Model Loader and Model Runner.
- Chunked Prefill Worker that enables chunked prefill on NxD Inference.
- No modifications to vLLM core interfaces.
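To make the shape of Change 1 concrete, here is an illustrative-only sketch of the worker surface involved; the class and method names are assumptions modeled on vLLM's V1 worker conventions, not the actual NxD Inference implementation:

```python
# Illustrative-only sketch; class/method names are assumptions based
# on vLLM's V1 worker conventions, not the actual NxD implementation.
class NeuronWorker:
    """A Neuron worker that plugs into the V1 engine loop."""

    def __init__(self, vllm_config, local_rank: int, rank: int):
        self.vllm_config = vllm_config
        self.local_rank = local_rank
        self.rank = rank
        self.model_runner = None  # set up in load_model()

    def init_device(self) -> None:
        # Initialize the NeuronCore device for this rank.
        ...

    def load_model(self) -> None:
        # Build a V1-style model runner that compiles and loads the
        # model through NxD Inference.
        ...

    def execute_model(self, scheduler_output):
        # Run one engine step. With chunked prefill enabled, a single
        # step may mix prefill chunks and decode tokens.
        ...
```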
Change 2: Optionally disable chunked prefill — for higher performance on Neuron.
For customers who see higher performance on Neuron without chunked prefill, we offer an option to disable it.
Option 1: New scheduler mode in V1 that disables chunked prefill (RECOMMENDED)
- While we understand that chunked prefill is a defining characteristic of the V1 architecture, we believe that allowing the flexibility to optionally disable it can make vLLM more easily adoptable by more hardware providers and users.
- A simple, non-intrusive way to turn off chunked prefill on top of the V1 scheduler has already been demonstrated by IBM Spyre's implementation. Concretely, we can override the schedule() function to intercept running requests (in practice, decode-phase requests) and check whether there are requests in the waiting queue (in practice, new requests ready to be prefilled). If so, we temporarily purge the running queue and have the base scheduler perform scheduling as usual (on the waiting requests first). After the base scheduler is finished, the override restores the running queue. A sketch of this override follows below.
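A minimal sketch of this override, assuming the V1 scheduler exposes waiting and running queues and a schedule() method; the import path and class name are illustrative, not a final API:

```python
# Minimal sketch of Option 1; import path, class name, and the exact
# queue attributes are assumptions based on the V1 scheduler layout.
from vllm.v1.core.sched.scheduler import Scheduler


class NoChunkedPrefillScheduler(Scheduler):
    """Schedules whole prefills ahead of decodes instead of chunking."""

    def schedule(self):
        if not self.waiting:
            # No new requests: proceed with normal decode scheduling.
            return super().schedule()
        # Hide running (decode-phase) requests so the base scheduler
        # only sees the waiting queue and schedules full, unchunked
        # prefills for the new requests first.
        stashed_running = self.running
        self.running = []
        output = super().schedule()
        # Restore the decode-phase requests; the base scheduler may
        # have moved newly scheduled requests into `running`.
        self.running = self.running + stashed_running
        return output
```

Because the override only stashes and restores state around the base schedule() call, all other V1 scheduling behavior is inherited unchanged.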
Option 2: Override V1 scheduler via a scheduler plugin repository
- We can maintain a lightweight plugin repository that exposes the scheduler extension logic described in Option 1. Note: no additional in-tree code is needed to achieve this behavior.
- To opt in to this implementation, users only need to pip install this repository in addition to vllm, and can continue to invoke vLLM exactly as they normally would (e.g. python -m vllm.entrypoints.openai.api_server).
- Users can still disable the scheduler override after installing the plugin, via an environment variable (see the sketch after this list).
- We plan to house the plugin repository on the AWS Neuron GitHub.
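As a sketch of what the plugin's entry point could look like, assuming vLLM's vllm.general_plugins entry-point group; the package, module, and environment-variable names below are hypothetical:

```python
# Hypothetical plugin module (e.g. neuron_scheduler_plugin/__init__.py).
# The package would declare, in its pyproject.toml:
#   [project.entry-points."vllm.general_plugins"]
#   neuron_scheduler = "neuron_scheduler_plugin:register"
# vLLM discovers and calls every entry point registered in the
# "vllm.general_plugins" group at startup.
import os


def register() -> None:
    """Called by vLLM when the plugin package is installed."""
    # Opt-out switch: keep the plugin installed but disable the
    # override (the variable name here is an assumption).
    if os.environ.get("VLLM_NEURON_DISABLE_SCHEDULER_OVERRIDE", "0") == "1":
        return
    # Point the V1 engine at the custom scheduler from Option 1,
    # e.g. by setting the scheduler class in the engine config.
    ...
```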
Feedback Period.
1 week
CC List.
@aarondou
@aws-navyadhara
@aws-satyajith
@aws-yishanm
@elaineyz
@liangfu
@mrinalks
@rgrandhiamzn
@robertgshaw2-redhat
@simon-mo
@WoosukKwon
@DarkLight1337
Any Other Things.
- The Transformers NeuronX (TNx) framework will be deprecated upon Neuron’s migration to V1.
- Follow-up RFCs:
  - Proof-of-concept native integration in vLLM modeling code.
  - Performant chunked prefill on Neuron.