[Core] Add engine option to return only deltas or final output #7381
Conversation
This is ready; I'm just planning to add a couple of new tests for it, so I'll hold off on adding the ready label until then.
The `LLMEngine` and `AsyncLLMEngine` APIs currently return/stream cumulative outputs and prompt-related data for all sequences at every step. This is more data than needed for `LLM.generate` or the OpenAI server APIs:

- For `LLM.generate` and non-streaming APIs, we only need the final output.
- For streaming APIs, we only require deltas.

This PR adds an `output_kind` parameter to `SamplingParams` with an enum value of either `CUMULATIVE`, `DELTA`, or `FINAL_ONLY`. In the `DELTA` case, data associated with the prompt (prompt token ids, logits, etc.) is returned only in the first output message(s). This reduces the number of objects that need to be constructed at each step, and the amount of data to be serialized to return to the newly decoupled front-end API process.
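For illustration, a minimal streaming sketch, assuming the new enum is exposed as `RequestOutputKind` in `vllm.sampling_params` (the import path and model name here are placeholders, not something this PR prescribes):

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.sampling_params import RequestOutputKind  # assumed location of the new enum


async def stream_deltas() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))

    # DELTA: each yielded RequestOutput carries only the text/token ids
    # generated since the previous step, rather than the cumulative output.
    params = SamplingParams(max_tokens=64,
                            output_kind=RequestOutputKind.DELTA)

    async for output in engine.generate("Hello, my name is", params,
                                        request_id="req-0"):
        print(output.outputs[0].text, end="", flush=True)


asyncio.run(stream_deltas())
```

With `FINAL_ONLY`, the same loop would be expected to yield just once, when the request finishes.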
Blocked on some testing issues that I'm trying to solve in #7565.
@njhill thanks for adding the delta request output support, this is really helpful! Left some nit comments, but looks good overall
```python
raise ValueError(
    "Sampling parameters are missing for a CompletionRequest.")
finished = seq_group.is_finished()
if sampling_params.output_kind == RequestOutputKind.FINAL_ONLY and (
```
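For context, a sketch of the short-circuit this hunk appears to implement; the snippet above is truncated, so the tail of the condition is an assumption, and the helper name `maybe_make_request_output` is hypothetical:

```python
from typing import Optional

from vllm.outputs import RequestOutput
from vllm.sampling_params import RequestOutputKind
from vllm.sequence import SequenceGroup


def maybe_make_request_output(seq_group: SequenceGroup) -> Optional[RequestOutput]:
    """Sketch: with FINAL_ONLY, skip building an output for a request
    that hasn't finished yet; only the final result is returned."""
    sampling_params = seq_group.sampling_params
    if sampling_params is None:
        raise ValueError(
            "Sampling parameters are missing for a CompletionRequest.")

    finished = seq_group.is_finished()
    if (sampling_params.output_kind == RequestOutputKind.FINAL_ONLY
            and not finished):
        return None  # nothing to emit until the request completes

    # Otherwise build and return the RequestOutput as usual.
    ...
```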
Good idea to use `sampling_params` as the place for `output_kind`; it makes everything simpler.
```python
outputs = []
include_prompt = True
for seq in top_n_seqs:
```
Good that it also includes the beam search case
Also avoid appending delta token ids to sequences in cases where they aren't needed.
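To make the delta bookkeeping concrete, here is a toy, self-contained sketch; these are not vLLM's actual `Sequence`/`RequestOutput` classes, and all names below are illustrative. Each message carries only the tokens produced since the previous one, and prompt data rides along only with the first message:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeqState:
    """Toy stand-in for a sequence: all generated tokens plus a cursor
    marking how many have already been sent to the client."""
    output_token_ids: List[int] = field(default_factory=list)
    num_sent: int = 0


def take_delta(seq: SeqState) -> List[int]:
    """Return only the token ids produced since the previous message."""
    delta = seq.output_token_ids[seq.num_sent:]
    seq.num_sent = len(seq.output_token_ids)
    return delta


def build_delta_message(seqs: List[SeqState],
                        prompt_token_ids: List[int],
                        first_message: bool) -> dict:
    """Prompt data is attached only to the first message for a request."""
    return {
        "prompt_token_ids": prompt_token_ids if first_message else None,
        "outputs": [take_delta(seq) for seq in seqs],
    }


# Example: two steps of streaming for a single sequence.
seq = SeqState(output_token_ids=[11, 12])
print(build_delta_message([seq], prompt_token_ids=[1, 2, 3], first_message=True))
seq.output_token_ids += [13, 14]
print(build_delta_message([seq], prompt_token_ids=[1, 2, 3], first_message=False))
```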
Failed tests are unrelated flakes / already failing on main.
Ok cool, LGTM.