
Conversation

njhill (Member) commented Aug 9, 2024

The LLMEngine and AsyncLLMEngine APIs currently return/stream cumulative outputs and prompt-related data for all sequences at every step.

This is more data than needed for LLM.generate or the OpenAI server APIs:

  • For LLM.generate and non-streaming APIs we only need the final output
  • For streaming APIs we only require deltas

This PR adds an output_kind parameter to SamplingParams, an enum whose value is one of CUMULATIVE, DELTA, or FINAL_ONLY.

In the DELTA case, data associated with the prompt (prompt token ids, logits, etc.) is returned only in the first output message(s).

This reduces the number of objects that need to be constructed at each step and the amount of data that must be serialized to return to the newly decoupled front-end API process.
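
For illustration, here is a minimal sketch of how a streaming caller might consume DELTA outputs once this lands. It assumes RequestOutputKind is importable from vllm.sampling_params (where the diff adds it); the model name and request id are placeholders, not part of this PR.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.sampling_params import RequestOutputKind  # added by this PR


async def stream_deltas(engine: AsyncLLMEngine, prompt: str, request_id: str) -> str:
    params = SamplingParams(
        max_tokens=64,
        # DELTA: each RequestOutput carries only what was generated since the
        # previous step; prompt data arrives only with the first message.
        output_kind=RequestOutputKind.DELTA,
    )
    text = ""
    async for output in engine.generate(prompt, params, request_id):
        # The caller accumulates the pieces itself.
        text += output.outputs[0].text
    return text


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))  # placeholder model
    print(await stream_deltas(engine, "Hello, my name is", "req-0"))


if __name__ == "__main__":
    asyncio.run(main())
```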

njhill (Member, Author) commented Aug 12, 2024

This is ready; I'm just planning to add a couple of new tests for it, so I'll hold off on adding the ready label until then.

njhill added 4 commits August 13, 2024 07:38

njhill (Member, Author) commented Aug 15, 2024

Blocked on some testing issues that I'm trying to solve in #7565

njhill mentioned this pull request Aug 26, 2024
njhill added 7 commits August 27, 2024 09:17
(merge commits; conflicts resolved in tests/entrypoints/openai/test_chat.py, vllm/engine/llm_engine.py, vllm/entrypoints/llm.py, vllm/entrypoints/openai/protocol.py, vllm/entrypoints/openai/serving_completion.py, vllm/entrypoints/openai/serving_chat.py, and vllm/sampling_params.py)

alexm-redhat (Collaborator) left a comment

@njhill thanks for adding the delta request output support, this is really helpful! Left some nit comments, but looks good overall

    raise ValueError(
        "Sampling parameters are missing for a CompletionRequest.")
finished = seq_group.is_finished()
if sampling_params.output_kind == RequestOutputKind.FINAL_ONLY and (

Good idea to use sampling_params as the place for output_kind; it makes everything simpler.
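
As a rough illustration of why that simplifies things (a toy sketch, not the PR's actual code), the emit-or-skip decision can be made from the sampling params alone:

```python
from enum import IntEnum


class RequestOutputKind(IntEnum):
    """Mirror of the enum this PR adds; values here are illustrative."""
    CUMULATIVE = 0   # full output so far at every step (previous behaviour)
    DELTA = 1        # only what was generated since the previous step
    FINAL_ONLY = 2   # a single output once the request finishes


def should_emit_output(output_kind: RequestOutputKind, finished: bool) -> bool:
    """Return True if a RequestOutput should be built for this step."""
    if output_kind == RequestOutputKind.FINAL_ONLY:
        return finished
    return True


assert not should_emit_output(RequestOutputKind.FINAL_ONLY, finished=False)
assert should_emit_output(RequestOutputKind.DELTA, finished=False)
```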


outputs = []
include_prompt = True
for seq in top_n_seqs:

Good that it also includes the beam search case
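
The include_prompt flag in the hunk above is what keeps prompt data out of all but the first message in DELTA mode. Here is a self-contained toy of that pattern (the class and field names are illustrative, not the PR's):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRequestOutput:
    text: str
    prompt: Optional[str]
    prompt_token_ids: Optional[List[int]]


def build_output(text: str, prompt: str, prompt_token_ids: List[int],
                 include_prompt: bool) -> ToyRequestOutput:
    # Prompt fields ride along only when include_prompt is True (first delta);
    # later deltas carry None, so there is less to construct and serialize.
    return ToyRequestOutput(
        text=text,
        prompt=prompt if include_prompt else None,
        prompt_token_ids=prompt_token_ids if include_prompt else None,
    )


first = build_output("Hel", "Hi there", [101, 102], include_prompt=True)
later = build_output("lo", "Hi there", [101, 102], include_prompt=False)
assert first.prompt == "Hi there"
assert later.prompt is None and later.prompt_token_ids is None
```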

Also avoid appending delta token ids to sequences in cases where they aren't needed.
njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 12, 2024
njhill (Member, Author) commented Sep 12, 2024

Failed tests are unrelated flakes / already failing on main.

alexm-redhat (Collaborator) commented
Ok cool, LGTM

simon-mo merged commit 551ce01 into vllm-project:main Sep 12, 2024
64 of 70 checks passed
njhill deleted the reduce-output branch September 12, 2024 19:03
njhill added a commit to njhill/vllm that referenced this pull request Sep 13, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025