[Core] Add engine option to return only deltas or final output #7381
Conversation
This is ready; I'm just planning to add a couple of new tests for it, so I'll hold off on adding the ready label until then.
The `LLMEngine` and `AsyncLLMEngine` APIs currently return/stream cumulative outputs and prompt-related data for all sequences at every step. This is more data than needed for `LLM.generate` or the OpenAI server APIs:

- For `LLM.generate` and non-streaming APIs, we only need the final output.
- For streaming APIs, we only require deltas.

This PR adds an `output_kind` parameter to `SamplingParams` with an enum value of either `CUMULATIVE`, `DELTA`, or `FINAL_ONLY`. In the `DELTA` case, data associated with the prompt (prompt token ids, logits, etc.) is returned only in the first output message(s). This reduces the number of objects that need to be constructed at each step, and the amount of data to be serialized to return to the newly decoupled front-end API process.
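For illustration, a minimal streaming sketch, assuming the new enum is exposed as `RequestOutputKind` in `vllm.sampling_params` (the import path and model name here are placeholders, not something this PR prescribes):

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.sampling_params import RequestOutputKind  # assumed location of the new enum


async def stream_deltas() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))

    # DELTA: each yielded RequestOutput carries only the text/token ids
    # generated since the previous step, rather than the cumulative output.
    params = SamplingParams(max_tokens=64,
                            output_kind=RequestOutputKind.DELTA)

    async for output in engine.generate("Hello, my name is", params,
                                        request_id="req-0"):
        print(output.outputs[0].text, end="", flush=True)


asyncio.run(stream_deltas())
```

With `FINAL_ONLY`, the same loop would be expected to yield just once, when the request finishes.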
Blocked on some testing issues that I'm trying to solve in #7565.
@njhill thanks for adding the delta request output support, this is really helpful! Left some nit comments, but looks good overall
```python
raise ValueError(
    "Sampling parameters are missing for a CompletionRequest.")
finished = seq_group.is_finished()
if sampling_params.output_kind == RequestOutputKind.FINAL_ONLY and (
```
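For context, a sketch of the short-circuit this hunk appears to implement; the snippet above is truncated, so the tail of the condition is an assumption, and the helper name `maybe_make_request_output` is hypothetical:

```python
from typing import Optional

from vllm.outputs import RequestOutput
from vllm.sampling_params import RequestOutputKind
from vllm.sequence import SequenceGroup


def maybe_make_request_output(seq_group: SequenceGroup) -> Optional[RequestOutput]:
    """Sketch: with FINAL_ONLY, skip building an output for a request
    that hasn't finished yet; only the final result is returned."""
    sampling_params = seq_group.sampling_params
    if sampling_params is None:
        raise ValueError(
            "Sampling parameters are missing for a CompletionRequest.")

    finished = seq_group.is_finished()
    if (sampling_params.output_kind == RequestOutputKind.FINAL_ONLY
            and not finished):
        return None  # nothing to emit until the request completes

    # Otherwise build and return the RequestOutput as usual.
    ...
```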
Good idea to use `sampling_params` as the place for `output_kind`; it makes everything simpler.
```python
outputs = []
include_prompt = True
for seq in top_n_seqs:
```
Good that it also includes the beam search case
Also avoid appending delta token ids to sequences in cases where they aren't needed.
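To make the delta bookkeeping concrete, here is a toy, self-contained sketch; these are not vLLM's actual `Sequence`/`RequestOutput` classes, and all names below are illustrative. Each message carries only the tokens produced since the previous one, and prompt data rides along only with the first message:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeqState:
    """Toy stand-in for a sequence: all generated tokens plus a cursor
    marking how many have already been sent to the client."""
    output_token_ids: List[int] = field(default_factory=list)
    num_sent: int = 0


def take_delta(seq: SeqState) -> List[int]:
    """Return only the token ids produced since the previous message."""
    delta = seq.output_token_ids[seq.num_sent:]
    seq.num_sent = len(seq.output_token_ids)
    return delta


def build_delta_message(seqs: List[SeqState],
                        prompt_token_ids: List[int],
                        first_message: bool) -> dict:
    """Prompt data is attached only to the first message for a request."""
    return {
        "prompt_token_ids": prompt_token_ids if first_message else None,
        "outputs": [take_delta(seq) for seq in seqs],
    }


# Example: two steps of streaming for a single sequence.
seq = SeqState(output_token_ids=[11, 12])
print(build_delta_message([seq], prompt_token_ids=[1, 2, 3], first_message=True))
seq.output_token_ids += [13, 14]
print(build_delta_message([seq], prompt_token_ids=[1, 2, 3], first_message=False))
```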
Failed tests are unrelated flakes / already failing on main.
Ok cool, LGTM.