
[RFC]: Improve guided decoding (logit_processor) APIs and performance. #5423

@rkooo567

Description


Motivation.

Currently, the guided decoding & logit processor API is incomplete and has several issues. This RFC is intended to bring up the problems and proposed solutions. Some of the issues may have already been addressed and have PRs out already.

There are three major issues.

  • Guided decoding is not supported from SamplingParams.
  • It is not possible to do batch/async logit processing.
  • Upon failures, the engine dies.

Proposed Change.

API

Guided decoding parameters are not supported with SamplingParams. This is addressed in #4130.
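As a purely hypothetical sketch (the field name guided_json and the schema below are illustrative assumptions, not the confirmed interface; see #4130 for the actual change), usage could look like:

    from vllm import LLM, SamplingParams

    # NOTE: `guided_json` is a hypothetical field used for illustration only;
    # the real parameter names are defined in #4130.
    schema = {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    }

    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=64,
        guided_json=schema,  # assumed field, not the confirmed API
    )

    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    outputs = llm.generate(["Answer as JSON: What is the capital of France?"],
                           sampling_params)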

Performance

Currently, logit processors are applied row by row in a blocking manner (logits_row = logits_processor(prompt_tokens_ids, ...)). Instead, we can use parallel processing (e.g., Ray or a thread pool) to improve logit processing performance. We are using this mechanism internally at Anyscale. We'd like to support this feature in OSS, and would like to improve the logit processor API to support 1. async and 2. batching.

This requires the logit processor interface to look like:

from typing import List

import torch

from vllm.sequence import SequenceGroupMetadata


class LogitPostProcessor:

    def initialize(self, logit_processor_config: "LogitProcessorConfig"):
        """Initialize the post processor. A post processor may have state
        such as a thread pool or Ray actors; it should be initialized here.
        """
        ...

    def prepare(
            self,
            seq_group_metadata_list: List[SequenceGroupMetadata]):
        """Asynchronously prepare logit masks."""
        ...

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        """Apply the prepared masks to the given logits."""
        ...

# For each model, we will have

def compute_logits(self, hidden_states, sampling_metadata):
    ...

def prepare_logits(self, seq_group_metadata_list):
    ...

prepare and apply assume 1:1 calls, i.e., once prepare is called, apply has to be called before prepare is called again. I think that is a safe assumption. Alternatively, we could make prepare return a class, but that would make the interface surface larger, so I don't prefer that solution (but I am open to feedback!).

This is an example usage of the API:

        # each model will have prepare_logits API
        self.model.prepare_logits(seq_group_metadata_list)
        hidden_states = model_executable(
            input_ids=input_tokens,
            positions=input_positions,
            kv_caches=kv_caches,
            attn_metadata=attn_metadata,
            **multi_modal_kwargs,
        )
        # Compute the logits. logit processors are applied here.
        logits = self.model.compute_logits(hidden_states, sampling_metadata)
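
For illustration, here is a minimal sketch of how the interface could be implemented with a thread pool so that mask preparation overlaps with the model forward pass. The class name ThreadPoolLogitPostProcessor and the helper _compute_mask are hypothetical and only meant to show the prepare/apply contract, not a proposed implementation:

    from concurrent.futures import Future, ThreadPoolExecutor
    from typing import List, Optional

    import torch

    from vllm.sequence import SequenceGroupMetadata


    class ThreadPoolLogitPostProcessor:
        """Hypothetical LogitPostProcessor backed by a thread pool."""

        def initialize(self, logit_processor_config):
            # Stateful resources (thread pool, Ray actors, ...) live here.
            self._pool = ThreadPoolExecutor(max_workers=8)
            self._futures: List[Future] = []

        def prepare(self, seq_group_metadata_list: List[SequenceGroupMetadata]):
            # Kick off mask computation and return immediately; the model
            # forward pass runs while the masks are being prepared.
            self._futures = [
                self._pool.submit(self._compute_mask, metadata)
                for metadata in seq_group_metadata_list
            ]

        def apply(self, logits: torch.Tensor) -> torch.Tensor:
            # Block until all masks are ready, then apply them row by row.
            # For simplicity, this assumes one sequence per sequence group.
            for row, future in enumerate(self._futures):
                mask: Optional[torch.Tensor] = future.result()
                if mask is not None:
                    logits[row] += mask  # e.g., -inf for disallowed tokens
            self._futures = []
            return logits

        def _compute_mask(self, metadata: SequenceGroupMetadata):
            # Hypothetical helper: build a -inf mask from the grammar/FSM
            # state of the sequence group (e.g., via lmformatenforcer).
            ...

Because prepare is called before the forward pass and apply runs inside compute_logits, the mask computation cost can be hidden behind GPU execution.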

We are also considering upstreaming a Ray-based batch processing implementation with lmformatenforcer.

Failure Handling

When using a stateful logit processor, it is possible for requests to fail. For example, if we use Ray, Ray actors can die, or there could be an issue with the user's schema that cannot be caught ahead of time.

When that happens, we should fail the seq_group immediately. We will introduce a new status FINISHED_INTERNAL_ERROR = enum.auto() to class SequenceStatus(enum.Enum). If any logit processor fails, we will mark the relevant seq_group as failed, and the request will be aborted.
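
A minimal sketch of the proposed status (existing enum members are abbreviated; only FINISHED_INTERNAL_ERROR is new in this RFC, and the is_finished handling shown is an assumption about how it would be wired in):

    import enum


    class SequenceStatus(enum.Enum):
        """Status of a sequence (existing members abbreviated)."""
        WAITING = enum.auto()
        RUNNING = enum.auto()
        FINISHED_STOPPED = enum.auto()
        FINISHED_ABORTED = enum.auto()
        # New in this RFC: the sequence finished because a logit processor
        # failed (e.g., a dead Ray actor or an invalid user schema).
        FINISHED_INTERNAL_ERROR = enum.auto()

        @staticmethod
        def is_finished(status: "SequenceStatus") -> bool:
            # FINISHED_INTERNAL_ERROR counts as finished so the request is
            # aborted rather than rescheduled.
            return status in (
                SequenceStatus.FINISHED_STOPPED,
                SequenceStatus.FINISHED_ABORTED,
                SequenceStatus.FINISHED_INTERNAL_ERROR,
            )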

Feedback Period.

No response

CC List.

cc @simon-mo @Yard1

Any Other Things.

No response
