In my understanding, continuous batching is a trade-off between request latency and throughput: it can return the result of each completed request immediately, instead of waiting for every request in the batch to finish. However, the `llm.generate()` API seemingly always waits until all requests in a batch have completed before returning.
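To make the distinction concrete, here is a pure-Python analogy (not the vLLM API): a batch-style call collects every result before returning anything, while a continuous-style consumer hands back each result as soon as it finishes. The `fake_request` function and its durations are made up for illustration.

```python
import concurrent.futures
import time

def fake_request(req_id, duration):
    # Stand-in for one generation request that takes `duration` seconds.
    time.sleep(duration)
    return req_id

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Three hypothetical requests with different completion times.
    futures = [pool.submit(fake_request, i, d)
               for i, d in enumerate([0.3, 0.1, 0.2])]

    # Batch-style (analogous to llm.generate()): nothing is returned
    # until the slowest request finishes, then all results come at once.
    # results = [f.result() for f in futures]

    # Continuous-style: yield each result the moment it completes,
    # so fast requests are not held back by slow ones.
    completed_order = [f.result()
                       for f in concurrent.futures.as_completed(futures)]

print(completed_order)  # shortest request finishes first: [1, 2, 0]
```

This is only an analogy for the scheduling behavior; in vLLM the engine interleaves requests at the token level, but the user-visible difference is the same: a blocking batch API versus a streaming one that surfaces finished requests early.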