In my understanding, continuous batching is a trade-off between request latency and throughput: it can return the result of each completed request immediately, instead of waiting for every request in the batch to finish. However, the `llm.generate()` API seemingly always waits until all requests in a batch have completed before returning.
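To make the distinction concrete, here is a pure-Python analogy (not the vLLM API): a batch-style call collects every result before returning anything, while a continuous-style consumer hands back each result as soon as it finishes. The `fake_request` function and its durations are made up for illustration.

```python
import concurrent.futures
import time

def fake_request(req_id, duration):
    # Stand-in for one generation request that takes `duration` seconds.
    time.sleep(duration)
    return req_id

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Three hypothetical requests with different completion times.
    futures = [pool.submit(fake_request, i, d)
               for i, d in enumerate([0.3, 0.1, 0.2])]

    # Batch-style (analogous to llm.generate()): nothing is returned
    # until the slowest request finishes, then all results come at once.
    # results = [f.result() for f in futures]

    # Continuous-style: yield each result the moment it completes,
    # so fast requests are not held back by slow ones.
    completed_order = [f.result()
                       for f in concurrent.futures.as_completed(futures)]

print(completed_order)  # shortest request finishes first: [1, 2, 0]
```

This is only an analogy for the scheduling behavior; in vLLM the engine interleaves requests at the token level, but the user-visible difference is the same: a blocking batch API versus a streaming one that surfaces finished requests early.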