At some point we should introduce automatic batching of run requests. Models, especially on the GPU, run more efficiently when inputs are batched.
One possible use case is that multiple run requests to the same model that are sitting in the queue are batched together and invoked once. This could work as follows (a rough sketch is given after the list):
- analyze the queue, see if there are other calls to the same model (with inputs of the same shape) queued up
- take a (configurable) number of requests and assemble their input tensors into a single tensor along the 0-th dimension
- call the model
- unpack the output along the 0-th dimension into the output keys for each request
- unblock the clients
This would allow requests from multiple clients to the same model to be batched.
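For illustration, a rough Python/NumPy sketch of the batch-and-unpack steps above. `model` and the request layout are placeholders, not the actual implementation:

```python
import numpy as np

def run_batched(model, requests):
    """Hypothetical sketch: batch several queued run requests to the same
    model (inputs of identical shape) into a single invocation."""
    # Assemble each request's input tensor into one tensor along the 0-th dimension.
    batch_input = np.concatenate([r["input"] for r in requests], axis=0)

    # Call the model once on the batched input.
    batch_output = model(batch_input)

    # Unpack the 0-th dimension back into per-request outputs.
    outputs = []
    offset = 0
    for r in requests:
        n = r["input"].shape[0]  # rows contributed by this request
        outputs.append(batch_output[offset:offset + n])
        offset += n
    return outputs  # one output tensor per client, ready to unblock them
```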
A run could be triggered when a) enough requests have been queued up (i.e. the batch is large enough) OR b) a timeout has expired.
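Sketched as a simple predicate (the names and parameters are illustrative assumptions, not a proposed API):

```python
import time

def should_flush(queued, batch_size, timeout_s, oldest_enqueue_time):
    """Hypothetical trigger: flush the batch when either condition holds."""
    enough_requests = len(queued) >= batch_size  # (a) batch is large enough
    timed_out = (time.monotonic() - oldest_enqueue_time) >= timeout_s  # (b) time expired
    return enough_requests or timed_out
```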
We could configure this when calling MODELSET, or with a separate command (like MODELCONFIG BATCH), or both.
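Whatever the command ends up being, the configurable knobs map directly onto the two trigger conditions. A hypothetical shape for them (names and defaults are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class BatchConfig:
    # Hypothetical knobs, one per trigger condition above.
    batch_size: int = 8    # flush once this many requests are queued
    timeout_ms: int = 10   # ...or once the oldest request has waited this long
```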