
Implement auto-batching #67

Closed

Description

@lantiga

At some point we should introduce automatic batching of run requests. Models, especially on the GPU, run more efficiently when inputs are batched.

One possible use case: multiple run requests to the same model that are sitting in the queue are batched together and executed in a single model invocation. This could work as follows:

  • analyze the queue and check whether other calls to the same model (with inputs of the same shape) are pending
  • take a (configurable) number of requests and assemble their input tensors into a single tensor along the 0-th dimension
  • call the model once on the batched input
  • unpack the output along the 0-th dimension back onto the output keys of each request
  • unblock the clients

This would allow requests from multiple clients to the same model to be batched transparently; a rough sketch follows.
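
A minimal sketch of the batching step, in Python for illustration only; `model.run`, the `requests` objects and their `inputs`/`outputs`/`unblock` attributes are assumptions, not RedisAI code:

```python
import numpy as np

def run_batched(model, requests):
    """Batch queued requests with same-shaped inputs and run the model once.

    Each request is assumed to carry a dict of input tensors keyed by name,
    all sharing the same per-request shape.
    """
    # Assemble each input key into a single tensor along the 0-th dimension.
    keys = requests[0].inputs.keys()
    batched_inputs = {
        k: np.concatenate([r.inputs[k] for r in requests], axis=0) for k in keys
    }

    # Call the model once on the assembled batch.
    batched_outputs = model.run(batched_inputs)

    # Unpack the 0-th dimension back onto the output keys of each request.
    offset = 0
    for r in requests:
        n = next(iter(r.inputs.values())).shape[0]
        r.outputs = {k: v[offset:offset + n] for k, v in batched_outputs.items()}
        offset += n
        r.unblock()  # wake up the waiting client
```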

A run could be triggered when a) enough requests have been queued up (i.e. the batch is large enough) OR b) a configured amount of time has expired.
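
The trigger policy could look something like the sketch below; `batch_size`, `timeout_ms` and the `enqueued_at` attribute are assumed names, not existing configuration:

```python
import time

def should_run(queue, batch_size, timeout_ms):
    """Trigger a run when the batch is full OR the oldest request timed out."""
    if not queue:
        return False
    if len(queue) >= batch_size:  # (a) enough requests queued up
        return True
    oldest = queue[0]
    waited_ms = (time.monotonic() - oldest.enqueued_at) * 1000
    return waited_ms >= timeout_ms  # (b) time has expired
```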

We could configure this when calling MODELSET, or with a separate command (like MODELCONFIG BATCH). Or both.
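
For illustration only, the syntax could end up looking something like this (hypothetical, not an agreed-upon API):

```
MODELSET mymodel TF GPU BATCHSIZE 8 BATCHTIMEOUT 10 <model blob>
MODELCONFIG mymodel BATCH SIZE 8 TIMEOUT 10
```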
