At some point we should introduce automatic batching of run requests. Models, especially on the GPU, run more efficiently when inputs are batched.
One possible use case is that multiple run requests to the same model that are sitting in the queue are batched together and invoked once. This could work as follows (a rough sketch is given after the list):
- analyze the queue, see if there are other calls to the same model (with inputs of the same shape) queued up
- take a (configurable) number of requests and assemble their input tensors into a single tensor along the 0-th dimension
- call the model
- unpack the output along the 0-th dimension into the output keys for each request
- unblock the clients
This would allow requests from multiple clients to the same model to be batched.
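For illustration, a rough Python/NumPy sketch of the batch-and-unpack steps above. `model` and the request layout are placeholders, not the actual implementation:

```python
import numpy as np

def run_batched(model, requests):
    """Hypothetical sketch: batch several queued run requests to the same
    model (inputs of identical shape) into a single invocation."""
    # Assemble each request's input tensor into one tensor along the 0-th dimension.
    batch_input = np.concatenate([r["input"] for r in requests], axis=0)

    # Call the model once on the batched input.
    batch_output = model(batch_input)

    # Unpack the 0-th dimension back into per-request outputs.
    outputs = []
    offset = 0
    for r in requests:
        n = r["input"].shape[0]  # rows contributed by this request
        outputs.append(batch_output[offset:offset + n])
        offset += n
    return outputs  # one output tensor per client, ready to unblock them
```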
A run could be triggered when a) enough requests have been queued up (i.e. the batch is large enough) OR b) a timeout has expired.
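Sketched as a simple predicate (the names and parameters are illustrative assumptions, not a proposed API):

```python
import time

def should_flush(queued, batch_size, timeout_s, oldest_enqueue_time):
    """Hypothetical trigger: flush the batch when either condition holds."""
    enough_requests = len(queued) >= batch_size  # (a) batch is large enough
    timed_out = (time.monotonic() - oldest_enqueue_time) >= timeout_s  # (b) time expired
    return enough_requests or timed_out
```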
We could configure this when calling MODELSET, or with a separate command (like MODELCONFIG BATCH), or both.
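Whatever the command ends up being, the configurable knobs map directly onto the two trigger conditions. A hypothetical shape for them (names and defaults are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class BatchConfig:
    # Hypothetical knobs, one per trigger condition above.
    batch_size: int = 8    # flush once this many requests are queued
    timeout_ms: int = 10   # ...or once the oldest request has waited this long
```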