Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
While doing experiments with batched processing I noticed that batching two sequences of length N performed exactly the same as running a single inference over one sequence of length 2N. This seemed peculiar to me, as I understood most implementations to use a separate dimension for batching, so the two scenarios should have differed in some way. Looking through the code, I fail to see a batch dimension in any of the tensors constructed in the various build_*() functions. From what I can tell, batching is done by extending the token dimension to the combined length of all batches, whereas other implementations extend the token dimension to the length of the longest batch and pad the other batches (Source for PyTorch).
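For illustration, here is a minimal C sketch of the two layouts, a flattened token dimension versus a padded per-batch matrix. The names and helpers are hypothetical and not part of llama.cpp's API.

```c
// Illustrative only: two ways of laying out a batch of token sequences.
// Names (flatten_batch, pad_batch, ...) are hypothetical, not llama.cpp API.
#include <stdint.h>
#include <string.h>

typedef int32_t token_t;

// Current approach as described above: all sequences are concatenated into a
// single token dimension of combined length sum(seq_len[i]).
int flatten_batch(const token_t *const *seqs, const int *seq_len, int n_seqs,
                  token_t *flat_tokens /* size: sum of seq_len[i] */) {
    int n_tokens = 0;
    for (int s = 0; s < n_seqs; ++s) {
        memcpy(flat_tokens + n_tokens, seqs[s], seq_len[s] * sizeof(token_t));
        n_tokens += seq_len[s];
    }
    return n_tokens; // the graph is built over a 1-D token dimension of this length
}

// Padded approach (e.g. PyTorch-style): a 2-D [n_seqs x max_len] matrix where
// shorter sequences are padded so every row has the same length.
void pad_batch(const token_t *const *seqs, const int *seq_len, int n_seqs,
               int max_len, token_t pad_id,
               token_t *padded_tokens /* size: n_seqs * max_len */) {
    for (int s = 0; s < n_seqs; ++s) {
        for (int t = 0; t < max_len; ++t) {
            padded_tokens[s * max_len + t] = t < seq_len[s] ? seqs[s][t] : pad_id;
        }
    }
}
```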
Motivation
In #3624 (comment), performance with tree-based speculative decoding did not improve over single-sequence drafting. This is consistent with my own tests, but inconsistent with the results of the original paper and with other implementations. The lack of a batch dimension makes using multiple drafted sequences equivalent, performance-wise, to a single, longer sequence. Adding a batch dimension would turn some operations from matrix-vector multiplications into matrix-matrix multiplications, which map better to the hardware. For example, the input tokens tensor, currently a vector of length n_tokens, would become a matrix of size n_batches x max_batch_length, and the embeddings tensor would accordingly become a 3-dimensional tensor of size n_batches x n_embed x max_batch_length. Batch computations should be entirely independent, so these 3-D tensor multiplications allow some optimizations that map better to the hardware.
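To make the shape change concrete, here is a rough GGML sketch. It is not the actual llama.cpp graph code, the sizes are made up, and it assumes ggml_mul_mat broadcasts a 2-D weight over the extra dimension of its second argument.

```c
// Rough sketch only: how the tensor shapes would change with an explicit batch
// dimension. Note that in GGML, ne[0] is the innermost (contiguous) dimension.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // only tensor metadata, no data buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd  = 4096;   // made-up sizes
    const int64_t n_vocab = 32000;

    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);

    // Current layout: one flat token dimension of length n_tokens (all sequences
    // concatenated); embeddings are [n_embd, n_tokens].
    const int64_t n_tokens = 16;
    struct ggml_tensor * cur = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    struct ggml_tensor * out = ggml_mul_mat(ctx, w, cur);   // [n_vocab, n_tokens]

    // Proposed layout: an explicit batch dimension; embeddings are
    // [n_embd, max_batch_length, n_batches] and the product becomes
    // [n_vocab, max_batch_length, n_batches], i.e. a batched matrix-matrix multiply
    // with the same weight applied independently to every batch.
    const int64_t n_batches = 4, max_batch_length = 8;
    struct ggml_tensor * cur3d = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, max_batch_length, n_batches);
    struct ggml_tensor * out3d = ggml_mul_mat(ctx, w, cur3d);

    (void) out; (void) out3d;
    ggml_free(ctx);
    return 0;
}
```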
Possible Implementation
I attempted adding the batch dimension myself, but it has been slow going due to my limited understanding of the GGML internals. The biggest concern is determining where the batch dimension should go: putting it last would likely cause severe memory fragmentation, while putting it first is incompatible with current operations that assume a certain shape.
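To make the placement question concrete, here is a sketch of the two options for, e.g., the embeddings tensor in GGML's ne[] ordering, where ne[0] is the contiguous, fastest-varying dimension. The shapes are hypothetical, and which option corresponds to "first" or "last" above depends on how the shape is read.

```c
// Illustrative only: two candidate placements of the batch dimension for the
// embeddings tensor. In GGML, ne[0] is the contiguous, fastest-varying dimension.
#include "ggml.h"

void batch_dim_options(struct ggml_context * ctx,
                       int64_t n_embed, int64_t max_batch_length, int64_t n_batches) {
    // Batch dimension outermost: each [n_embed, max_batch_length] slice is a
    // contiguous block shaped like today's 2-D tensors, and data belonging to
    // different batches lives in separate regions of the buffer.
    struct ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                                                n_embed, max_batch_length, n_batches);

    // Batch dimension innermost: elements from different batches are interleaved
    // at the finest granularity, which conflicts with operations that assume the
    // embedding dimension is contiguous in ne[0].
    struct ggml_tensor * b = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                                                n_batches, n_embed, max_batch_length);

    (void) a; (void) b;
}
```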
It's important to note that adding a batch dimension shouldn't impact performance of single-batch inference, and prompt pre-filling should similarly be unaffected since that's not a true "batched" operation in the context of the batch dimension.
Obviously though, adding an entire dimension to all operations would incur a hefty development cost, so it is worth weighing carefully whether the current approach is already "good enough".