Batch processing should use a currently-missing batch dimension for all tensors #4526

Closed · 4 tasks done
AutonomicPerfectionist opened this issue Dec 18, 2023 · 4 comments
Labels: enhancement, stale

Comments

@AutonomicPerfectionist (Contributor)

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

While doing experiments with batched processing, I noticed that the performance of batching two sequences of length N was exactly equal to the performance of running a single-batch inference of length 2N. This seemed peculiar to me, as I understood most implementations to use a separate dimension for batching, so performance between the two scenarios should have differed in some way. Looking through the code, I don't see a batch dimension in any of the tensors constructed in the various build_*() functions. From what I can tell, batches are handled by extending the token dimension to the combined length of all batches, whereas other implementations extend the token dimension to the length of the longest batch and pad the others.

Source for PyTorch
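
To make the layout difference concrete, here is a minimal PyTorch-style sketch of the two schemes; the names and sizes (n_embd, W, x_flat, x_batched) are placeholders for illustration, not llama.cpp identifiers.

```python
import torch

n_embd, n_tokens = 64, 32              # hypothetical sizes
W = torch.randn(n_embd, n_embd)        # stand-in for a weight matrix

# Flattened layout: two sequences of length n_tokens are concatenated,
# so the hidden state is a single (2 * n_tokens, n_embd) matrix.
x_flat = torch.randn(2 * n_tokens, n_embd)
y_flat = x_flat @ W.T

# Layout with an explicit batch dimension: (n_batches, max_batch_length, n_embd).
x_batched = x_flat.view(2, n_tokens, n_embd)
y_batched = torch.matmul(x_batched, W.T)

# Same numbers, different shape; the question is which layout the kernels see.
assert torch.allclose(y_flat.view(2, n_tokens, n_embd), y_batched, atol=1e-5)
```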

Motivation

In #3624 (comment) performance with tree-based speculative decoding did not improve over single-sequence drafting. This is consistent with my own tests, but inconsistent with the results of the original paper and of other implementations. The lack of a batch dimension makes using multiple drafted sequences equivalent, performance-wise, to a single, longer sequence. Using a batch dimension would transform some operations from matrix-vector multiplications into matrix-matrix multiplications, which map better to hardware. For example, the input tokens tensor, currently a vector of length n_tokens, would become a matrix of size n_batches x max_batch_length. Accordingly, the embeddings tensor would become a 3-dimensional tensor of size n_batches x n_embed x max_batch_length. Since batch computations are entirely independent, these 3-d tensor multiplications can be optimized to map better to hardware.
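
As a rough sketch of that transformation (PyTorch notation rather than GGML's; K, x, and the sizes below are hypothetical), the per-token matrix-vector products collapse into one batched matrix-matrix product:

```python
import torch

n_batches, max_batch_len, n_embd = 4, 8, 64
K = torch.randn(n_batches, 128, n_embd)          # per-sequence keys (hypothetical)
x = torch.randn(n_batches, max_batch_len, n_embd)

# Without a batch dimension: one matrix-vector product per drafted token.
scores_loop = torch.stack([
    torch.stack([K[b] @ x[b, t] for t in range(max_batch_len)])
    for b in range(n_batches)
])                                               # (n_batches, max_batch_len, 128)

# With a batch dimension: a single batched matrix-matrix product.
scores_bmm = torch.bmm(x, K.transpose(1, 2))     # (n_batches, max_batch_len, 128)

assert torch.allclose(scores_loop, scores_bmm, atol=1e-4)
```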

Possible Implementation

I attempted adding the batch dimension myself, but it has been slow going due to my limited understanding of the GGML internals. The biggest question is where the batch dimension should go: putting it last would likely cause severe memory fragmentation, while putting it first breaks compatibility with current operations that assume a certain shape.
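
For what it's worth, the contiguity trade-off behind that question can be sketched with strides (PyTorch notation below, where the first dimension is the outermost; as I understand it, GGML's ne[] ordering is the reverse, with ne[0] contiguous):

```python
import torch

n_batches, seq_len, n_embd = 4, 8, 16

# Batch axis outermost: each sequence's data stays contiguous, so a single
# sequence can be viewed without copies, but existing kernels that expect
# 2-d (seq_len, n_embd) tensors now see an extra leading dimension.
batch_outer = torch.zeros(n_batches, seq_len, n_embd)
print(batch_outer[0].is_contiguous())      # True

# Batch axis innermost: elements of different sequences interleave in memory,
# so slicing out one sequence yields a strided, non-contiguous view.
batch_inner = torch.zeros(seq_len, n_embd, n_batches)
print(batch_inner[..., 0].is_contiguous()) # False
```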

It's important to note that adding a batch dimension shouldn't impact performance of single-batch inference, and prompt pre-filling should similarly be unaffected since that's not a true "batched" operation in the context of the batch dimension.

Obviously, though, adding an entire dimension to all operations would incur a hefty development cost, so whether the current approach is "good enough" should be weighed carefully.

@AutonomicPerfectionist added the enhancement label on Dec 18, 2023
@ggerganov (Member)

Batched decoding is supported. You don't need an explicit dimension to do batched decoding.

On CUDA with F16 models, the processing scales linearly for small batches (<32). With quantum models, not so much, but it still scales.

See the batched-bench example for more info

@AutonomicPerfectionist (Contributor, Author)

I know batched decoding is supported; what I meant is that performance could be improved by using a batch dimension, as other implementations do.

github-actions bot

This issue is stale because it has been open for 30 days with no activity.

@github-actions bot added the stale label on Mar 18, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions bot closed this as completed on Apr 2, 2024