Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
While doing experiments with batched processing I noticed that batching two sequences of length N performed exactly the same as running a single inference over one sequence of length 2N. This seemed peculiar to me, as I understood most implementations to use a separate dimension for batching, so the two scenarios should have differed in some way. Looking through the code, I fail to see a batch dimension in any of the tensors constructed in the various build_*() functions. From what I can tell, batching is done by extending the token dimension to the combined length of all batches, whereas other implementations extend the token dimension to the length of the longest batch and pad the other batches (Source for PyTorch).
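For illustration, here is a minimal C sketch of the two layouts, a flattened token dimension versus a padded per-batch matrix. The names and helpers are hypothetical and not part of llama.cpp's API.

```c
// Illustrative only: two ways of laying out a batch of token sequences.
// Names (flatten_batch, pad_batch, ...) are hypothetical, not llama.cpp API.
#include <stdint.h>
#include <string.h>

typedef int32_t token_t;

// Current approach as described above: all sequences are concatenated into a
// single token dimension of combined length sum(seq_len[i]).
int flatten_batch(const token_t *const *seqs, const int *seq_len, int n_seqs,
                  token_t *flat_tokens /* size: sum of seq_len[i] */) {
    int n_tokens = 0;
    for (int s = 0; s < n_seqs; ++s) {
        memcpy(flat_tokens + n_tokens, seqs[s], seq_len[s] * sizeof(token_t));
        n_tokens += seq_len[s];
    }
    return n_tokens; // the graph is built over a 1-D token dimension of this length
}

// Padded approach (e.g. PyTorch-style): a 2-D [n_seqs x max_len] matrix where
// shorter sequences are padded so every row has the same length.
void pad_batch(const token_t *const *seqs, const int *seq_len, int n_seqs,
               int max_len, token_t pad_id,
               token_t *padded_tokens /* size: n_seqs * max_len */) {
    for (int s = 0; s < n_seqs; ++s) {
        for (int t = 0; t < max_len; ++t) {
            padded_tokens[s * max_len + t] = t < seq_len[s] ? seqs[s][t] : pad_id;
        }
    }
}
```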
Motivation
In #3624 (comment), performance with tree-based speculative decoding did not improve over single-sequence drafting. This is consistent with my own tests, but inconsistent with the results of the original paper and with other implementations. The lack of a batch dimension makes using multiple drafted sequences equivalent, performance-wise, to a single, longer sequence. Adding a batch dimension would turn some operations from matrix-vector multiplications into matrix-matrix multiplications, which map better to the hardware. For example, the input tokens tensor, currently a vector of length n_tokens, would become a matrix of size n_batches x max_batch_length, and the embeddings tensor would accordingly become a 3-dimensional tensor of size n_batches x n_embed x max_batch_length. Batch computations should be entirely independent, so these 3-D tensor multiplications allow some optimizations that map better to the hardware.
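To make the shape change concrete, here is a rough GGML sketch. It is not the actual llama.cpp graph code, the sizes are made up, and it assumes ggml_mul_mat broadcasts a 2-D weight over the extra dimension of its second argument.

```c
// Rough sketch only: how the tensor shapes would change with an explicit batch
// dimension. Note that in GGML, ne[0] is the innermost (contiguous) dimension.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // only tensor metadata, no data buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd  = 4096;   // made-up sizes
    const int64_t n_vocab = 32000;

    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);

    // Current layout: one flat token dimension of length n_tokens (all sequences
    // concatenated); embeddings are [n_embd, n_tokens].
    const int64_t n_tokens = 16;
    struct ggml_tensor * cur = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    struct ggml_tensor * out = ggml_mul_mat(ctx, w, cur);   // [n_vocab, n_tokens]

    // Proposed layout: an explicit batch dimension; embeddings are
    // [n_embd, max_batch_length, n_batches] and the product becomes
    // [n_vocab, max_batch_length, n_batches], i.e. a batched matrix-matrix multiply
    // with the same weight applied independently to every batch.
    const int64_t n_batches = 4, max_batch_length = 8;
    struct ggml_tensor * cur3d = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, max_batch_length, n_batches);
    struct ggml_tensor * out3d = ggml_mul_mat(ctx, w, cur3d);

    (void) out; (void) out3d;
    ggml_free(ctx);
    return 0;
}
```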
Possible Implementation
I attempted adding the batch dimension myself, but it has been slow going due to my limited understanding of the GGML internals. The biggest concern is determining where the batch dimension should go: putting it last would likely cause severe memory fragmentation, while putting it first is incompatible with current operations that assume a certain shape.
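To make the placement question concrete, here is a sketch of the two options for, e.g., the embeddings tensor in GGML's ne[] ordering, where ne[0] is the contiguous, fastest-varying dimension. The shapes are hypothetical, and which option corresponds to "first" or "last" above depends on how the shape is read.

```c
// Illustrative only: two candidate placements of the batch dimension for the
// embeddings tensor. In GGML, ne[0] is the contiguous, fastest-varying dimension.
#include "ggml.h"

void batch_dim_options(struct ggml_context * ctx,
                       int64_t n_embed, int64_t max_batch_length, int64_t n_batches) {
    // Batch dimension outermost: each [n_embed, max_batch_length] slice is a
    // contiguous block shaped like today's 2-D tensors, and data belonging to
    // different batches lives in separate regions of the buffer.
    struct ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                                                n_embed, max_batch_length, n_batches);

    // Batch dimension innermost: elements from different batches are interleaved
    // at the finest granularity, which conflicts with operations that assume the
    // embedding dimension is contiguous in ne[0].
    struct ggml_tensor * b = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                                                n_batches, n_embed, max_batch_length);

    (void) a; (void) b;
}
```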
It's important to note that adding a batch dimension shouldn't impact performance of single-batch inference, and prompt pre-filling should similarly be unaffected since that's not a true "batched" operation in the context of the batch dimension.
Obviously though, adding an entire dimension to all operations would incur a hefty development cost, so it is worth weighing carefully whether the current approach is already "good enough".