
gpu: concurrently dispatch operations #2309


Closed
wants to merge 4 commits

Conversation

@lshzh-ww (Contributor) commented Jul 21, 2023

This is an early attempt to maximize throughput for all GPU backends. For now the Metal backend runs the operations in a graph serially. That is, for an element-wise or reduce operation that only needs a few hundred threads, the GPU will use only a few hundred threads, even though modern GPUs can run tens of thousands of threads at the same time. (The CUDA backend may also have this problem; I am not sure.)

This PR resolves this by providing a ggml_graph_find_concurrency function that determines whether some operations can be issued to the GPU simultaneously. Before sending a graph to the GPU backend, we can call the new function to find the concurrency in the graph. It sorts all the nodes and inserts memory-barrier nodes (nodes with op=GGML_OP_BARRIER) where necessary. A backend can simply ignore the barrier nodes and issue the operations sequentially, or it can concurrently issue all the operations between two barriers.
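
As a rough sketch of how a backend might consume the reordered graph (an illustration, not the exact code in this PR; the dispatch and synchronization helpers are placeholders for backend-specific code):

```c
#include "ggml.h"

// Sketch only: issue every node between two barriers without waiting,
// and synchronize whenever a GGML_OP_BARRIER node is reached.
// dispatch_async()/wait_for_pending() stand in for backend-specific code,
// e.g. encoding a Metal kernel / inserting a memory barrier.
static void dispatch_async(struct ggml_tensor * node) { (void) node; }
static void wait_for_pending(void) { }

static void run_reordered_graph(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        struct ggml_tensor * node = gf->nodes[i];
        if (node->op == GGML_OP_BARRIER) {
            wait_for_pending();   // everything issued so far must finish here
        } else {
            dispatch_async(node); // may overlap with the other nodes in this group
        }
    }
    wait_for_pending();
}
```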

Taking the graph from #915 as an example:

Operations #4, #7 and #18 have no dependencies on each other and can run concurrently.
The ggml_graph_find_concurrency function reorders the graph, puts them together, and inserts memory barriers before and after them.

On the master branch, the net inference time for a 33B model is ~69.6 ms/tok on my M1 Max.

On this PR, the net inference time for a 33B model is ~65.5 ms/tok on my M1 Max.

However, for now we create the graph, find the concurrency, and encode the commands anew for every single token (at least for Metal). This overhead pretty much kills the gain, leaving only a poor ~0.3 ms/tok speedup. So, @ggerganov @slaren, could the new backend interface have some mechanism to tell the GPU backend that the graph is unchanged since the last evaluation? That way we could save the time spent encoding the commands for the GPU, as well as the time spent finding concurrency.

lshzh-ww added 3 commits July 21, 2023 11:17
It's advised that a program should only have one command buffer. This slows inference by ~1 ms on a 33B model, but we may avoid that by reusing the previous command queue.
This commit adds a ggml_graph_find_concurrency function to find whether some operations can be issued to the GPU simultaneously.

Before sending a graph to the GPU backend we can call the new function to find the concurrency in the graph. This will sort all the nodes and insert memory barrier nodes if necessary. One can simply ignore the barrier nodes and issue operations sequentially, or try to concurrently issue all the operations between two barriers.
@lshzh-ww requested a review from ggerganov July 21, 2023 16:17
@lshzh-ww (Contributor, Author) commented Jul 21, 2023

This commit also paves the way for future optimization. Since all the operations are now strictly ordered by their dependencies, the profiling results are easier to read.

(bandwidth usage for inferring 2 layers in llama 33B)

For every layer we can now reach ~93% of the hardware bandwidth limit when many operations are issued concurrently, but only ~15% when there is a long dependency chain (critical path).

In the future, GPU backends may identify such long chains by finding the op-barrier-op-barrier-op-barrier pattern in the reordered graph and fuse these operations (either by using prepared fused kernels or by generating them automatically).

@slaren (Member) commented Jul 21, 2023

I will look into this in more detail later, but here are some things to consider:

Generally I am inclined to leave the implementation details of this to each backend, but if there is a way to do some pre-processing that could simplify the implementation in the backends, that could be interesting. I haven't looked into the details of the code yet.

@lshzh-ww (Contributor, Author) commented Jul 21, 2023

  • Reusing graphs is not so easy because the parameters n_tokens and n_past change for each evaluation, so the graphs are not always identical

Ahh... I didn't notice that there was an n_past parameter. May I ask whether the topological structure (i.e. the dependency relations) of the graph changes between evaluations?

  • The same PR adds a new tensor allocator that needs to be aware of this so that operations that may be executed in parallel don't use the same memory. It's going to take a while until that is implemented.

It looks like tensors belonging to the same layer don't overlap with each other. I limit the search depth to avoid running operations from two different layers at the same time.

Generally I am inclined to leave the implementation details of this to each backend, but if there is a way to do some pre-processing that could simplify the implementation in the backends, that could be interesting. I haven't looked into the details of the code yet.

I agree. Figuring out the dependencies is useful for all GPU backends, and may also be useful for the new tensor allocator.
How about this: have a ggml function that parses the graph and returns an array like
[1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER] (#define BARRIER -1),
which means that nodes 1, 4, 7 can be run simultaneously, and so on. The indexes between two BARRIER markers mean those nodes may be executed in parallel.

Backends can store this array in their ctx or elsewhere. As long as they are not informed of a change in the graph structure, they can reuse this array to construct streams (CUDA) or insert memory barriers (Metal). I am not familiar with CUDA, but on Metal the implementation would look like: encode nodes 1, 4, 7; add a memory barrier; encode node 2; add a memory barrier; and so on.
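
As a minimal, self-contained sketch (illustration only, not code from this PR) of how a backend could walk such a plan array, where the printf calls stand in for encoding kernels and inserting memory barriers:

```c
#include <stdio.h>

#define BARRIER -1  // separates groups of nodes that may run concurrently

int main(void) {
    // the example plan from above: 1, 4, 7 together, then 2, then 3, then 5, 6, 9
    const int plan[] = { 1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER };
    const int n = (int) (sizeof(plan)/sizeof(plan[0]));

    int group = 0;
    for (int i = 0; i < n; ++i) {
        if (plan[i] == BARRIER) {
            printf("-- memory barrier after group %d --\n", group++); // Metal: memory barrier; CUDA: stream sync
        } else {
            printf("group %d: encode node %d\n", group, plan[i]);     // Metal: encode the kernel for this node
        }
    }
    return 0;
}
```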

@lshzh-ww marked this pull request as draft July 21, 2023 20:04
@ggerganov (Member) commented Jul 22, 2023

May I ask if the topological structure (i.e. dependency relations) of the graph will change for each evaluation?

The topology is the same, but the shapes of the tensors change.

In general, I like the proposed idea and thanks to this work I have learned about one more Metal feature: computeCommandEncoderWithDispatchType: MTLDispatchTypeConcurrent.
Great contributions @lshzh-ww!

I think the GGML_OP_BARRIER introduction is not great, as it seems too intrusive to modify the compute graph, but the idea of obtaining the concurrency as an integer array and having each backend utilize it is good. Maybe the "graph change" notification can be avoided for now, and we can assume that each backend context always operates on the same graph topology.

If we do it this way and see that the approach is viable, we can think later to integrate it more tightly with ggml, following @slaren's backend work.

Ideally, the search_depth hack should be avoided if possible. It should be possible to check whether two tensors overlap by using their data member plus the ggml_nbytes() size to find the memory region written to by each one (similar to what we do in ggml-metal.m to find the corresponding Metal buffer for a tensor).
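
A minimal sketch of such a check (an illustration, not code from this PR): two tensors conflict if their [data, data + ggml_nbytes()) byte ranges intersect.

```c
#include <stdbool.h>
#include "ggml.h"

// Illustration only: treat two tensors as conflicting if the byte ranges
// [data, data + ggml_nbytes(t)) intersect. A real check may also need to
// handle views that alias a region inside another tensor's buffer.
static bool tensors_overlap(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    const char * a0 = (const char *) a->data;
    const char * a1 = a0 + ggml_nbytes(a);
    const char * b0 = (const char *) b->data;
    const char * b1 = b0 + ggml_nbytes(b);
    return a0 < b1 && b0 < a1;
}
```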

Btw, since you already have the means, I'm curious what memory bandwidth utilization Xcode reports for the F16 model. Also, is the 381 GB/s number for Q4_0 33B?

@slaren (Member) commented Jul 22, 2023

Having a ggml_func that parse the graph, then return an array like this [1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER], (#define BARRIER -1) which means that nodes 1,4,7 can be run simultaneously, etc. Indexes between two BARRIER mean these nodes may be executed in parallel.

I think that wouldn't work entirely, because in some cases you want groups of nodes to execute in parallel with other groups of nodes. For example, parallel execution in CUDA when computing Kcur, Vcur and Qcur looks like this:
(image: parallel execution of Kcur, Vcur and Qcur in CUDA)

So you would need each element in the list to refer to a group of nodes, not just a single node. And this process would need to be recursive, because inside a group there may also be multiple subgroups that can be executed in parallel. So what you end up with is a graph, exactly the same as the one you started with.

@lshzh-ww (Contributor, Author) commented

Btw, since you already have the means, I'm curious what is the memory bandwidth utilization reported by Xcode for the F16 model? Also, is the 381 GB/s number for Q4_0 33B?

Yes, all figures in my original post are for a Q4_0 33B model. Here is the profiling result for the 13B F16 model (under this PR, not master).

It looks like our F16 kernel is somehow more bound by integer/bit ops (high ALU usage, low F32 and F16 usage). Similar behavior was also observed for Q4_0 models before PR #2248, because we had lots of bit ops on uint8. I haven't looked into the F16 kernel, so I don't know what really limits it.

@lshzh-ww (Contributor, Author) commented Jul 23, 2023

I think that wouldn't work entirely because in some cases you want groups of nodes to execute in parallel to other groups of nodes. For example parallel execution in CUDA when computing Kcur, Vcur and Qcur looks like this:

Thanks for your graph! Now I understand what you want to do. On Metal, a compute pass feels like what CUDA calls a stream, except that Metal also allows some concurrency within a single stream, separated by memory barriers. I simply put all commands into one stream; I realize this is quite a coarse-grained way to control concurrency.

For example, in CUDA you first issue tmpk, tmpq and tmpv concurrently, and you can issue Kcur as soon as tmpk finishes, without having to wait for tmpq and tmpv. In my code I also issue tmpk, tmpq and tmpv first, but I have to wait for all of them to finish before I can issue Kcur, Vcur and Qcur.
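
For illustration only (an assumption about how this could look, not ggml-cuda code), the fine-grained CUDA pattern can be expressed with plain streams: enqueue Kcur on the same stream as tmpk so it waits only for tmpk, while tmpq and tmpv run on their own streams. Here cudaMemsetAsync stands in for the real kernels:

```c
#include <cuda_runtime.h>

int main(void) {
    cudaStream_t sk, sq, sv;
    void *k, *q, *v;
    cudaMalloc(&k, 1024); cudaMalloc(&q, 1024); cudaMalloc(&v, 1024);
    cudaStreamCreate(&sk); cudaStreamCreate(&sq); cudaStreamCreate(&sv);

    cudaMemsetAsync(k, 0, 1024, sk);  // "tmpk"
    cudaMemsetAsync(q, 0, 1024, sq);  // "tmpq" runs concurrently
    cudaMemsetAsync(v, 0, 1024, sv);  // "tmpv" runs concurrently
    cudaMemsetAsync(k, 1, 1024, sk);  // "Kcur": stream ordering makes it wait only for "tmpk"

    cudaDeviceSynchronize();
    cudaFree(k); cudaFree(q); cudaFree(v);
    cudaStreamDestroy(sk); cudaStreamDestroy(sq); cudaStreamDestroy(sv);
    return 0;
}
```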

I think I could also use multiple streams in Metal to achieve finer-grained control over concurrency. On Metal, synchronizing between different streams has a larger overhead than inserting memory barriers within one stream, so I need to do some more profiling to see whether this change actually boosts performance, or at least matches it. If we are lucky, we can have a unified function that does the heavy lifting for all GPU backends, although for now I don't have an idea of how we should pass such information to all backends.

@lshzh-ww (Contributor, Author) commented

For metal, synchronizing between different Stream will have larger overhead than having memory barriers in one Stream, so I need to do some more profiling to see if this change actually boost or at least achieve similar performance.

On Metal there is a ~40 µs latency for switching to another stream, and it looks like Metal can only run two streams at a time. For Metal we are better off using the coarse-grained approach proposed in this PR.

Since Metal and CUDA will use different schemes to run operations concurrently, it's hard to provide one function that serves both backends. I will take @ggerganov's advice and move ggml_graph_find_concurrency into the ggml-metal.m file. Shall I open this PR against the ggml-backends branch, or should I wait for ggml-backends to be merged?

@slaren (Member) commented Jul 23, 2023

The ggml-backends branch is still going to take a while, so it would be better to open a PR to master instead.
