
gpu: concurrently dispatch operations #2309


Closed
wants to merge 4 commits

Conversation

@lshzh-ww (Contributor) commented Jul 21, 2023

This is an early attempt to maximize throughput for all GPU backends. For now the Metal backend runs the operations in a graph serially. That is, for an element-wise or reduce operation that only needs a few hundred threads, the GPU will use only a few hundred threads, even though modern GPUs can run tens of thousands of threads at the same time. (The CUDA backend may also have this problem; I am not sure.)

This PR resolves this by providing a ggml_graph_find_concurrency function that determines whether some operations can be issued to the GPU simultaneously. Before sending a graph to the GPU backend, we can call the new function to find the concurrency in the graph. It sorts all the nodes and inserts memory-barrier nodes (nodes with op=GGML_OP_BARRIER) where necessary. A backend can simply ignore the barrier nodes and issue the operations sequentially, or it can concurrently issue all the operations between two barriers.
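
As a rough sketch of how a backend might consume the reordered graph (an illustration, not the exact code in this PR; the dispatch and synchronization helpers are placeholders for backend-specific code):

```c
#include "ggml.h"

// Sketch only: issue every node between two barriers without waiting,
// and synchronize whenever a GGML_OP_BARRIER node is reached.
// dispatch_async()/wait_for_pending() stand in for backend-specific code,
// e.g. encoding a Metal kernel / inserting a memory barrier.
static void dispatch_async(struct ggml_tensor * node) { (void) node; }
static void wait_for_pending(void) { }

static void run_reordered_graph(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        struct ggml_tensor * node = gf->nodes[i];
        if (node->op == GGML_OP_BARRIER) {
            wait_for_pending();   // everything issued so far must finish here
        } else {
            dispatch_async(node); // may overlap with the other nodes in this group
        }
    }
    wait_for_pending();
}
```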

Taking the graph from #915 as an example:

Operations #4, #7 and #18 have no dependencies on each other and can run concurrently.
The ggml_graph_find_concurrency function reorders the graph, puts them together, and inserts memory barriers before and after them.

On the master branch, the net inference time for a 33B model is ~69.6 ms/tok on my M1 Max.

On this PR, the net inference time for a 33B model is ~65.5 ms/tok on my M1 Max.

However, for now we create the graph, find the concurrency, and encode the commands anew for every single token (at least for Metal). This overhead pretty much kills the gain, leaving only a poor ~0.3 ms/tok speedup. So, @ggerganov @slaren, could the new backend interface have some mechanism to tell the GPU backend that the graph is unchanged since the last evaluation? That way we could save the time spent encoding the commands for the GPU, as well as the time spent finding concurrency.

lshzh-ww added 3 commits July 21, 2023 11:17
It's advised that a program should only have one command buffer. This slows inference by ~1 ms on a 33B model, but we may avoid that by reusing the previous command queue.
This commit adds a ggml_graph_find_concurrency function to find whether some operations can be issued to the GPU simultaneously.

Before sending a graph to the GPU backend we can call the new function to find the concurrency in the graph. This will sort all the nodes and insert memory barrier nodes if necessary. One can simply ignore the barrier nodes and issue operations sequentially, or try to concurrently issue all the operations between two barriers.
@lshzh-ww requested a review from ggerganov July 21, 2023 16:17
@lshzh-ww (Contributor, Author) commented Jul 21, 2023

This commit also paves the way for future optimization. Since all the operations are now strictly ordered by their dependencies, the profiling results are easier to read.

(bandwidth usage for inferring 2 layers in llama 33B)

For every layer we can now reach ~93% of the hardware bandwidth limit when many operations are issued concurrently, but only ~15% when there is a long dependency chain (critical path).

In the future, GPU backends may identify such long chains by finding the op-barrier-op-barrier-op-barrier pattern in the reordered graph and fuse these operations (either by using prepared fused kernels or by generating them automatically).

@slaren (Member) commented Jul 21, 2023

I will look into this in more detail later, but here are some things to consider:

Generally I am inclined to leave the implementation details of this to each backend, but if there is a way to do some pre-processing that could simplify the implementation in the backends, that could be interesting. I haven't looked into the details of the code yet.

@lshzh-ww (Contributor, Author) commented Jul 21, 2023

  • Reusing graphs is not so easy because the parameters n_tokens and n_past change for each evaluation, so the graphs are not always identical

Ahh... I didn't notice that there was an n_past parameter. May I ask whether the topological structure (i.e. the dependency relations) of the graph changes between evaluations?

  • The same PR adds a new tensor allocator that needs to be aware of this so that operations that may be executed in parallel don't use the same memory. It's going to take a while until that is implemented.

It looks like tensors belonging to the same layer don't overlap with each other. I limit the search depth to avoid running operations from two different layers at the same time.

Generally I am inclined to leave the implementation details of this to each backend, but if there is a way to do some pre-processing that could simplify the implementation in the backends, that could be interesting. I haven't looked into the details of the code yet.

I agree. Figuring out the dependencies is useful for all GPU backends, and may also be useful for the new tensor allocator.
How about this: have a ggml function that parses the graph and returns an array like
[1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER] (#define BARRIER -1),
which means that nodes 1, 4, 7 can be run simultaneously, and so on. The indexes between two BARRIER markers mean those nodes may be executed in parallel.

Backends can store this array in their ctx or elsewhere. As long as they are not informed of a change in the graph structure, they can reuse this array to construct streams (CUDA) or insert memory barriers (Metal). I am not familiar with CUDA, but on Metal the implementation would look like: encode nodes 1, 4, 7; add a memory barrier; encode node 2; add a memory barrier; and so on.
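
As a minimal, self-contained sketch (illustration only, not code from this PR) of how a backend could walk such a plan array, where the printf calls stand in for encoding kernels and inserting memory barriers:

```c
#include <stdio.h>

#define BARRIER -1  // separates groups of nodes that may run concurrently

int main(void) {
    // the example plan from above: 1, 4, 7 together, then 2, then 3, then 5, 6, 9
    const int plan[] = { 1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER };
    const int n = (int) (sizeof(plan)/sizeof(plan[0]));

    int group = 0;
    for (int i = 0; i < n; ++i) {
        if (plan[i] == BARRIER) {
            printf("-- memory barrier after group %d --\n", group++); // Metal: memory barrier; CUDA: stream sync
        } else {
            printf("group %d: encode node %d\n", group, plan[i]);     // Metal: encode the kernel for this node
        }
    }
    return 0;
}
```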

@lshzh-ww marked this pull request as draft July 21, 2023 20:04
@ggerganov (Member) commented Jul 22, 2023

May I ask if the topological structure (i.e. dependency relations) of the graph will change for each evaluation?

The topology is the same, but the shapes of the tensors change.

In general, I like the proposed idea and thanks to this work I have learned about one more Metal feature: computeCommandEncoderWithDispatchType: MTLDispatchTypeConcurrent.
Great contributions @lshzh-ww!

I think the GGML_OP_BARRIER introduction is not great, as it seems too intrusive to modify the compute graph, but the idea of obtaining the concurrency as an integer array and having each backend utilize it is good. Maybe the "graph change" notification can be avoided for now, and we can assume that each backend context always operates on the same graph topology.

If we do it this way and see that the approach is viable, we can think later to integrate it more tightly with ggml, following @slaren's backend work.

Ideally, the search_depth hack should be avoided if possible. It should be possible to check whether two tensors overlap by using their data member plus the ggml_nbytes() size to find the memory region written to by each one (similar to what we do in ggml-metal.m to find the corresponding Metal buffer for a tensor).
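
A minimal sketch of such a check (an illustration, not code from this PR): two tensors conflict if their [data, data + ggml_nbytes()) byte ranges intersect.

```c
#include <stdbool.h>
#include "ggml.h"

// Illustration only: treat two tensors as conflicting if the byte ranges
// [data, data + ggml_nbytes(t)) intersect. A real check may also need to
// handle views that alias a region inside another tensor's buffer.
static bool tensors_overlap(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    const char * a0 = (const char *) a->data;
    const char * a1 = a0 + ggml_nbytes(a);
    const char * b0 = (const char *) b->data;
    const char * b1 = b0 + ggml_nbytes(b);
    return a0 < b1 && b0 < a1;
}
```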

Btw, since you already have the means, I'm curious what memory bandwidth utilization Xcode reports for the F16 model. Also, is the 381 GB/s number for Q4_0 33B?

@slaren (Member) commented Jul 22, 2023

Having a ggml_func that parse the graph, then return an array like this [1, 4, 7, BARRIER, 2, BARRIER, 3, BARRIER, 5, 6, 9, BARRIER], (#define BARRIER -1) which means that nodes 1,4,7 can be run simultaneously, etc. Indexes between two BARRIER mean these nodes may be executed in parallel.

I think that wouldn't work entirely, because in some cases you want groups of nodes to execute in parallel with other groups of nodes. For example, parallel execution in CUDA when computing Kcur, Vcur and Qcur looks like this:
(image: parallel execution of Kcur, Vcur and Qcur in CUDA)

So you would need each element in the list to refer to a group of nodes, not just a single node. And this process would need to be recursive, because inside a group there may also be multiple subgroups that can be executed in parallel. So what you end up with is a graph, exactly the same as the one you started with.

@lshzh-ww (Contributor, Author) commented

Btw, since you already have the means, I'm curious what is the memory bandwidth utilization reported by Xcode for the F16 model? Also, is the 381 GB/s number for Q4_0 33B?

Yes, all figures in my original post are for a Q4_0 33B model. Here is the profiling result for the 13B F16 model (under this PR, not master).

It looks like our F16 kernel is somehow more bound by integer/bit ops (high ALU usage, low F32 and F16 usage). Similar behavior was also observed for Q4_0 models before PR #2248, because we had lots of bit ops on uint8. I haven't looked into the F16 kernel, so I don't know what really limits it.

@lshzh-ww (Contributor, Author) commented Jul 23, 2023

I think that wouldn't work entirely because in some cases you want groups of nodes to execute in parallel to other groups of nodes. For example parallel execution in CUDA when computing Kcur, Vcur and Qcur looks like this:

Thanks for your graph! Now I understand what you want to do. On Metal, a compute pass feels like what CUDA calls a stream, except that Metal also allows some concurrency within a single stream, separated by memory barriers. I simply put all commands into one stream; I realize this is quite a coarse-grained way to control concurrency.

For example, in CUDA you first issue tmpk, tmpq and tmpv concurrently, and you can issue Kcur as soon as tmpk finishes, without having to wait for tmpq and tmpv. In my code I also issue tmpk, tmpq and tmpv first, but I have to wait for all of them to finish before I can issue Kcur, Vcur and Qcur.
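
For illustration only (an assumption about how this could look, not ggml-cuda code), the fine-grained CUDA pattern can be expressed with plain streams: enqueue Kcur on the same stream as tmpk so it waits only for tmpk, while tmpq and tmpv run on their own streams. Here cudaMemsetAsync stands in for the real kernels:

```c
#include <cuda_runtime.h>

int main(void) {
    cudaStream_t sk, sq, sv;
    void *k, *q, *v;
    cudaMalloc(&k, 1024); cudaMalloc(&q, 1024); cudaMalloc(&v, 1024);
    cudaStreamCreate(&sk); cudaStreamCreate(&sq); cudaStreamCreate(&sv);

    cudaMemsetAsync(k, 0, 1024, sk);  // "tmpk"
    cudaMemsetAsync(q, 0, 1024, sq);  // "tmpq" runs concurrently
    cudaMemsetAsync(v, 0, 1024, sv);  // "tmpv" runs concurrently
    cudaMemsetAsync(k, 1, 1024, sk);  // "Kcur": stream ordering makes it wait only for "tmpk"

    cudaDeviceSynchronize();
    cudaFree(k); cudaFree(q); cudaFree(v);
    cudaStreamDestroy(sk); cudaStreamDestroy(sq); cudaStreamDestroy(sv);
    return 0;
}
```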

I think I could also use multiple streams in Metal to achieve finer-grained control over concurrency. On Metal, synchronizing between different streams has a larger overhead than inserting memory barriers within one stream, so I need to do some more profiling to see whether this change actually boosts performance, or at least matches it. If we are lucky, we can have a unified function that does the heavy lifting for all GPU backends, although for now I don't have an idea of how we should pass such information to all backends.

@lshzh-ww (Contributor, Author) commented

For metal, synchronizing between different Stream will have larger overhead than having memory barriers in one Stream, so I need to do some more profiling to see if this change actually boost or at least achieve similar performance.

On Metal there is a ~40 µs latency for switching to another stream, and it looks like Metal can only run two streams at a time. For Metal we are better off using the coarse-grained approach proposed in this PR.

Since Metal and CUDA will use different schemes to run operations concurrently, it's hard to provide one function that serves both backends. I will take @ggerganov's advice and move ggml_graph_find_concurrency into the ggml-metal.m file. Shall I open this PR against the ggml-backends branch, or should I wait for ggml-backends to be merged?

@slaren (Member) commented Jul 23, 2023

The ggml-backends branch is still going to take a while, so it would be better to open a PR to master instead.
