gpu: concurrently dispatch operations #2309
Conversation
It's advised that a program should only have one command buffer. This slows inference by ~1 ms on a 33B model, but we may avoid it by reusing the previous command queue.
This commit adds a ggml_graph_find_concurrency function to find whether some operations can be issued simultaneously by the GPU. Before sending a graph to the GPU backend, we can call the new function to find concurrency in the graph. This will sort all the nodes and insert memory barrier nodes if necessary. One can simply dismiss the barrier nodes and issue operations sequentially, or try to concurrently issue all the operations between two barriers.
Using the new ggml functions.
I will look into this in more detail later, but here are some things to consider:
Generally I am inclined to leave the implementation details of this to each backend, but if there is a way to do some pre-processing that could simplify the implementation in the backends, that could be interesting. I haven't looked into the details of the code yet.
Ahh... I didn't notice that there was a
It looks like tensors belonging to the same layer don't overlap with each other. I limit the search depth to avoid running tensors from two layers at the same time.
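As an illustration of the kind of overlap test this relies on (a minimal sketch, not the code in this PR; `ggml_tensors_overlap` is a hypothetical helper), two tensors can be treated as conflicting when the byte ranges they occupy intersect:

```objc
#include <stdbool.h>
#include <stdint.h>
#include "ggml.h"

// Hypothetical helper, not part of ggml: two tensors conflict if the byte
// ranges they occupy in memory intersect. A real dependency analysis also has
// to follow the src pointers of each node; this only shows the range test.
bool ggml_tensors_overlap(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    const uintptr_t a0 = (uintptr_t) a->data;
    const uintptr_t a1 = a0 + ggml_nbytes(a);
    const uintptr_t b0 = (uintptr_t) b->data;
    const uintptr_t b1 = b0 + ggml_nbytes(b);
    return a0 < b1 && b0 < a1;   // [a0, a1) and [b0, b1) intersect
}
```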
I agree. Figuring out dependencies is useful for all GPU backends, and may also be useful for the new tensor allocator. Backends can store this array in their
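As a sketch of that idea (hypothetical names only, not code from this PR), the result of the analysis could simply live next to the rest of the backend state so it is computed once and reused:

```objc
#include <stdbool.h>

// Hypothetical backend-side storage for the concurrency analysis, so it does
// not have to be recomputed for every evaluated graph. All names are made up.
struct ggml_backend_concur_info {
    int   n_items;   // number of entries in the reordered node list
    int * order;     // node indices in issue order, with barrier markers between groups
    bool  valid;     // invalidate whenever the graph topology changes
};
```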
The topology is the same, but the shapes of the tensors change.

In general, I like the proposed idea, and thanks to this work I have learned about one more Metal feature.

I think the

If we do it this way and see that the approach is viable, we can think later to integrate it more tightly with

Ideally, the

Btw, since you already have the means, I'm curious what is the memory bandwidth utilization reported by Xcode for the F16 model? Also, is the 381 GB/s number for
Yes, all figures in my original post are for a
Thanks for your graph! Now I understand what you want to do.

On Metal a

For example, in CUDA you first issue

I think I can also use multiple
On Metal there is a ~40 µs latency for switching to another

Since Metal and CUDA will use different schemes to run operations concurrently, it's hard to provide one function that serves both backends. I will take the advice from @ggerganov and move the
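For reference, here is a minimal Objective-C sketch of the Metal scheme discussed above: a single compute encoder created with `MTLDispatchTypeConcurrent`, with a `memoryBarrierWithScope:` call between groups. The pipeline states, buffers, and grid sizes are placeholders, not code from this PR.

```objc
#import <Metal/Metal.h>

// Sketch: encode two independent kernels on a concurrent dispatch encoder,
// then place a memory barrier before a kernel that consumes their results.
// pipelineA/pipelineB/pipelineC, the buffers and the grid sizes are placeholders.
void encode_concurrent_groups(id<MTLCommandBuffer> cmd_buf,
                              id<MTLComputePipelineState> pipelineA,
                              id<MTLComputePipelineState> pipelineB,
                              id<MTLComputePipelineState> pipelineC,
                              id<MTLBuffer> src, id<MTLBuffer> dstA,
                              id<MTLBuffer> dstB, id<MTLBuffer> out) {
    id<MTLComputeCommandEncoder> enc =
        [cmd_buf computeCommandEncoderWithDispatchType:MTLDispatchTypeConcurrent];

    MTLSize tgs = MTLSizeMake(16, 1, 1);   // threadgroups per grid (placeholder)
    MTLSize tpt = MTLSizeMake(256, 1, 1);  // threads per threadgroup (placeholder)

    // group 1: A and B only read src, so the GPU may overlap them
    [enc setComputePipelineState:pipelineA];
    [enc setBuffer:src  offset:0 atIndex:0];
    [enc setBuffer:dstA offset:0 atIndex:1];
    [enc dispatchThreadgroups:tgs threadsPerThreadgroup:tpt];

    [enc setComputePipelineState:pipelineB];
    [enc setBuffer:src  offset:0 atIndex:0];
    [enc setBuffer:dstB offset:0 atIndex:1];
    [enc dispatchThreadgroups:tgs threadsPerThreadgroup:tpt];

    // barrier: C reads dstA/dstB, so it must wait for group 1 to finish
    [enc memoryBarrierWithScope:MTLBarrierScopeBuffers];

    [enc setComputePipelineState:pipelineC];
    [enc setBuffer:dstA offset:0 atIndex:0];
    [enc setBuffer:dstB offset:0 atIndex:1];
    [enc setBuffer:out  offset:0 atIndex:2];
    [enc dispatchThreadgroups:tgs threadsPerThreadgroup:tpt];

    [enc endEncoding];
}
```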
This is an early attempt to maximize throughput for all GPU backends. For now the Metal backend runs the operations in a graph in a serial manner. That is, for an element-wise or a reduce operation that only needs a few hundred threads, the GPU will only use a few hundred threads, even though modern GPUs can run tens of thousands of threads at the same time. (The CUDA backend may also have this problem, I am not sure.)
The commit resolves this by providing a `ggml_graph_find_concurrency` function to find whether some operations can be issued simultaneously by the GPU. Before sending a graph to the GPU backend, we can call the new function to find concurrency in the graph. This will sort all the nodes and insert memory barrier nodes (nodes with `op=GGML_OP_BARRIER`) if necessary. One can simply dismiss the barrier nodes and issue operations sequentially, or try to concurrently issue all the operations between two barriers.
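To make the intended use concrete, here is a minimal sketch of how a backend might consume the reordered node list: split it into groups at the barrier nodes and issue each group together, or simply one by one. `dispatch_node` and `wait_for_group` are hypothetical placeholders for backend-specific code; only `GGML_OP_BARRIER` comes from this PR.

```objc
#include "ggml.h"   // assumes this PR's ggml.h, which adds GGML_OP_BARRIER to ggml_op

// Hypothetical backend-specific helpers, not part of ggml:
void dispatch_node(struct ggml_tensor * node);   // encode/launch one op
void wait_for_group(void);                       // wait for the in-flight group

// Sketch: walk the node list reordered by ggml_graph_find_concurrency and
// issue all operations between two barriers as one group.
void issue_graph(struct ggml_cgraph * gf) {
    int group_start = 0;
    for (int i = 0; i <= gf->n_nodes; i++) {
        if (i == gf->n_nodes || gf->nodes[i]->op == GGML_OP_BARRIER) {
            // nodes in [group_start, i) have no mutual dependencies: they can
            // be dispatched concurrently, or simply one by one in order
            for (int j = group_start; j < i; j++) {
                dispatch_node(gf->nodes[j]);
            }
            wait_for_group();        // honor the barrier before the next group
            group_start = i + 1;     // skip the barrier node itself
        }
    }
}
```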


Operations #4, #7 and #18 have no dependencies and can run concurrently. The `ggml_graph_find_concurrency` function reordered the graph, put them together, and inserted memory barriers before and after them.

On the master branch, the net inference time for a 33B model is ~69.6 ms/tok on my M1 Max:


With this PR, the net inference time for a 33B model is ~65.5 ms/tok on my M1 Max.
However, for now we create the graph, find concurrency, and encode the commands anew for every single token (at least for Metal). The gain is pretty much killed by this, leading to a poor ~0.3 ms/tok speed-up. So, @ggerganov @slaren, is it possible to have some mechanism in the new backend interface to tell the GPU that the graph is unchanged from last time? This way we could save the time spent encoding the commands for the GPU, as well as the time spent finding concurrency.
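One possible shape for such a mechanism, purely as a hedged sketch (`backend_graph_compute`, the `topology_changed` flag, and the shown `ggml_graph_find_concurrency` signature are all assumptions, not existing ggml API): redo the concurrency analysis only when the caller signals that the graph topology changed, and otherwise reuse the previous result (and, where the backend supports it, the previously encoded commands).

```objc
#include <stdbool.h>
#include "ggml.h"

// Assumed signature of the function added by this PR; the real one may differ.
void ggml_graph_find_concurrency(struct ggml_cgraph * gf);

// Hypothetical backend entry point: the caller passes a flag saying whether
// the graph topology changed since the last call. For token-by-token
// inference the topology stays the same and only the tensor shapes change, so
// the analysis (and ideally the command encoding) can be reused.
static bool g_order_valid = false;

void backend_graph_compute(struct ggml_cgraph * gf, bool topology_changed) {
    if (topology_changed || !g_order_valid) {
        ggml_graph_find_concurrency(gf);   // sort nodes, insert barrier nodes
        g_order_valid = true;
        // a backend could also (re-)encode and cache its command buffer here
    }
    // issue the already-reordered nodes to the GPU, reusing cached commands
    // when the backend supports it
}
```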