metal : fuse non-sequential nodes #16102
Conversation
ggml/src/ggml-impl.h (Outdated)
if (node->op != ops[i]) {
    return false;
}

if (i < num_ops - 1 && !ggml_node_has_n_uses(cgraph, node_idx[i], 1)) {
ggml_node_has_n_uses returns false if the node is a view; this is probably too restrictive.
Can you clarify? The intention is to avoid passing view nodes to ggml_can_fuse in the first place.
From what I understand, the user has to explicitly remove the empty nodes before passing them to this function. It would be nice if this function could do the removal itself, because then ggml_can_fuse would work without each backend having to strip the empty ops before calling it.
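For illustration, here is a minimal sketch of what such a wrapper could look like. The helper names below (is_empty_op, can_fuse_skipping_empty) are hypothetical, and the indexed fusion check is assumed to follow the ggml_can_fuse_ext shape discussed in this PR; the actual signatures in ggml-impl.h may differ:

```c
#include "ggml-impl.h"

// Hypothetical helper: ops that move no data and can therefore be "fused over".
static bool is_empty_op(enum ggml_op op) {
    return op == GGML_OP_NONE    ||
           op == GGML_OP_VIEW    ||
           op == GGML_OP_RESHAPE ||
           op == GGML_OP_PERMUTE ||
           op == GGML_OP_TRANSPOSE;
}

// Hypothetical wrapper: collect the next num_ops non-empty nodes starting at
// `start`, skipping views/reshapes/etc., then hand the explicit index list to
// the indexed fusion check.
static bool can_fuse_skipping_empty(const struct ggml_cgraph * cgraph, int start,
                                    const enum ggml_op * ops, int num_ops,
                                    int * node_idx /* out: num_ops entries */) {
    int found = 0;
    for (int i = start; i < cgraph->n_nodes && found < num_ops; i++) {
        if (is_empty_op(cgraph->nodes[i]->op)) {
            continue; // transparent node, fuse over it
        }
        node_idx[found++] = i;
    }
    if (found < num_ops) {
        return false;
    }
    return ggml_can_fuse_ext(cgraph, node_idx, ops, num_ops);
}
```

With something like this in ggml-impl.h, a backend could call it on the raw graph without first building its own view-free node list.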
patch error: error: patch failed: ggml/src/ggml-impl.h:602
error: ggml/src/ggml-impl.h: patch does not apply
error: patch failed: ggml/src/ggml-metal/ggml-metal-common.cpp:184
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Force-pushed from 62eccbf to 56095da
Force-pushed from 56095da to 832723f
@ggerganov Latest llama.cpp (b887d2f), which includes this commit, restores token/s speed for gpt-oss-20B on Metal (about 15-20 tokens/s). Thank you! 😋
@ggerganov Question though: after #16220, I noticed gpt-oss-20b uses more memory than Qwen-3-30B-A3B-Thinking-2507 (Unsloth's Q2_K_XL quant), which is the exact same file size (~12.5 GB). Prior to #16220, both models had roughly the same memory usage. Even after this commit, Qwen-3-30B-A3B-Thinking-2507 still has the same memory usage as before, but gpt-oss-20b seems to use more now. Is this down to Qwen-3-30B-A3B-Thinking-2507's architecture, or is there an issue with gpt-oss-20b? My guess is that the shape of gpt-oss-20b's tensors differs from Qwen-3-30B-A3B-Thinking-2507's, causing it to use more memory, but I don't have the information to verify this. Any insight from you and the rest of the llama.cpp team would be greatly appreciated. 🤔
Interestingly enough, I just tested running gpt-oss-20B with --no-mmap on my 2020 M1 MacBook Pro - it retains its 15-20 tokens/s, and the data that can't fit in unified memory gets pushed to swap. Now my computer stays responsive even while the model is running inference. I'm still curious about my earlier question, though. I don't think gpt-oss-20b should be using more memory, but I could be naïve and completely wrong in my assumption. 🤔 Note: I want to make one thing clear: this is not a troubleshooting or "how do I make this model run on my machine" inquiry. I know you are all extremely busy, and I wouldn't dream of pestering you with questions that Google could help me troubleshoot first. Thank you for your time and consideration. I appreciate it. 😋
Memory usage should not have increased. For a 16 GB MacBook, you can use this command for optimal performance: llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe 12 -c 32768 --jinja --no-mmap
ref #16087 (comment)
Sample implementation of fusing over empty nodes (i.e. views, reshapes, etc.). For example, this sequence is now fusable:
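The concrete sequence from the PR description is not reproduced above; as an illustration of my own (assuming the common RMS_NORM + MUL fusion, which may not be the example the PR used), the kind of pattern this enables looks like:

```
node[i+0]: RMS_NORM   <- first op of the fused pattern
node[i+1]: VIEW       <- "empty" op: only metadata changes, no data is moved
node[i+2]: MUL        <- second op of the fused pattern
```

Previously the candidate ops had to be strictly consecutive in the graph, so the interleaved VIEW broke the match; with fusion over empty nodes the backend can step over it and still fuse { RMS_NORM, MUL }.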