metal : fuse non-sequential nodes #16102
Conversation
ggml/src/ggml-impl.h (Outdated)
if (node->op != ops[i]) {
    return false;
}

if (i < num_ops - 1 && !ggml_node_has_n_uses(cgraph, node_idx[i], 1)) {
ggml_node_has_n_uses returns false if the node is a view; this is probably too restrictive.
Can you clarify? The intention is to avoid passing view nodes to ggml_can_fuse in the first place.
From what I understand, the user has to explicitly remove the empty nodes before passing them to this function. It would be nice if this function could do the removal itself, because then ggml_can_fuse would work without each backend having to strip the empty ops before calling it.
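For illustration, here is a minimal sketch of what such a wrapper could look like. The helper names below (is_empty_op, can_fuse_skipping_empty) are hypothetical, and the indexed fusion check is assumed to follow the ggml_can_fuse_ext shape discussed in this PR; the actual signatures in ggml-impl.h may differ:

```c
#include "ggml-impl.h"

// Hypothetical helper: ops that move no data and can therefore be "fused over".
static bool is_empty_op(enum ggml_op op) {
    return op == GGML_OP_NONE    ||
           op == GGML_OP_VIEW    ||
           op == GGML_OP_RESHAPE ||
           op == GGML_OP_PERMUTE ||
           op == GGML_OP_TRANSPOSE;
}

// Hypothetical wrapper: collect the next num_ops non-empty nodes starting at
// `start`, skipping views/reshapes/etc., then hand the explicit index list to
// the indexed fusion check.
static bool can_fuse_skipping_empty(const struct ggml_cgraph * cgraph, int start,
                                    const enum ggml_op * ops, int num_ops,
                                    int * node_idx /* out: num_ops entries */) {
    int found = 0;
    for (int i = start; i < cgraph->n_nodes && found < num_ops; i++) {
        if (is_empty_op(cgraph->nodes[i]->op)) {
            continue; // transparent node, fuse over it
        }
        node_idx[found++] = i;
    }
    if (found < num_ops) {
        return false;
    }
    return ggml_can_fuse_ext(cgraph, node_idx, ops, num_ops);
}
```

With something like this in ggml-impl.h, a backend could call it on the raw graph without first building its own view-free node list.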
patch error: error: patch failed: ggml/src/ggml-impl.h:602
error: ggml/src/ggml-impl.h: patch does not apply
error: patch failed: ggml/src/ggml-metal/ggml-metal-common.cpp:184
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Force-pushed from 62eccbf to 56095da
Force-pushed from 56095da to 832723f
@ggerganov Latest llama.cpp (b887d2f), which includes this commit, restores token/s speed for gpt-oss-20B on Metal (about 15-20 tokens/s). Thank you! 😋
@ggerganov Question though: after #16220, I noticed gpt-oss-20b uses more memory than Qwen-3-30B-A3B-Thinking-2507 (Unsloth's Q2_K_XL quant), which is the exact same file size (~12.5 GB). Prior to #16220, both models had roughly the same memory usage. Even after this commit, Qwen-3-30B-A3B-Thinking-2507 still has the same memory usage as before, but gpt-oss-20b seems to use more now. Is this down to Qwen-3-30B-A3B-Thinking-2507's architecture, or is there an issue with gpt-oss-20b? My guess is that the shape of gpt-oss-20b's tensors differs from Qwen-3-30B-A3B-Thinking-2507's, causing it to use more memory, but I don't have the information to verify this. Any insight from you and the rest of the llama.cpp team would be greatly appreciated. 🤔
Interestingly enough, I just tested running gpt-oss-20B with --no-mmap on my 2020 M1 MacBook Pro - it retains its 15-20 tokens/s, and the data that can't fit in unified memory gets pushed to swap. Now my computer stays responsive even while the model is running inference. I'm still curious about my earlier question, though. I don't think gpt-oss-20b should be using more memory, but I could be naïve and completely wrong in my assumption. 🤔 Note: I want to make one thing clear: this is not a troubleshooting or "how do I make this model run on my machine" inquiry. I know you are all extremely busy, and I wouldn't dream of pestering you with questions that Google could help me troubleshoot first. Thank you for your time and consideration. I appreciate it. 😋
Memory usage should not have increased. For a 16 GB MacBook, you can use this command for optimal performance: llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe 12 -c 32768 --jinja --no-mmap
ref #16087 (comment)
Sample implementation of fusing over empty nodes (i.e. views, reshapes, etc.). For example, this sequence is now fusable:
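The concrete sequence from the PR description is not reproduced above; as an illustration of my own (assuming the common RMS_NORM + MUL fusion, which may not be the example the PR used), the kind of pattern this enables looks like:

```
node[i+0]: RMS_NORM   <- first op of the fused pattern
node[i+1]: VIEW       <- "empty" op: only metadata changes, no data is moved
node[i+2]: MUL        <- second op of the fused pattern
```

Previously the candidate ops had to be strictly consecutive in the graph, so the interleaved VIEW broke the match; with fusion over empty nodes the backend can step over it and still fuse { RMS_NORM, MUL }.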