
Conversation

ggerganov (Member)

ref #16087 (comment)

Sample implementation of fusing over empty nodes (i.e. views, reshapes, etc.). For example, this sequence is now fusable:

ADD
VIEW
ADD
VIEW
ADD
...
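
A minimal sketch of the idea (the helper names here are hypothetical; the actual implementation lives in ggml-impl.h and the Metal backend and may differ): while matching an op pattern against the graph, nodes whose op only re-interprets existing data (NONE, VIEW, RESHAPE, PERMUTE, TRANSPOSE) are skipped instead of breaking the match.

```c
#include "ggml.h"
#include "ggml-impl.h" // struct ggml_cgraph internals (n_nodes, nodes)

// hypothetical helper: ops that do not compute anything and can be skipped
static bool fuse_is_empty_op(enum ggml_op op) {
    switch (op) {
        case GGML_OP_NONE:
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            return false;
    }
}

// hypothetical check: does the pattern ops[0..num_ops) appear starting at
// node i, ignoring any interleaved empty nodes?
static bool fuse_can_match(const struct ggml_cgraph * cgraph,
                           int i, const enum ggml_op * ops, int num_ops) {
    int matched = 0;
    for (; i < cgraph->n_nodes && matched < num_ops; i++) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        if (fuse_is_empty_op(node->op)) {
            continue; // skip views/reshapes between the real ops
        }
        if (node->op != ops[matched]) {
            return false; // pattern broken by a non-empty op
        }
        matched++;
    }
    return matched == num_ops;
}
```

With something along these lines, the ADD → VIEW → ADD → VIEW → ADD sequence above matches the pattern {GGML_OP_ADD, GGML_OP_ADD, GGML_OP_ADD}; data dependencies and use counts still have to be checked separately, as in the review snippet below.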

@github-actions bot added the "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" (https://en.wikipedia.org/wiki/Metal_(API)) labels on Sep 19, 2025
// excerpt from the fusion check under review: each node must match the
// pattern, and intermediate nodes must have exactly one use
if (node->op != ops[i]) {
    return false;
}
if (i < num_ops - 1 && !ggml_node_has_n_uses(cgraph, node_idx[i], 1)) {
    return false;
}
Collaborator

ggml_node_has_n_uses returns false if the node is a view, so this is probably too restrictive.

Member Author

Can you clarify? The intention is to avoid passing view nodes to ggml_can_fuse in the first place.

Collaborator

From what I understand, the caller has to explicitly remove the empty nodes before passing the sequence to this function. It would be nice if the function could do that removal itself; then ggml_can_fuse would work without each backend having to strip the empty ops before calling it.
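
For context, under the current approach each backend does something like the following before calling the fusion check. This is only a sketch with assumed names: backend_can_fuse is hypothetical, and the signature used for ggml_can_fuse_ext (the index-based variant from this PR) is assumed here.

```c
#include "ggml.h"
#include "ggml-impl.h" // struct ggml_cgraph internals (n_nodes, nodes)

// hypothetical backend-side filter: collect the indices of the next
// non-empty nodes and let the index-based check decide whether they
// form a fusable sequence
static bool backend_can_fuse(const struct ggml_cgraph * cgraph, int start,
                             const enum ggml_op * ops, int num_ops) {
    int idxs[8]; // assumes num_ops <= 8 for this sketch
    int n = 0;

    for (int i = start; i < cgraph->n_nodes && n < num_ops; i++) {
        const enum ggml_op op = cgraph->nodes[i]->op;

        const bool is_empty =
            op == GGML_OP_NONE    || op == GGML_OP_VIEW    ||
            op == GGML_OP_RESHAPE || op == GGML_OP_PERMUTE ||
            op == GGML_OP_TRANSPOSE;

        if (!is_empty) {
            idxs[n++] = i; // keep only nodes that actually compute something
        }
    }

    if (n < num_ops) {
        return false;
    }

    // index-based fusion check from this PR (exact signature assumed)
    return ggml_can_fuse_ext(cgraph, idxs, ops, num_ops);
}
```

The suggestion above is essentially to fold this filtering into ggml_can_fuse itself, so that each backend does not have to duplicate it.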

@calvin2021y

patch error:

error: patch failed: ggml/src/ggml-impl.h:602
error: ggml/src/ggml-impl.h: patch does not apply
error: patch failed: ggml/src/ggml-metal/ggml-metal-common.cpp:184
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply

@ggerganov force-pushed the gg/metal-fuse-non-seq branch 2 times, most recently from 62eccbf to 56095da on September 25, 2025 08:35
@ggerganov marked this pull request as ready for review on September 25, 2025 08:38
@ggerganov force-pushed the gg/metal-fuse-non-seq branch from 56095da to 832723f on September 27, 2025 07:26
@ggerganov merged commit 3b53634 into master on Sep 28, 2025
66 of 70 checks passed
@ggerganov deleted the gg/metal-fuse-non-seq branch on September 28, 2025 06:34
@joseph777111

@ggerganov The latest llama.cpp (b887d2f), which includes this commit, restores token/s speed for gpt-oss-20B on Metal (about 15-20 tokens/s). Thank you! 😋

@joseph777111

@ggerganov A question, though: after #16220, I noticed that gpt-oss-20b uses more memory than Qwen-3-30B-A3B-Thinking-2507 (Unsloth's Q2_K_XL quant), which has exactly the same file size (~12.5 GB). Prior to #16220, both models had roughly the same memory usage. Even after this commit, Qwen-3-30B-A3B-Thinking-2507 still uses the same amount of memory as before, but gpt-oss-20b seems to use more now. Is this down to Qwen-3-30B-A3B-Thinking-2507's architecture, or is there an issue with gpt-oss-20b? My guess is that the shapes of gpt-oss-20b's tensors differ from Qwen-3-30B-A3B-Thinking-2507's, causing it to use more memory, but I don't have the information to verify this. Any insight and help from you and the rest of the llama.cpp team would be greatly appreciated. 🤔

@joseph777111

Interestingly enough, I just tested running gpt-oss-20B with --no-mmap on my 2020 M1 MacBook Pro: it retains its 15-20 tokens/s, and the data that can't fit in unified memory gets pushed to swap. Now my computer stays responsive even while the model is running inference. I'm still curious about my earlier question, though. I don't think gpt-oss-20b should be using more memory, but I could be naïve and completely wrong in my assumption. 🤔

Note: I want to make one thing clear: this is not a troubleshooting or "how do I make this model run on my machine" inquiry. I know you guys are extremely busy, and I wouldn't dream of pestering you with inquiries that Google could help me troubleshoot first.

Thank you for your time and consideration. I appreciate it. 😋

@ggerganov (Member Author)

Memory usage should not have increased.

For a 16 GB MacBook, you can use this command for optimal performance:

llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe 12 -c 32768 --jinja --no-mmap
