ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops #15695
Conversation
This commit disables flash attention in the webgpu test. The motivation for this is that it seems like flash attention might not be supported for webgpu when using llvmpipe (not 100% sure though, as it works for me locally, but I'm running a different version of mesa). This is a snippet from the log:

```console
2025-08-31T10:49:20.2265119Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2266911Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2268797Z 27: 0.01.085.256 W llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2269971Z 27: 0.01.085.262 W llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2271542Z 27: 0.01.085.302 W llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2272942Z 27: 0.01.085.303 W llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2274119Z 27: 0.01.085.334 W llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2275271Z 27: 0.01.085.335 W llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2276529Z 27: 0.01.085.470 W llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2277776Z 27: 0.01.085.471 W llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2279308Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2281488Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
```
This reverts commit 522ae98.
Just want to see if this enables the CI webgpu tests to pass.
There is still a bug somewhere; it should not be hidden by disabling the test.
My intention was not to disable the test, but instead to set flash attention to off for WebGPU. My reasoning is that the default was previously off (unless I'm mistaken), and that the recent change to enable flash attention by default might be what is causing this test to start failing. I was mostly curious whether this would allow the test to pass.
The intention is to enable flash attention only if the backend supports it. If doing that check causes the backend to crash, then that indicates a problem somewhere, and it should not be hidden. Try this change to fix it instead:

```diff
diff --git a/ggml/src/ggml-webgpu/ggml-webgpu.cpp b/ggml/src/ggml-webgpu/ggml-webgpu.cpp
index 32f1e304e..4e3f152a7 100644
--- a/ggml/src/ggml-webgpu/ggml-webgpu.cpp
+++ b/ggml/src/ggml-webgpu/ggml-webgpu.cpp
@@ -1062,6 +1062,8 @@ static bool ggml_backend_webgpu_device_supports_op(ggml_backend_dev_t dev, const
         case GGML_OP_NONE:
         case GGML_OP_VIEW:
         case GGML_OP_PERMUTE:
+        case GGML_OP_TRANSPOSE:
+        case GGML_OP_RESHAPE:
             return true;
         case GGML_OP_CPY:
         case GGML_OP_SET_ROWS:
```
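For reference, here is a minimal sketch of how the reported op support can be queried through the public ggml-backend device API. It is illustrative test code, not part of this PR, and it assumes ggml was built with the WebGPU backend enabled so the device shows up in the registry:

```cpp
// Minimal sketch: ask every registered ggml backend device whether it reports
// support for a TRANSPOSE view, the same kind of query the scheduler relies on.
#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>

int main() {
    ggml_init_params params = {
        /* .mem_size   = */ 16u * 1024 * 1024,
        /* .mem_buffer = */ nullptr,
        /* .no_alloc   = */ true,  // only tensor metadata is needed, no data buffers
    };
    ggml_context * ctx = ggml_init(params);

    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 4);
    ggml_tensor * t = ggml_transpose(ctx, a);  // op == GGML_OP_TRANSPOSE (a view)

    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("%-10s supports TRANSPOSE: %s\n",
               ggml_backend_dev_name(dev),
               ggml_backend_dev_supports_op(dev, t) ? "yes" : "no");
    }

    ggml_free(ctx);
    return 0;
}
```

With the two extra cases in ggml_backend_webgpu_device_supports_op, the WebGPU device reports support for these view ops, which is exactly what the failing check at ggml-backend.cpp:789 in the log above was tripping over.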
This reverts commit 52794d3.
This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend. Co-authored-by: Diego Devesa <[email protected]>
It should also be added here (and return …): llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp, lines 611 to 614 in ea8412f.
This commit adds GGML_OP_TRANSPOSE and GGML_OP_RESHAPE cases to the ggml_webgpu_encode_node function in ggml-webgpu.cpp. The actual operations are not implemented yet, and are left as TODOs. Co-authored-by: Sigbjørn Skjæret <[email protected]>
ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops [no ci] Remove TODO comment about unimplemented operations.
ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops [no ci] Move GGML_OP_TRANSPOSE and GGML_OP_RESHAPE to the other no-op cases.
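To illustrate the end state described by these commits, here is a minimal sketch (not the upstream ggml_webgpu_encode_node code; the function name and return convention are placeholders) of treating view-type ops as no-ops when encoding a graph node:

```cpp
// Illustrative sketch only; the real logic lives in
// ggml/src/ggml-webgpu/ggml-webgpu.cpp and may use a different signature.
#include "ggml.h"

// Returns true if the node needs a compute shader dispatch, false for
// view-type ops that only change tensor metadata (no data movement).
static bool webgpu_node_needs_encoding(const ggml_tensor * node) {
    switch (node->op) {
        case GGML_OP_NONE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:  // grouped with the other no-op cases by this PR
        case GGML_OP_RESHAPE:    // grouped with the other no-op cases by this PR
            return false;
        default:
            return true;         // real ops would encode a WebGPU compute pass here
    }
}
```

Since TRANSPOSE and RESHAPE only create views over existing data, there is nothing to dispatch on the GPU, which is why they can sit next to GGML_OP_NONE, GGML_OP_VIEW, and GGML_OP_PERMUTE rather than getting kernels of their own.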
ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (#15695) This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend. Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>