
Conversation

@danbev (Member) commented Aug 31, 2025

This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>

This commit disables flash attention in the webgpu test.

The motivation for this is that it seems like flash attention might not
be supported for webgpu when using llvmpipe (not 100% sure though, as it
works for me locally, but I'm running a different version of mesa). This
is a snippet from the log:
```console
2025-08-31T10:49:20.2265119Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2266911Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2268797Z 27: ␛[34m0.01.085.256␛[0m ␛[35mW llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2269971Z 27: ␛[0m␛[34m0.01.085.262␛[0m ␛[35mW llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2271542Z 27: ␛[0m␛[34m0.01.085.302␛[0m ␛[35mW llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2272942Z 27: ␛[0m␛[34m0.01.085.303␛[0m ␛[35mW llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2274119Z 27: ␛[0m␛[34m0.01.085.334␛[0m ␛[35mW llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2275271Z 27: ␛[0m␛[34m0.01.085.335␛[0m ␛[35mW llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2276529Z 27: ␛[0m␛[34m0.01.085.470␛[0m ␛[35mW llama_context: layer 0 is assigned to device WebGPU but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
2025-08-31T10:49:20.2277776Z 27: ␛[0m␛[34m0.01.085.471␛[0m ␛[35mW llama_context: Flash Attention was auto, set to disabled
2025-08-31T10:49:20.2279308Z 27: ␛[0m/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
2025-08-31T10:49:20.2281488Z 27: /home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:789: pre-allocated tensor (cache_v_l0 (view) (permuted) (transposed)) in a buffer (WebGPU) that cannot run the operation (TRANSPOSE)
```
@github-actions github-actions bot added the devops (improvements to build systems and github actions) label Aug 31, 2025
danbev added 2 commits August 31, 2025 16:32
Just want to see if this allows the CI webgpu tests to pass.
@slaren (Member) commented Aug 31, 2025

This is still a bug somewhere; it should not be hidden by disabling the test.

@github-actions github-actions bot added the testing (Everything test related) label Aug 31, 2025
@danbev (Member, Author) commented Aug 31, 2025

> This is still a bug somewhere; it should not be hidden by disabling the test.

My intention was not to disable the test, but rather to set flash attention to off for WebGPU. My reasoning was that the default was previously off (unless I'm mistaken), and that the recent change to enable flash attention by default might be what is causing this test to start failing. I was mostly curious whether this would allow the test to pass.

@slaren (Member) commented Aug 31, 2025

The intention is to enable flash attention only if the backend supports it. If doing that check causes the backend to crash, then that indicates a problem somewhere, and it should not be hidden.

Try this change to fix it instead:

```diff
diff --git a/ggml/src/ggml-webgpu/ggml-webgpu.cpp b/ggml/src/ggml-webgpu/ggml-webgpu.cpp
index 32f1e304e..4e3f152a7 100644
--- a/ggml/src/ggml-webgpu/ggml-webgpu.cpp
+++ b/ggml/src/ggml-webgpu/ggml-webgpu.cpp
@@ -1062,6 +1062,8 @@ static bool ggml_backend_webgpu_device_supports_op(ggml_backend_dev_t dev, const
         case GGML_OP_NONE:
         case GGML_OP_VIEW:
         case GGML_OP_PERMUTE:
+        case GGML_OP_TRANSPOSE:
+        case GGML_OP_RESHAPE:
             return true;
         case GGML_OP_CPY:
         case GGML_OP_SET_ROWS:
```
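For context, a small standalone sketch (not part of this PR's diff; it only uses the public ggml API) of why these two ops can simply be reported as supported: ggml_transpose and ggml_reshape_2d create views over the source tensor's data, so there is no kernel the backend would ever need to run for them.

```cpp
#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * t  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 8);
    struct ggml_tensor * tt = ggml_transpose(ctx, t);        // view: swapped dims/strides, same data
    struct ggml_tensor * tr = ggml_reshape_2d(ctx, t, 8, 4); // view: same buffer, new shape

    printf("transpose: %lld x %lld, reshape: %lld x %lld\n",
           (long long) tt->ne[0], (long long) tt->ne[1],
           (long long) tr->ne[0], (long long) tr->ne[1]);

    ggml_free(ctx);
    return 0;
}
```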

danbev and others added 2 commits August 31, 2025 17:55
This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>
@danbev danbev changed the title from "ci : disable flash attention for webgpu test" to "ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops" Aug 31, 2025
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) label Aug 31, 2025
@danbev danbev marked this pull request as ready for review August 31, 2025 18:34
@CISC (Collaborator) commented Aug 31, 2025

It should also be added here (and return true; not that it matters yet, since the return value is not checked):

```cpp
        case GGML_OP_NONE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
            return false;
```
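For illustration, a minimal sketch (hypothetical; the real ggml_webgpu_encode_node takes additional backend state and handles many more cases) of what the no-op branch could look like once GGML_OP_TRANSPOSE and GGML_OP_RESHAPE are grouped with the other view-only cases and reported as handled:

```cpp
#include "ggml.h"

// Hypothetical helper mirroring only the no-op branch of ggml_webgpu_encode_node;
// the actual function in ggml-webgpu.cpp also encodes GPU commands for compute ops.
static bool webgpu_is_noop_node(const struct ggml_tensor * node) {
    switch (node->op) {
        // view-only ops: nothing needs to be encoded for the GPU
        case GGML_OP_NONE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
        case GGML_OP_RESHAPE:
            return true;  // per the suggestion above: report the node as handled
        default:
            return false; // real compute ops are dispatched by the actual encoder
    }
}
```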

This commit adds GGML_OP_TRANSPOSE and GGML_OP_RESHAPE cases to the
ggml_webgpu_encode_node function in ggml-webgpu.cpp. The actual
operations are not implemented yet, and are left as TODOs.

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…s [no ci]

Remove TODO comment about unimplemented operations.
…s [no ci]

Move GGML_OP_TRANSPOSE and GGML_OP_RESHAPE to the other no-op cases.
@danbev danbev merged commit 77dee9d into ggml-org:master Sep 1, 2025
1 check passed
@danbev danbev deleted the ci-webgpu-flash-attention-disable branch September 1, 2025 13:41
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops

This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>