Name and Version
[docker@a242c844efbf ~]$ llama-cli-vulkan --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
version: 4384 (14b699e)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Problem description & steps to reproduce
llama-batched-bench-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8,16 -pps
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
build: 4384 (14b699e) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
| PP  | TG  | B  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s    | S t/s  |
|-----|-----|----|------|--------|----------|--------|----------|--------|--------|
| 512 | 128 | 1  | 640  | 1.578  | 324.39   | 3.838  | 33.35    | 5.416  | 118.17 |
| 512 | 128 | 2  | 768  | 1.555  | 329.33   | 31.047 | 8.25     | 32.602 | 23.56  |
| 512 | 128 | 4  | 1024 | 1.570  | 326.11   | 33.209 | 15.42    | 34.779 | 29.44  |
| 512 | 128 | 8  | 1536 | 1.571  | 325.94   | 37.241 | 27.50    | 38.812 | 39.58  |
| 512 | 128 | 16 | 2560 | 1.575  | 325.05   | 28.106 | 72.87    | 29.681 | 86.25  |
I understand that scaling at some batch sizes might be less than ideal, but at worst I would expect only small regressions when no scaling can be achieved at all (due to the overhead of batched processing). Right now there is a massive performance loss, especially at batch sizes 2 and 4: aggregate generation speed drops from 33.35 t/s at B=1 to 8.25 t/s at B=2, so each sequence runs far slower than a single unbatched sequence would (see the per-sequence breakdown sketched below). Can anything be done to improve this situation? The poor batched performance unfortunately makes speculative decoding on the Vulkan backend unusable.
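For reference, a small Python sketch (not part of any llama.cpp tooling) that recomputes per-sequence generation throughput from the S_TG column of the table above; the numbers are copied verbatim from the benchmark run and only restated per sequence:

```python
# Per-sequence generation throughput derived from the benchmark table above.
# (Illustrative only; the (B, S_TG) pairs are copied from the table.)

results = [(1, 33.35), (2, 8.25), (4, 15.42), (8, 27.50), (16, 72.87)]

baseline = results[0][1]  # single-sequence speed at B=1

for b, s_tg in results:
    per_seq = s_tg / b                # tokens/s seen by each individual sequence
    slowdown = baseline / per_seq     # slowdown factor vs. the B=1 case
    print(f"B={b:2d}: {per_seq:6.2f} t/s per sequence ({slowdown:.1f}x slower than B=1)")
```

Every batched configuration ends up at roughly 3–5 t/s per sequence versus 33 t/s unbatched, which is why small-batch workloads such as speculative decoding are hit so hard.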
First Bad Commit
No response
Relevant log output
No response