Misc. bug: Vulkan backend with 7900XTX has severe performance dropoff at some batch sizes #10966

@Mushoz

Description

Name and Version

[docker@a242c844efbf ~]$ llama-cli-vulkan --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
version: 4384 (14b699e)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Problem description & steps to reproduce

llama-batched-bench-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8,16 -pps
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
build: 4384 (14b699e) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12

| PP  | TG  | B  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s    | S t/s  |
|-----|-----|----|------|--------|----------|--------|----------|--------|--------|
| 512 | 128 | 1  | 640  | 1.578  | 324.39   | 3.838  | 33.35    | 5.416  | 118.17 |
| 512 | 128 | 2  | 768  | 1.555  | 329.33   | 31.047 | 8.25     | 32.602 | 23.56  |
| 512 | 128 | 4  | 1024 | 1.570  | 326.11   | 33.209 | 15.42    | 34.779 | 29.44  |
| 512 | 128 | 8  | 1536 | 1.571  | 325.94   | 37.241 | 27.50    | 38.812 | 39.58  |
| 512 | 128 | 16 | 2560 | 1.575  | 325.05   | 28.106 | 72.87    | 29.681 | 86.25  |
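
If it helps to narrow down where the dropoff starts, the same run can be repeated with a finer -npl grid; this is just the reproduction command above with a different batch-size list, everything else unchanged:

# Same reproduction as above, with a finer -npl grid to see where the
# token-generation throughput starts to collapse.
llama-batched-bench-vulkan \
    -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf \
    -ngl 99 -npp 512 -ntg 128 \
    -npl 1,2,3,4,6,8,12,16 \
    -pps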

I understand that scaling at some batch sizes might be less than ideal, but at worst I would expect small regressions when no scaling can be achieved at all (due to the overhead of batched processing). Right now there is a massive performance loss, especially at batch sizes 2 and 4. Can anything be done to improve this? Unfortunately, poor batched performance makes speculative decoding on the Vulkan backend unusable.
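
To put numbers on it (reading S_TG as the aggregate decode rate across all parallel streams, which is how I understand llama-batched-bench reports it), a quick calculation from the table above:

# Compare aggregate TG throughput at each batch size against the
# single-stream rate (33.35 t/s). Values are copied from the S_TG column.
awk 'BEGIN {
    single = 33.35   # S_TG at B=1
    n = split("1:33.35 2:8.25 4:15.42 8:27.50 16:72.87", rows, " ")
    for (i = 1; i <= n; i++) {
        split(rows[i], kv, ":")
        printf "B=%2d: %6.2f t/s aggregate TG, %.2fx of single-stream\n", kv[1], kv[2], kv[2] / single
    }
}'

Even with zero scaling I would expect roughly 1.0x at every batch size; instead B=2 and B=4 land around 0.25x and 0.46x of the single-stream rate.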

First Bad Commit

No response

Relevant log output

No response
