Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 6119 (cd6983d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server \
--timeout 3000 \
--n-gpu-layers 999 \
--host 0.0.0.0 \
--port 9999 \
--ctx_size 24576 \
--flash_attn \
--temp 0.60 \
--top_k 20 \
--top_p 0.95 \
--min_p 0 \
--presence_penalty 1.5 \
--no-mmap \
--model /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf
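With the server above running, generation speed can be read from the response timings. A minimal sketch, assuming port 9999 as configured and that the /completion endpoint returns a timings object with per-second rates:

# Send a short generation request to the server started above and
# print the reported timings (generation speed is
# timings.predicted_per_second).
curl -s http://127.0.0.1:9999/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Write a short story about a robot.", "n_predict": 256}' \
  | jq '.timings'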
Problem description & steps to reproduce
Token generation performance degraded after building with #15157 (CUDA: attention sinks for mma FlashAttention) included.
Model: Qwen_Qwen3-30B-A3B-Q5_K_L.gguf
Before: ~120 TPS
After: ~60 TPS
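The numbers can also be checked outside the server with llama-bench. A minimal sketch, assuming the same model file and that -fa toggles FlashAttention (0 = off, 1 = on):

# Measure prompt processing (pp) and token generation (tg) speed
# with all layers offloaded and FlashAttention enabled, matching
# the server configuration above.
./llama-bench -m /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf -ngl 999 -fa 1

# Same run with FlashAttention disabled, for comparison.
./llama-bench -m /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf -ngl 999 -fa 0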
First Bad Commit
CUDA: attention sinks for mma FlashAttention #15157
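One way to confirm the first bad commit is to rebuild the commit just before the PR landed and re-run the same server command. A hedged sketch, assuming the PR was merged as a single commit whose message references #15157 (commands are illustrative, not the exact bisection used here):

# Find the commit that merged PR #15157 (assumes the squash/merge
# commit message references the PR number).
git log --oneline --grep='#15157'

# Build the parent of that commit; <commit> is the hash printed above.
git checkout <commit>~1
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j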