
Misc. bug: Performance degradation after attention sinks merge #15174

@abrimogard

Description


Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 6119 (cd6983d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server \
--timeout 3000 \
--n-gpu-layers 999 \
--host 0.0.0.0 \
--port 9999 \
--ctx_size 24576 \
--flash_attn \
--temp 0.60 \
--top_k 20 \
--top_p 0.95 \
--min_p 0 \
--presence_penalty 1.5 \
--no-mmap \
--model /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf

Problem description & steps to reproduce

Token generation performance degraded after building with the merged PR "CUDA: attention sinks for mma FlashAttention" #15157.
Model: Qwen_Qwen3-30B-A3B-Q5_K_L.gguf

Before: ~120 TPS
After: ~60 TPS
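
If it helps, a decode-throughput measurement along these lines (a sketch; llama-bench ships with llama.cpp, and the prompt/generation token counts here are arbitrary) should reproduce the drop without any llama-server overhead:

./llama-bench \
  -m /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf \
  -ngl 999 \
  -fa 1 \
  -p 512 \
  -n 128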

First Bad Commit

CUDA: attention sinks for mma FlashAttention #15157
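
To pin the regression to an exact commit rather than the PR as a whole, a bisect along these lines could be run between the two builds (a sketch; <good-commit> is a placeholder for the last build that still gave ~120 TPS, and the bench command is the one above):

git bisect start
git bisect bad cd6983d              # current build, ~60 TPS
git bisect good <good-commit>       # last known-good build, ~120 TPS
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
./build/bin/llama-bench -m /Qwen_Qwen3-30B-A3B-Q5_K_L.gguf -ngl 999 -fa 1 -n 128
git bisect good                     # or 'git bisect bad', depending on the measured TPS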

Relevant log output
