CUDA: attention sinks for mma FlashAttention #15157
Conversation
The reason you get the same performance for ubatch sizes >= 2048 is likely because the default batch size of …
I thought that at some point we made it so that a physical batch size > 2048 would also raise the logical batch size, but I guess I misremembered.
5090 (WSL)
I was thinking today that the clamp with -9999 instead of -inf might not be correct and needs an extra look. nvm: this was fixed when you fused the activation function.
It might have been a fluke; I regenerated the same prompt a few times and it didn't happen again.
When I set …
When I calculate perplexity for the entire wikitext-2 test set I get virtually no difference: 244.1368 for a threshold of -20, 243.8748 for a threshold of -50.
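For intuition on why those two thresholds give essentially the same perplexity, here is a toy calculation (not the kernel code): a masked score clamped to the threshold contributes exp(threshold - max) to the softmax denominator, which is vanishingly small at -20 and below.

```cpp
#include <cmath>
#include <cstdio>

// Toy illustration of the thresholds discussed above: a clamped ("masked") score
// contributes exp(threshold - max) to the softmax denominator, which is negligible
// next to the O(1) terms coming from the unmasked scores.
int main() {
    const float max_score    = 2.0f;   // running maximum of the unmasked scores
    const float thresholds[] = {-20.0f, -50.0f};
    for (float t : thresholds) {
        std::printf("threshold %5.1f -> adds %.3e to the denominator\n",
                    t, std::exp(t - max_score));
    }
    return 0;
}
```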
@@ -3532,7 +3532,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
                 return op->src[1]->ne[0] == 576 && op->src[2]->ne[0] == 512 && op->src[3] && gqa_ratio % 16 == 0;
             }
             // TODO: more general-purpose attention sink support [TAG_ATTN_SINKS]
-            if (op->src[4] && op->src[0]->ne[0] != 64 && op->src[0]->ne[0] != 128) { // currently only sinks for head_size 64 and 128 are supported
+            if (op->src[4] && !fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc)
Is the !fp16_mma_available check needed here?
Not for model support, because I think the vector kernels cover all currently available models with attention sinks. But this enables running tests with attention sinks and head sizes != 64/128, so I thought it would be better to adjust it (see the sketch below).
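For reference, a sketch of what the adjusted supports_op check likely looks like as a whole. This is an assumption: the added line in the hunk above is truncated, and I am assuming it continues with the existing head-size conditions so that sinks are rejected only when neither the mma path (fp16 MMA available) nor the vector kernels (head sizes 64/128) can handle them.

```cpp
// Sketch (assumption): continuation of the truncated line from the hunk above.
// Reject sinks only if fp16 MMA is unavailable AND the head size is not one the
// vector kernels support.
if (op->src[4] && !fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc)
        && op->src[0]->ne[0] != 64 && op->src[0]->ne[0] != 128) {
    // without fp16 MMA, only sinks for head sizes 64 and 128 are supported
    return false;
}
```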
Just by looking at the code I am not seeing anything that would be wrong with it. For the next few days I should be available with low latency, so I'll monitor the issue tracker to see whether there is a systematic problem.
This may have caused some performance degradation. Anecdotally, I was running Qwen3-30B-A3B (Q5_XL) on a single RTX 3090. Token generation speed was ~120 TPS before this PR and ~60 TPS after building the latest commit.
@@ -282,7 +282,7 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
     const enum ggml_prec prec = ggml_flash_attn_ext_get_prec(KQV);

     // TODO: currently only vec implementation for sinks is supported [TAG_ATTN_SINKS]
-    if (sinks) {
+    if (sinks && !fp16_mma_available(cc)) {
@JohannesGaessler I think this breaks Volta, since fp16_mma_available is true but the wmma kernel doesn't yet support attention sinks.
You're right, this should be turing_mma_available.
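A sketch of the corrected dispatch under that suggestion, assuming turing_mma_available is the existing availability helper in the CUDA backend and that only the condition changes relative to the hunk above:

```cpp
// Force the vector kernels for sinks only when the Turing mma kernel is unavailable.
// On Volta fp16 MMA is available, but the wmma kernel does not yet support sinks,
// so it must still fall back to the vector path.
if (sinks && !turing_mma_available(cc)) {
    // ... dispatch to the vector kernels, as in the hunk above ...
}
```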
This PR adds support for attention sinks to the CUDA FlashAttention kernel that uses tensor cores (mma). On Turing or newer, attention sinks are now fully supported.
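For context on what an attention sink means here: each head carries one extra sink logit that participates in the softmax normalizer but has no associated value vector, so it simply absorbs part of the probability mass. A minimal host-side sketch of that idea, illustrative only and not the kernel code (the function name and signature are made up for this example):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative single-query attention with a per-head sink logit (host-side, fp32).
// scores[i] is the already-scaled Q*K_i dot product; V holds one value row per key.
// The sink contributes to the softmax denominator but has no value row.
std::vector<float> attend_with_sink(const std::vector<float> & scores,
                                    const std::vector<std::vector<float>> & V,
                                    float sink) {
    const std::size_t n = scores.size();
    const std::size_t d = V.empty() ? 0 : V[0].size();

    float max_val = sink;
    for (float s : scores) {
        max_val = std::max(max_val, s);
    }

    float denom = std::exp(sink - max_val); // sink term: normalizer only
    std::vector<float> out(d, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float w = std::exp(scores[i] - max_val);
        denom += w;
        for (std::size_t j = 0; j < d; ++j) {
            out[j] += w * V[i][j];
        }
    }
    for (float & o : out) {
        o /= denom; // the sink has already inflated the denominator
    }
    return out;
}
```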
Performance changes