CUDA graphs break quantized K cache #7492

Closed · JohannesGaessler opened this issue May 23, 2024 · 5 comments · Fixed by #7565
Labels: bug (Something isn't working) · Nvidia GPU (Issues specific to Nvidia GPUs)

Comments

@JohannesGaessler (Collaborator)

It is already possible on master to quantize the K cache, e.g. via -ctk q8_0. However, this is currently broken for batch size 1. Disabling CUDA graphs via the environment variable GGML_CUDA_DISABLE_GRAPHS=1 fixes the issue.

cc: @agray3

JohannesGaessler added the bug (Something isn't working) and Nvidia GPU (Issues specific to Nvidia GPUs) labels and removed the bug-unconfirmed label on May 23, 2024
@JohannesGaessler (Collaborator, Author)

To reproduce, for example:

./perplexity --file wikitext-2-raw/wiki.test.raw --n-gpu-layers 99 --model models/opt/${model_name}-${quantization}.gguf --chunks 1 -ctk q8_0 -b 1

@agray3 (Contributor) commented May 23, 2024

Noted - I'll take a look.

agray3 added a commit to agray3/llama.cpp that referenced this issue on May 24, 2024
@agray3 (Contributor) commented May 24, 2024

It seems that this case hits conditions which introduce some extra memory copies in matrix multiplication nodes, and these copies are what cause the issue. A workaround is at agray3@a5fd193, which disables CUDA graphs under those specific conditions. However, I'm not sure whether this is overkill and may unnecessarily disable CUDA graphs for other cases where they are desired - do you have any insight? I'm not yet sure what is causing the issue with the copies; it may be related to kernel parameter changes (like the ones I already dealt with for other copy kernels).
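The general pattern of such a workaround is to inspect the ggml graph before using a CUDA graph and fall back to plain stream launches when an unsupported node pattern is found. Below is a minimal sketch of that pattern; the specific condition shown (a mat-mul with a quantized weight-like operand) is purely illustrative and is not taken from agray3@a5fd193.

```cpp
// Illustrative sketch only: decide whether the current ggml graph is safe to
// run via a CUDA graph, and fall back to plain stream launches otherwise.
// The condition below is a hypothetical example, not the actual check used in
// the workaround commit.
#include "ggml.h"

static bool ggml_cuda_graph_is_safe(const struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        // Hypothetical condition: matrix multiplications whose weight-like
        // operand is quantized and may need an extra copy/dequantization step.
        if (node->op == GGML_OP_MUL_MAT && ggml_is_quantized(node->src[0]->type)) {
            return false; // disable CUDA graphs for this graph
        }
    }
    return true; // no problematic nodes found; a CUDA graph can be used
}
```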

@JohannesGaessler (Collaborator, Author)

I noticed this bug in the context of working on quantized KV cache for FlashAttention. These kernels (by themselves) do not do any memory copies but still suffer from this problem. So perhaps the issue is (also) the conversion of FP32 to the quantized format?

@agray3 (Contributor) commented May 27, 2024

I've now identified the issue - see the fix at #7565. The problem was that the implementation assumed that only a single CUDA kernel was associated with nodes of type GGML_OP_CPY when performing param updates to the graph for each token. But in this case there are 2 such kernels (cpy_f32_f16 and cpy_f32_q). The perplexity reproducer is now working for me with this fix (and CUDA graphs give a nice 23% speedup on my A100-PCIe system).
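For context, here is a minimal sketch of the idea behind such a fix, using illustrative names rather than the exact llama.cpp symbols: the graph-update code keeps the function pointers of every copy kernel used by GGML_OP_CPY nodes, instead of assuming there is only one, and matches the captured kernel nodes against all of them when patching their parameters for each token. The assumption that the destination pointer is kernel argument index 1 is part of the sketch, not taken from #7565.

```cpp
// Illustrative sketch (names are not the exact llama.cpp symbols): keep the
// function pointers of ALL copy kernels used by GGML_OP_CPY nodes and match
// captured kernel nodes against every one of them when updating parameters.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

struct cuda_graph_state {
    cudaGraphExec_t                   instance;  // instantiated graph, launched each token
    std::vector<cudaGraphNode_t>      nodes;     // kernel nodes extracted from the captured graph
    std::vector<cudaKernelNodeParams> params;    // their launch parameters
    std::vector<void *>               cpy_fns;   // function pointers of all copy kernels
                                                 // (e.g. f32->f16 and f32->q8_0), not just one
};

// Patch the destination pointers of the KV-cache copy kernels for the new token.
// Assumption in this sketch: the destination pointer is kernel argument #1.
static void update_cpy_kernel_params(cuda_graph_state & g, char ** new_dst_ptrs, int n_cpy) {
    int i_cpy = 0;
    for (size_t i = 0; i < g.nodes.size() && i_cpy < n_cpy; ++i) {
        const bool is_cpy = std::find(g.cpy_fns.begin(), g.cpy_fns.end(),
                                      g.params[i].func) != g.cpy_fns.end();
        if (is_cpy) {
            g.params[i].kernelParams[1] = (void *) &new_dst_ptrs[i_cpy++];
            cudaGraphExecKernelNodeSetParams(g.instance, g.nodes[i], &g.params[i]);
        }
    }
}
```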
