Closed
Description
It is already possible on master to quantize the K cache via e.g. `-ctk q8_0`. However, this is currently broken on master for batch size 1. Disabling CUDA graphs via the environment variable `GGML_CUDA_DISABLE_GRAPHS=1` fixes the issue.
cc: @agray3
Activity
JohannesGaessler commented on May 23, 2024
To reproduce, for example:
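A command along these lines exercises the failing path; the model and dataset paths are placeholders and the exact flags of the original reproducer may differ, but `-b 1 -ub 1` forces the batch size 1 path where CUDA graphs are used:

```sh
# Hypothetical reproducer, not the original command: paths are placeholders.
# Quantized K cache plus batch size 1 so that the CUDA graph path is taken.
./perplexity -m models/llama-2-7b-q4_0.gguf -f wikitext-2-raw/wiki.test.raw \
    -ngl 99 -ctk q8_0 -b 1 -ub 1
```

With `GGML_CUDA_DISABLE_GRAPHS=1` set, the same command should produce normal perplexity values.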
agray3 commented on May 23, 2024
Noted - I'll take a look.
Disable CUDA Graphs for noncontiguous src0 and non fp16 src1
agray3 commented on May 24, 2024
It seems that this case hits conditions that introduce some extra memory copies in the matrix multiplication nodes, and those copies are causing the issue. A workaround is at agray3@a5fd193, which disables CUDA graphs under those specific conditions. However, I'm not sure if this is overkill and may unnecessarily disable CUDA graphs in other cases where they are desired - do you have any insight? I'm not yet sure what is causing the issue with the copies; it may be related to kernel parameter changes (like those I already dealt with for other copy kernels).
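For illustration, a minimal sketch of the kind of check such a workaround adds, assuming ggml's public graph and tensor fields; this is not the linked commit, and the exact conditions it tests may differ:

```cpp
#include "ggml.h"

// Return false if the graph contains a matrix multiplication whose src0 is
// non-contiguous or whose src1 is not F16; the caller would then skip CUDA
// graph capture/replay and execute the graph normally.
static bool cuda_graph_looks_safe(const struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        if (node->op != GGML_OP_MUL_MAT) {
            continue;
        }
        const struct ggml_tensor * src0 = node->src[0];
        const struct ggml_tensor * src1 = node->src[1];
        if ((src0 != NULL && !ggml_is_contiguous(src0)) ||
            (src1 != NULL && src1->type != GGML_TYPE_F16)) {
            return false; // these cases go through extra copy kernels
        }
    }
    return true;
}
```

A blanket condition like this also turns off CUDA graphs for any other workload that hits these mul_mat shapes, which is exactly the "overkill" concern raised above.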
JohannesGaessler commented on May 24, 2024
I noticed this bug in the context of working on a quantized KV cache for FlashAttention. These kernels do not (by themselves) do any memory copies but still suffer from this problem. So perhaps the issue is (also) the conversion from FP32 to the quantized format?
agray3 commented on May 27, 2024
I've now identified the issue - see the fix at #7565. The problem was that the implementation assumed only a single CUDA kernel was associated with nodes of type GGML_OP_CPY when performing parameter updates to the graph for each token, but in this case there are 2 such kernels (`cpy_f32_f16` and `cpy_f32_q`). The perplexity reproducer is now working for me with this fix, and CUDA graphs give a nice 23% speedup on my A100-PCIe system.
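To illustrate the shape of that fix, here is a minimal sketch of per-token parameter updates on an instantiated CUDA graph that matches nodes against a list of copy kernels rather than a single one. The kernels and the argument layout are hypothetical stand-ins; this is not the code from #7565:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for the real copy kernels (cpy_f32_f16, cpy_f32_q).
__global__ void copy_f32_f16_stub(const char * src, char * dst) {}
__global__ void copy_f32_q_stub  (const char * src, char * dst) {}

static bool is_copy_kernel(void * func) {
    // Check against every known copy kernel, not a single one.
    return func == (void *) copy_f32_f16_stub ||
           func == (void *) copy_f32_q_stub;
}

// Re-point the dst argument of every copy-kernel node at *new_dst_ptr.
// Assumes dst is kernel argument 1, as in the stubs above.
static void update_copy_dsts(cudaGraphExec_t exec, cudaGraph_t graph, char ** new_dst_ptr) {
    size_t n_nodes = 0;
    cudaGraphGetNodes(graph, nullptr, &n_nodes);
    std::vector<cudaGraphNode_t> nodes(n_nodes);
    cudaGraphGetNodes(graph, nodes.data(), &n_nodes);

    for (cudaGraphNode_t node : nodes) {
        cudaGraphNodeType type;
        if (cudaGraphNodeGetType(node, &type) != cudaSuccess || type != cudaGraphNodeTypeKernel) {
            continue;
        }
        cudaKernelNodeParams params;
        if (cudaGraphKernelNodeGetParams(node, &params) != cudaSuccess) {
            continue;
        }
        if (is_copy_kernel(params.func)) {
            params.kernelParams[1] = (void *) new_dst_ptr; // new dst read from here at launch
            cudaGraphExecKernelNodeSetParams(exec, node, &params);
        }
    }
}
```

The only point being illustrated is that the per-token update has to recognise both copy kernels; assuming a single one is what broke the quantized K cache case.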