Closed
Description
It is already possible on master to quantize the K cache, e.g. via -ctk q8_0. However, this is currently broken on master for batch size 1. Disabling CUDA graphs via the environment variable GGML_CUDA_DISABLE_GRAPHS=1 fixes the issue.
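For anyone who wants to try the workaround, a minimal sketch follows. The binary path, model path, and prompt are placeholders I am assuming for illustration (they are not taken from this report); only -ctk q8_0 and GGML_CUDA_DISABLE_GRAPHS=1 come from the description above.

```shell
# Workaround from the report: disable CUDA graphs for this shell session.
export GGML_CUDA_DISABLE_GRAPHS=1

# Hypothetical paths (assumptions): point these at your own build and model.
LLAMA_BIN="${LLAMA_BIN:-./llama-cli}"
MODEL="${MODEL:-model.gguf}"

# Run with a q8_0-quantized K cache only if both paths exist.
if [ -x "$LLAMA_BIN" ] && [ -f "$MODEL" ]; then
    "$LLAMA_BIN" -m "$MODEL" -ctk q8_0 -p "Hello"
else
    echo "skipping run: set LLAMA_BIN and MODEL to existing paths" >&2
fi
```

Running the same command with the variable unset should let you compare the two behaviours on your own build.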
cc: @agray3