
Fixed OpenLLaMA 3b CUDA mul_mat_vec_q #2144

Merged

Conversation

@JohannesGaessler (Collaborator)

Fixes #2136. The issue was that the weight tensors had row sizes that were not multiples of 128. I fixed it by padding the quantized vector and the last row of the weight tensors to a multiple of 128. This is preferable to adding bounds checks to the CUDA kernels since it has better performance.
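A minimal sketch of the padding idea, assuming a 128-element granularity as described above (the helper name and call sites are illustrative, not the PR's actual code):

```cpp
#include <cstddef>

// Illustrative helper: round a row size up to the granularity at which the
// CUDA kernel reads, so vectorized loads past the logical end of the row
// stay inside the allocation. Name and usage are assumptions.
static size_t pad_row_size(size_t ne, size_t granularity) {
    return ((ne + granularity - 1) / granularity) * granularity;
}

// e.g. allocate pad_row_size(ne00, 128) elements for the quantized vector
// and for the buffer holding the last row of each weight tensor; the tail
// is padding only and never contributes to the logical result.
```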

@slaren (Member) commented Jul 8, 2023

The generation looks fine, but compute-sanitizer still reports out of bounds accesses:

```
========= Invalid __global__ read of size 1 bytes
=========     at 0xb10 in /home/slaren/code/llama.cpp/ggml-cuda.cu:1231:vec_dot_q4_0_q8_1(const void *, const block_q8_1 *, int)
=========     by thread (16,0,0) in block (0,3199,0)
=========     Address 0xdb257e402 is out of bounds
=========     and is 3 bytes after the nearest allocation at 0xdb2000000 of size 5760000 bytes
=========     Device Frame:/home/slaren/code/llama.cpp/ggml-cuda.cu:1388:void mul_mat_vec_q<(int)32, (int)4, block_q4_0, &vec_dot_q4_0_q8_1>(const void *, const void *, float *, int, int) [0xaf0]
```

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from f437f6a to e6b7a4f on July 8, 2023 at 13:01
@JohannesGaessler (Collaborator, Author)

Thank you for pointing this out; I should have checked it. The value for QUANTIZE_BLOCK_SIZE was incorrect: it should have been 256 instead of 128. One q4 block contains 4 integers, so one warp accesses 8 blocks = 256 values at once.
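To spell out the arithmetic (the QK4_0/QI4_0 values below match ggml's q4_0 format definitions; the per-warp layout is as described above):

```cpp
constexpr int QK4_0     = 32; // quantized values per q4_0 block
constexpr int QI4_0     = 4;  // 32-bit integers of packed quants per block
constexpr int WARP_SIZE = 32; // threads per warp

// one thread per integer => a warp covers WARP_SIZE / QI4_0 = 8 blocks per pass
constexpr int blocks_per_warp = WARP_SIZE / QI4_0;       // 8
constexpr int values_per_warp = blocks_per_warp * QK4_0; // 8 * 32 = 256

static_assert(values_per_warp == 256, "padding granularity must be 256, not 128");
```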

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from e6b7a4f to 52f90f2 on July 8, 2023 at 13:32
@JohannesGaessler (Collaborator, Author)

I added another change: the padding is now memset to 0. Though unlikely, it is possible for the uninitialized memory to encode a NaN, which could make the sum over the entire row NaN.
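A hedged sketch of zeroing the padding tail (the function and parameter names here are assumptions for illustration, not the PR's identifiers):

```cpp
#include <cuda_runtime.h>

// Illustrative: zero only the padding bytes of a device buffer so that
// leftover memory cannot decode to a NaN and poison the row sum.
static void zero_padding(void * dev_buf, size_t used_bytes,
                         size_t padded_bytes, cudaStream_t stream) {
    if (padded_bytes > used_bytes) {
        cudaMemsetAsync((char *) dev_buf + used_bytes, 0,
                        padded_bytes - used_bytes, stream);
    }
}
```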

@slaren (Member) commented Jul 8, 2023

So, if I understand correctly, the code depends on the value of CUDA_QUANTIZE_BLOCK_SIZE to ensure that row sizes are always a multiple of 256. I am a bit concerned that what was previously just a parameter that could be tuned for the hardware will now produce incorrect results if changed. To fix this, maybe k could be rounded up explicitly to a multiple of 256 in quantize_row_q8_1_cuda, before rounding it to the block size.

@JohannesGaessler (Collaborator, Author)

I see your point. How about just adding another define that controls the size to which the vector and the last row are padded? I would prefer not to increase k in quantize_row_q8_1, since I want any changes to k to be made in the same place where the corresponding memory is allocated.
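A sketch of what such a define could look like (the macro name MATRIX_ROW_PADDING and its value here are assumptions for illustration, not necessarily the identifiers the PR ends up with):

```cpp
#include <cstddef>

// Hypothetical single source of truth for the padding granularity, shared by
// the allocation sites and the quantization call so the two cannot diverge
// if the quantize kernel's block size is later tuned independently.
#define MATRIX_ROW_PADDING 256 // >= values accessed per warp by mul_mat_vec_q

static size_t padded_row_size(size_t ne) {
    return ((ne + MATRIX_ROW_PADDING - 1) / MATRIX_ROW_PADDING) * MATRIX_ROW_PADDING;
}
```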

@slaren
Copy link
Member

slaren commented Jul 8, 2023

Sure, that sounds even better.

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from 52f90f2 to 518c822 on July 8, 2023 at 14:11
@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from 518c822 to a7ce53f on July 8, 2023 at 16:22
Successfully merging this pull request may close these issues:

Generating garbage output on CUDA when GGML_CUDA_FORCE_DMMV is set to false (#2136)