
Fixed OpenLLaMA 3b CUDA mul_mat_vec_q #2144

Merged

Conversation

@JohannesGaessler (Collaborator)

Fixes #2136. The issue was that the weight tensors had row sizes that were not multiples of 128. I fixed it by padding the quantized vector and the last row of the weight tensors to a multiple of 128. This is preferable to adding bounds checks to the CUDA kernels since it has better performance.
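A minimal sketch of the padding idea, assuming a 128-element granularity as described above (the helper name and call sites are illustrative, not the PR's actual code):

```cpp
#include <cstddef>

// Illustrative helper: round a row size up to the granularity at which the
// CUDA kernel reads, so vectorized loads past the logical end of the row
// stay inside the allocation. Name and usage are assumptions.
static size_t pad_row_size(size_t ne, size_t granularity) {
    return ((ne + granularity - 1) / granularity) * granularity;
}

// e.g. allocate pad_row_size(ne00, 128) elements for the quantized vector
// and for the buffer holding the last row of each weight tensor; the tail
// is padding only and never contributes to the logical result.
```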

@slaren (Member) commented Jul 8, 2023

The generation looks fine, but compute-sanitizer still reports out of bounds accesses:

```
========= Invalid __global__ read of size 1 bytes
=========     at 0xb10 in /home/slaren/code/llama.cpp/ggml-cuda.cu:1231:vec_dot_q4_0_q8_1(const void *, const block_q8_1 *, int)
=========     by thread (16,0,0) in block (0,3199,0)
=========     Address 0xdb257e402 is out of bounds
=========     and is 3 bytes after the nearest allocation at 0xdb2000000 of size 5760000 bytes
=========     Device Frame:/home/slaren/code/llama.cpp/ggml-cuda.cu:1388:void mul_mat_vec_q<(int)32, (int)4, block_q4_0, &vec_dot_q4_0_q8_1>(const void *, const void *, float *, int, int) [0xaf0]
```

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from f437f6a to e6b7a4f on July 8, 2023 at 13:01
@JohannesGaessler (Collaborator, Author)

Thank you for pointing this out; I should have checked it. The value for QUANTIZE_BLOCK_SIZE was incorrect: it should have been 256 instead of 128. One q4 block contains 4 integers, so one warp accesses 8 blocks = 256 values at once.
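To spell out the arithmetic (the QK4_0/QI4_0 values below match ggml's q4_0 format definitions; the per-warp layout is as described above):

```cpp
constexpr int QK4_0     = 32; // quantized values per q4_0 block
constexpr int QI4_0     = 4;  // 32-bit integers of packed quants per block
constexpr int WARP_SIZE = 32; // threads per warp

// one thread per integer => a warp covers WARP_SIZE / QI4_0 = 8 blocks per pass
constexpr int blocks_per_warp = WARP_SIZE / QI4_0;       // 8
constexpr int values_per_warp = blocks_per_warp * QK4_0; // 8 * 32 = 256

static_assert(values_per_warp == 256, "padding granularity must be 256, not 128");
```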

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from e6b7a4f to 52f90f2 on July 8, 2023 at 13:32
@JohannesGaessler (Collaborator, Author)

I added another change: the padding is now memset to 0. Though unlikely, it is possible for the uninitialized memory to encode a NaN, which could make the sum over the entire row NaN.
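A hedged sketch of zeroing the padding tail (the function and parameter names here are assumptions for illustration, not the PR's identifiers):

```cpp
#include <cuda_runtime.h>

// Illustrative: zero only the padding bytes of a device buffer so that
// leftover memory cannot decode to a NaN and poison the row sum.
static void zero_padding(void * dev_buf, size_t used_bytes,
                         size_t padded_bytes, cudaStream_t stream) {
    if (padded_bytes > used_bytes) {
        cudaMemsetAsync((char *) dev_buf + used_bytes, 0,
                        padded_bytes - used_bytes, stream);
    }
}
```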

@slaren (Member) commented Jul 8, 2023

So, if I understand correctly, the code depends on the value of CUDA_QUANTIZE_BLOCK_SIZE to ensure that row sizes are always a multiple of 256. I am a bit concerned that what was previously just a parameter that could be tuned for the hardware will now produce incorrect results if changed. To fix this, maybe k could be rounded up explicitly to a multiple of 256 in quantize_row_q8_1_cuda, before rounding it to the block size.

@JohannesGaessler (Collaborator, Author)

I see your point. How about just adding another define that controls the size to which the vector and the last row are padded? I would prefer not to increase k in quantize_row_q8_1, since I want any changes to k to be made in the same place where the corresponding memory is allocated.
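A sketch of what such a define could look like (the macro name MATRIX_ROW_PADDING and its value here are assumptions for illustration, not necessarily the identifiers the PR ends up with):

```cpp
#include <cstddef>

// Hypothetical single source of truth for the padding granularity, shared by
// the allocation sites and the quantization call so the two cannot diverge
// if the quantize kernel's block size is later tuned independently.
#define MATRIX_ROW_PADDING 256 // >= values accessed per warp by mul_mat_vec_q

static size_t padded_row_size(size_t ne) {
    return ((ne + MATRIX_ROW_PADDING - 1) / MATRIX_ROW_PADDING) * MATRIX_ROW_PADDING;
}
```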

@slaren
Copy link
Member

slaren commented Jul 8, 2023

Sure, that sounds even better.

@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from 52f90f2 to 518c822 on July 8, 2023 at 14:11
@JohannesGaessler force-pushed the cuda-openllama-3b-fix branch from 518c822 to a7ce53f on July 8, 2023 at 16:22
Successfully merging this pull request may close these issues:

Generating garbage output on CUDA when GGML_CUDA_FORCE_DMMV is set to false (#2136)