Fixed OpenLLaMA 3b CUDA mul_mat_vec_q #2144
Conversation
The generation looks fine, but …

Force-pushed from f437f6a to e6b7a4f
Thank you for pointing this out; I should have checked it. The value for …

Force-pushed from e6b7a4f to 52f90f2
I added another change: the padding is now memset to 0. Though unlikely, it is possible for the unset memory to encode a NaN, which could make the sum over the entire row NaN.
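To illustrate the concern, here is a minimal sketch of the idea (not the PR's actual code; the buffer name, sizes, and byte-level treatment are assumptions):

```c
// Hypothetical sketch: allocate a quantized row padded up to a multiple of
// 128, then zero the padding so it cannot hold an arbitrary bit pattern
// (e.g. one that decodes to NaN and poisons the sum over the row).
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const size_t row_size    = 8640; // example row length that is not a multiple of 128
    const size_t padding     = 128;
    const size_t padded_size = (row_size + padding - 1) / padding * padding; // 8704

    uint8_t * d_vec = NULL;
    cudaMalloc((void **) &d_vec, padded_size);

    // ... copy the real row_size bytes of quantized data into d_vec here ...

    // Zero only the padding region; a kernel that sums over padded_size
    // elements now adds exact zeros instead of uninitialized memory.
    cudaMemset(d_vec + row_size, 0, padded_size - row_size);

    printf("padded %zu -> %zu bytes\n", row_size, padded_size);
    cudaFree(d_vec);
    return 0;
}
```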
So, if I understand correctly, the code depends on the value of …
I see your point. How about just adding another define that controls the size to which the vector and the last row are extended? I would prefer not to increase …
Sure, that sounds even better.
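A sketch of what such a dedicated define might look like (the name MATRIX_ROW_PADDING, its value, and the helper function are assumptions, not necessarily what the PR added):

```c
// Hypothetical padding knob, deliberately separate from the kernel's
// per-block work size so the padding can change without increasing
// the block size the kernels were tuned for.
#define MATRIX_ROW_PADDING 128  // assumed name and value

// Round a row length up to the next multiple of MATRIX_ROW_PADDING.
static size_t padded_row_size(size_t ne) {
    return (ne + MATRIX_ROW_PADDING - 1) / MATRIX_ROW_PADDING * MATRIX_ROW_PADDING;
}
```

Keeping the padding amount as its own knob means the over-allocation can grow or shrink independently of the tile size the dot-product kernels process per iteration.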
Force-pushed from 52f90f2 to 518c822

Force-pushed from 518c822 to a7ce53f
Fixes #2136. The issue was that the weight tensors have row sizes that are not multiples of 128. I fixed it by padding the quantized vector and the last row of the weight tensors to a multiple of 128. This is preferable to adding checks to the CUDA kernels since it gives better performance.
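As a worked example (the 8640 value is assumed here from OpenLLaMA 3b's reported FFN dimension):

```c
// 8640 is not a multiple of 128 (8640 / 128 = 67.5), so the last row is
// rounded up to the next multiple: ((8640 + 127) / 128) * 128 = 68 * 128 = 8704.
size_t padded = (8640 + 128 - 1) / 128 * 128;  // -> 8704
```

With the row over-allocated like this, the kernels can always read full 128-value chunks, which is where the performance advantage over per-iteration bounds checks comes from.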