Name and Version
version 4585
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
./llama-quantize --imatrix /models/OLMo-2-1124-7B-Instruct-GGUF/allenai_OLMo-2-1124-7B-Instruct.imatrix /models/OLMo-2-1124-7B-Instruct-GGUF/allenai_OLMo-2-1124-7B-Instruct-f32.gguf /models/OLMo-2-1124-7B-Instruct-GGUF/allenai_OLMo-2-1124-7B-Instruct-Q5_K_M.gguf Q5_K_M
Problem description & steps to reproduce
Without an imatrix I don't get any issues.
Quantizing OLMo-2 7B to Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, and Q2_K with an imatrix results in:
blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = f32, converting to q4_K .. ggml_validate_row_data: found nan value at block 48
ggml_validate_row_data: found nan value at block 16
blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = f32, converting to q5_K .. ggml_validate_row_data: found nan value at block 48
ggml_validate_row_data: found nan value at block 16
blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = f32, converting to q2_K .. ggml_validate_row_data: found nan value at block 48
ggml_validate_row_data: found nan value at block 16
All other sizes quantize without issue.
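In case it helps with triage: one quick check is whether the imatrix file itself contains non-finite values. Here is a minimal Python sketch, assuming the legacy binary imatrix layout (a little-endian int32 entry count, then per entry a length-prefixed name, a call count, a value count, and that many float32 values); if your build writes a different format, the parser will need adjusting:

import math
import struct
import sys

def read_imatrix(path):
    """Parse a legacy-format imatrix file, yielding (name, ncall, values) per entry."""
    with open(path, "rb") as f:
        (n_entries,) = struct.unpack("<i", f.read(4))
        for _ in range(n_entries):
            (name_len,) = struct.unpack("<i", f.read(4))        # length-prefixed tensor name
            name = f.read(name_len).decode("utf-8")
            ncall, nval = struct.unpack("<ii", f.read(8))       # call count, value count
            values = struct.unpack(f"<{nval}f", f.read(4 * nval))
            yield name, ncall, values

# Usage: python check_imatrix.py allenai_OLMo-2-1124-7B-Instruct.imatrix
bad = False
for name, ncall, values in read_imatrix(sys.argv[1]):
    n_nonfinite = sum(not math.isfinite(v) for v in values)
    if n_nonfinite:
        bad = True
        print(f"{name}: {n_nonfinite} non-finite value(s) out of {len(values)}")
if not bad:
    print("no non-finite values found")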
Additionally, the 13B model fails in a different way on IQ2_M and IQ2_S:
[ 95/ 443] blk.8.attn_q.weight - [ 5120, 5120, 1, 1], type = f32, converting to iq2_xs .. /llama.cpp/ggml/src/ggml-quants.c:3279: fatal error
Oops: found point 4 not on grid: 0 1 0 0 0 0 0 0
libggml-base.so(+0x159cb)[0x72a78fe039cb]
libggml-base.so(ggml_abort+0x15f)[0x72a78fe03d6f]
libggml-base.so(+0x3bcbb)[0x72a78fe29cbb]
libggml-base.so(quantize_iq2_xs+0x81)[0x72a78fe45691]
libggml-base.so(ggml_quantize_chunk+0x371)[0x72a78fe12431]
libllama.so(+0xeaa35)[0x72a78ffaca35]
libllama.so(llama_model_quantize+0xf4)[0x72a78ffae094]
./llama-quantize(+0x17d6a)[0x60055e597d6a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x72a78f8b5d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x72a78f8b5e40]
./llama-quantize(+0x18c25)[0x60055e598c25]
All other sizes have no issues.
I've uploaded both F32 conversions as well as the imatrix files here:
https://huggingface.co/bartowski/PleaseIgnore_uploaded_for_testing
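For anyone reproducing with those files, it may also be worth ruling out non-finite values in the F32 conversions themselves, since both failures point at a single attn_q tensor. A sketch using the gguf-py package that ships in the llama.cpp repo (GGUFReader and its tensors attribute come from gguf-py; the file name matches the 7B upload, adjust for the 13B):

import numpy as np
from gguf import GGUFReader  # pip install gguf, or use gguf-py from the llama.cpp repo

reader = GGUFReader("allenai_OLMo-2-1124-7B-Instruct-f32.gguf")
for tensor in reader.tensors:
    data = np.asarray(tensor.data)
    # Only float tensors can hold NaN/inf; count any non-finite entries.
    if data.dtype.kind == "f" and not np.isfinite(data).all():
        n_bad = int(data.size - np.count_nonzero(np.isfinite(data)))
        print(f"{tensor.name}: {n_bad} non-finite value(s)")

If neither the imatrix nor the F32 tensors contain non-finite values, that would point at the quantization kernels themselves rather than the inputs.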