Quantization produces invalid decoder.token_embedding.weight in resulting file #2906

Open

@shyperson0

Description

After quantizing with q3_k, the resulting model is unusable. quantize runs without errors, but it appears to write incorrect tensor metadata to the output file, since loading it fails with:
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
Tested on latest master (e27fd6f).
Relevant logs below.

> quantize models/ggml-tiny.en.bin models/ggml-tiny.en.q3_k 11

whisper_model_quantize: n_vocab       = 51864
whisper_model_quantize: n_audio_ctx   = 1500
whisper_model_quantize: n_audio_state = 384
whisper_model_quantize: n_audio_head  = 6
whisper_model_quantize: n_audio_layer = 4
whisper_model_quantize: n_text_ctx    = 448
whisper_model_quantize: n_text_state  = 384
whisper_model_quantize: n_text_head   = 6
whisper_model_quantize: n_text_layer  = 4
whisper_model_quantize: n_mels        = 80
whisper_model_quantize: ftype (src)   = 1
whisper_model_quantize: qntvr (src)   = 0
whisper_model_quantize: ftype (dst)   = 2011
whisper_model_quantize: qntvr (dst)   = 2
whisper_model_quantize: loading model from 'models/ggml-tiny.en.bin'
                                    decoder.positional_embedding - [  384,   448,     1], type =    f32 size =    0.656 MB
                                    encoder.positional_embedding - [  384,  1500,     1], type =    f32 size =    2.197 MB
                                  decoder.token_embedding.weight - [  384, 51864,     1], type =    f16 size =    75.97 MB ->     8.16 MB
                                  decoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.0.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.0.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.0.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.0.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.0.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.1.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.1.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.1.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.1.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.1.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.2.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.2.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.2.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.2.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.2.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.3.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.3.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.3.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.3.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.3.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                               decoder.ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                                 decoder.ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.conv1.weight - [    3,    80,   384], type =    f16 size =    0.176 MB
                                              encoder.conv1.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                            encoder.conv2.weight - [    3,   384,   384], type =    f16 size =    0.844 MB
                                              encoder.conv2.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                          encoder.ln_post.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.ln_post.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
ggml_common_quantize_0: model size  =   144.04 MB
ggml_common_quantize_0: quant size  =    18.98 MB | ftype = 11 (q3_K)

main: quantize time =  1125.16 ms
main:    total time =  1125.16 ms

> whisper-cli -f samples/jfk.wav -m models/ggml-tiny.en.q3_k

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.q3_k'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 11
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =    15.36 MB
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context
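For what it's worth, the "expected" size in the error factors cleanly: 384 × 51864 = 19,915,776 elements, and multiplying by 110 bytes (the size of a q3_K super-block in ggml, which packs 256 elements) gives exactly 2,190,735,360 — the "expected" value above. Dividing by the 256-element super-block instead gives 8,557,560 bytes (~8.16 MB), which matches what the quantize log printed for this tensor. Just a guess from the numbers, but it looks like the loader may be computing the expected q3_K size as one full block per element. A quick sanity check (assuming QK_K = 256 and sizeof(block_q3_K) = 110, as in ggml):

```python
# Sanity check on the sizes reported for decoder.token_embedding.weight [384, 51864],
# assuming ggml's q3_K layout: 256-element super-blocks of 110 bytes each.
n_elements = 384 * 51864   # 19915776 elements in the tensor
block_bytes = 110          # sizeof(block_q3_K) in ggml (assumption)
qk_k = 256                 # elements packed per q3_K super-block (assumption)

# Correct on-disk size: one 110-byte block per 256 elements.
correct = n_elements // qk_k * block_bytes
# Computation that reproduces the "expected" value from the error:
# one full block counted per *element*.
buggy = n_elements * block_bytes

print(correct)  # 8557560 bytes ~= 8.16 MB, matching the quantize log
print(buggy)    # 2190735360, exactly the "expected" size in the error
```

The "got 5705095" side doesn't factor as obviously, so I won't speculate on that part.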

edit: fixed formatting :3
