After quantizing with q3_k, the resulting model is unusable. quantize itself runs without errors, but loading the quantized model fails with what looks like a tensor size / metadata mismatch:
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
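Some back-of-the-envelope math on those numbers (my own reading, not a confirmed root cause): decoder.token_embedding.weight has 384 x 51864 = 19,915,776 elements, and q3_K packs 256 elements into a 110-byte block, so the quantized tensor should occupy 19,915,776 / 256 * 110 = 8,557,560 bytes, i.e. the 8.16 MB the quantizer reports below. The "expected" value in the error is exactly 19,915,776 * 110 = 2,190,735,360, which is 110 bytes per element rather than per 256-element block. That suggests the loader's size check multiplies by the k-quant type size without dividing by the block size.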
Tested on latest master (e27fd6f). Relevant logs below.
> quantize models/ggml-tiny.en.bin models/ggml-tiny.en.q3_k 11
whisper_model_quantize: n_vocab = 51864
whisper_model_quantize: n_audio_ctx = 1500
whisper_model_quantize: n_audio_state = 384
whisper_model_quantize: n_audio_head = 6
whisper_model_quantize: n_audio_layer = 4
whisper_model_quantize: n_text_ctx = 448
whisper_model_quantize: n_text_state = 384
whisper_model_quantize: n_text_head = 6
whisper_model_quantize: n_text_layer = 4
whisper_model_quantize: n_mels = 80
whisper_model_quantize: ftype (src) = 1
whisper_model_quantize: qntvr (src) = 0
whisper_model_quantize: ftype (dst) = 2011
whisper_model_quantize: qntvr (dst) = 2
whisper_model_quantize: loading model from 'models/ggml-tiny.en.bin'
decoder.positional_embedding - [ 384, 448, 1], type = f32 size = 0.656 MB
encoder.positional_embedding - [ 384, 1500, 1], type = f32 size = 2.197 MB
decoder.token_embedding.weight - [ 384, 51864, 1], type = f16 size = 75.97 MB -> 8.16 MB
decoder.blocks.0.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.0.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
decoder.blocks.0.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.0.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.cross_attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.cross_attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.cross_attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.cross_attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.cross_attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.cross_attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.cross_attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.0.cross_attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.0.cross_attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.1.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
decoder.blocks.1.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.1.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.cross_attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.cross_attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.cross_attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.cross_attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.cross_attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.cross_attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.cross_attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.1.cross_attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.1.cross_attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.2.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
decoder.blocks.2.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.2.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.cross_attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.cross_attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.cross_attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.cross_attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.cross_attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.cross_attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.cross_attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.2.cross_attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.2.cross_attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.3.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
decoder.blocks.3.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
decoder.blocks.3.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.cross_attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.cross_attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.cross_attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.cross_attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.cross_attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.cross_attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.cross_attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.blocks.3.cross_attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
decoder.blocks.3.cross_attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
decoder.ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.conv1.weight - [ 3, 80, 384], type = f16 size = 0.176 MB
encoder.conv1.bias - [ 1, 384, 1], type = f32 size = 0.001 MB
encoder.conv2.weight - [ 3, 384, 384], type = f16 size = 0.844 MB
encoder.conv2.bias - [ 1, 384, 1], type = f32 size = 0.001 MB
encoder.blocks.0.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.0.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
encoder.blocks.0.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.0.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.0.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.0.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.0.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.0.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.0.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.1.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
encoder.blocks.1.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.1.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.1.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.1.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.1.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.1.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.1.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.2.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
encoder.blocks.2.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.2.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.2.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.2.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.2.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.2.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.2.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.mlp_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.mlp_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.mlp.0.weight - [ 384, 1536, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.3.mlp.0.bias - [ 1536, 1, 1], type = f32 size = 0.006 MB
encoder.blocks.3.mlp.2.weight - [ 1536, 384, 1], type = f16 size = 2.25 MB -> 0.24 MB
encoder.blocks.3.mlp.2.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.attn_ln.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.attn_ln.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.attn.query.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.3.attn.query.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.attn.key.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.3.attn.value.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.3.attn.value.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.blocks.3.attn.out.weight - [ 384, 384, 1], type = f16 size = 0.56 MB -> 0.06 MB
encoder.blocks.3.attn.out.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.ln_post.weight - [ 384, 1, 1], type = f32 size = 0.001 MB
encoder.ln_post.bias - [ 384, 1, 1], type = f32 size = 0.001 MB
ggml_common_quantize_0: model size = 144.04 MB
ggml_common_quantize_0: quant size = 18.98 MB | ftype = 11 (q3_K)
main: quantize time = 1125.16 ms
main: total time = 1125.16 ms
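(Note that the quantize log above already shows decoder.token_embedding.weight shrinking to 8.16 MB, which matches the block-based size computed earlier, so the quantized file itself looks plausible; the failure below seems to happen in the loader's size validation rather than in quantize.)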
> whisper-cli -f samples/jfk.wav -m models/ggml-tiny.en.q3_k
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.q3_k'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_init_with_params_no_state: devices = 1
whisper_init_with_params_no_state: backends = 1
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 11
whisper_model_load: qntvr = 2
whisper_model_load: type = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 15.36 MB
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context
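For triage, here is a minimal standalone sketch (my own code, not taken from whisper.cpp) that reproduces both sides of the arithmetic using the public ggml API (ggml_row_size and ggml_type_size from ggml.h). It shows that dividing by the q3_K block size yields the size quantize actually wrote, while skipping the division yields the 2190735360 the loader "expected":

```cpp
#include <cstdio>
#include <cstdint>
#include "ggml.h"

int main() {
    // decoder.token_embedding.weight: 384 x 51864 elements
    const int64_t nelements = 384LL * 51864;

    // correct size: 256-element q3_K blocks at 110 bytes each
    const size_t correct = ggml_row_size(GGML_TYPE_Q3_K, nelements);

    // suspected buggy arithmetic: bytes-per-block applied per element
    const size_t buggy = nelements * ggml_type_size(GGML_TYPE_Q3_K);

    printf("block-based size: %zu bytes\n", correct); // 8557560 (~8.16 MB)
    printf("per-element size: %zu bytes\n", buggy);   // 2190735360
    return 0;
}
```

I haven't traced where the got 5705095 side of the message comes from, so take the above as a pointer rather than a diagnosis.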