
Qwen 3 and Qwen2-VL Support on llama #98

@TheOneWhoWill

Description


I'm trying to load the 8-bit (Q8_0) GGUF of Qwen3 1.7B Instruct, but the package panics internally:

2025-08-12T04:50:17.688676Z  INFO llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = qwen3      
2025-08-12T04:50:17.688805Z  INFO llama.cpp: llama_model_loader: - kv   1:                               general.type str              = model      
2025-08-12T04:50:17.688940Z  INFO llama.cpp: llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B Instruct
2025-08-12T04:50:17.689118Z  INFO llama.cpp: llama_model_loader: - kv   3:                           general.finetune str              = Instruct   
2025-08-12T04:50:17.689212Z  INFO llama.cpp: llama_model_loader: - kv   4:                           general.basename str              = Qwen3      
2025-08-12T04:50:17.689365Z  INFO llama.cpp: llama_model_loader: - kv   5:                         general.size_label str              = 1.7B       
2025-08-12T04:50:17.689461Z  INFO llama.cpp: llama_model_loader: - kv   6:                          qwen3.block_count u32              = 28
2025-08-12T04:50:17.689538Z  INFO llama.cpp: llama_model_loader: - kv   7:                       qwen3.context_length u32              = 40960      
2025-08-12T04:50:17.689661Z  INFO llama.cpp: llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 2048       
2025-08-12T04:50:17.689811Z  INFO llama.cpp: llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 6144       
2025-08-12T04:50:17.689945Z  INFO llama.cpp: llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 16
2025-08-12T04:50:17.690053Z  INFO llama.cpp: llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
2025-08-12T04:50:17.690192Z  INFO llama.cpp: llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
2025-08-12T04:50:17.690333Z  INFO llama.cpp: llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001   
2025-08-12T04:50:17.690467Z  INFO llama.cpp: llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128        
2025-08-12T04:50:17.690596Z  INFO llama.cpp: llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128        
2025-08-12T04:50:17.690698Z  INFO llama.cpp: llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2       
2025-08-12T04:50:17.690811Z  INFO llama.cpp: llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2      
2025-08-12T04:50:17.826803Z  INFO llama.cpp: llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2025-08-12T04:50:17.858514Z  INFO llama.cpp: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2025-08-12T04:50:18.002530Z  INFO llama.cpp: llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2025-08-12T04:50:18.003067Z  INFO llama.cpp: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
- type  f32:  113 tensors
2025-08-12T04:50:18.005749Z  INFO llama.cpp: llama_model_loader: - type q8_0:  197 tensors
2025-08-12T04:50:18.006060Z ERROR llama.cpp: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3'
2025-08-12T04:50:18.006218Z ERROR llama.cpp: llama_load_model_from_file: failed to load model

thread 'main' panicked at C:\Users\norik\OneDrive\Desktop\Projects\qlerk\src-tauri\src\lib.rs:75:6:
Failed to create LLM: LlamaError(LlamaInternalError)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
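The panic itself comes from unwrapping the load result at `lib.rs:75`. Until the architecture is supported, the failure can at least be surfaced gracefully instead of aborting the app. A minimal sketch, using hypothetical stand-ins for the binding's error type and loader (the real crate's names and signatures will differ):

```rust
// Hypothetical stand-ins for the binding's types; the real crate's
// error enum and model-loading function are likely named differently.
#[derive(Debug)]
enum LlamaError {
    LlamaInternalError,
}

fn create_llm(path: &str) -> Result<String, LlamaError> {
    // The real loader returns an internal error when llama.cpp rejects
    // the GGUF architecture; simulated here for an unsupported file.
    if path.to_lowercase().contains("qwen3") {
        Err(LlamaError::LlamaInternalError)
    } else {
        Ok(format!("model loaded from {path}"))
    }
}

fn main() {
    // Matching on the Result (rather than `.expect("Failed to create LLM")`)
    // keeps the app alive and lets the UI report the unsupported model.
    match create_llm("Qwen3-1.7B-Instruct-Q8_0.gguf") {
        Ok(m) => println!("{m}"),
        Err(e) => eprintln!("Failed to create LLM: {e:?}"),
    }
}
```

In a Tauri setup hook the same pattern would propagate the error to the frontend rather than killing the main thread.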

Alternatively, when I try to load jina-embeddings-v4-text-retrieval-GGUF, which is based on qwen2.5-vl-3b-instruct, I get a similar unsupported-architecture error:

2025-08-12T05:00:46.839192Z  INFO llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
2025-08-12T05:00:46.839273Z  INFO llama.cpp: llama_model_loader: - kv   1:                               general.type str              = model
2025-08-12T05:00:46.839358Z  INFO llama.cpp: llama_model_loader: - kv   2:                               general.name str              = Jev4 Text Retrieval
2025-08-12T05:00:46.839572Z  INFO llama.cpp: llama_model_loader: - kv   3:                         general.size_label str              = 3.1B       
2025-08-12T05:00:46.839713Z  INFO llama.cpp: llama_model_loader: - kv   4:                        qwen2vl.block_count u32              = 36
2025-08-12T05:00:46.839859Z  INFO llama.cpp: llama_model_loader: - kv   5:                     qwen2vl.context_length u32              = 128000     
2025-08-12T05:00:46.840071Z  INFO llama.cpp: llama_model_loader: - kv   6:                   qwen2vl.embedding_length u32              = 2048       
2025-08-12T05:00:46.840253Z  INFO llama.cpp: llama_model_loader: - kv   7:                qwen2vl.feed_forward_length u32              = 11008      
2025-08-12T05:00:46.840437Z  INFO llama.cpp: llama_model_loader: - kv   8:               qwen2vl.attention.head_count u32              = 16
2025-08-12T05:00:46.840588Z  INFO llama.cpp: llama_model_loader: - kv   9:            qwen2vl.attention.head_count_kv u32              = 2
2025-08-12T05:00:46.840739Z  INFO llama.cpp: llama_model_loader: - kv  10:                     qwen2vl.rope.freq_base f32              = 1000000.000000
2025-08-12T05:00:46.841019Z  INFO llama.cpp: llama_model_loader: - kv  11:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001   
2025-08-12T05:00:46.841254Z  INFO llama.cpp: llama_model_loader: - kv  12:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
2025-08-12T05:00:46.841497Z  INFO llama.cpp: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2       
2025-08-12T05:00:46.841658Z  INFO llama.cpp: llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2      
2025-08-12T05:00:46.976960Z  INFO llama.cpp: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2025-08-12T05:00:47.004920Z  INFO llama.cpp: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2025-08-12T05:00:47.138014Z  INFO llama.cpp: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2025-08-12T05:00:47.138306Z  INFO llama.cpp: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
2025-08-12T05:00:47.138475Z  INFO llama.cpp: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151645     
2025-08-12T05:00:47.138670Z  INFO llama.cpp: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
2025-08-12T05:00:47.138787Z  INFO llama.cpp: llama_model_loader: - kv  21:                          general.file_type u32              = 15
2025-08-12T05:00:47.138860Z  INFO llama.cpp: llama_model_loader: - kv  22:                      quantize.imatrix.file str              = imatrix-retrieval-512.dat
2025-08-12T05:00:47.139019Z  INFO llama.cpp: llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
2025-08-12T05:00:47.139186Z  INFO llama.cpp: llama_model_loader: - kv  24:             quantize.imatrix.entries_count u32              = 252        
2025-08-12T05:00:47.139510Z  INFO llama.cpp: llama_model_loader: - kv  25:              quantize.imatrix.chunks_count u32              = 225        
2025-08-12T05:00:47.139743Z  INFO llama.cpp: llama_model_loader: - type  f32:  181 tensors
2025-08-12T05:00:47.139902Z  INFO llama.cpp: llama_model_loader: - type q4_K:  216 tensors
2025-08-12T05:00:47.140094Z  INFO llama.cpp: llama_model_loader: - type q6_K:   37 tensors
2025-08-12T05:00:47.140657Z ERROR llama.cpp: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen2vl'
2025-08-12T05:00:47.140781Z ERROR llama.cpp: llama_load_model_from_file: failed to load model

This is odd, because I saw here that this architecture has been supported in llama.cpp since December 2024.
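Since both failures only show up after the full tensor listing, one workaround is to read `general.architecture` straight out of the GGUF header and reject unsupported models up front. A minimal sketch of the GGUF v3 layout (magic, version, tensor count, kv count, then key/value pairs); `fake_qwen3_header` is a synthetic stand-in for a real file, and a full reader would walk every kv pair and value type:

```rust
// Probe the `general.architecture` string from a GGUF header so the app
// can check it against a supported-architecture list before loading.
fn read_architecture(data: &[u8]) -> Option<String> {
    let mut pos = 0usize;
    // Magic: the four ASCII bytes "GGUF".
    if data.get(pos..pos + 4)? != b"GGUF" {
        return None;
    }
    pos += 4 + 4 + 8 + 8; // skip magic, version (u32), tensor and kv counts (u64 each)

    // First kv pair: key is a u64 little-endian length followed by UTF-8 bytes.
    let key_len = u64::from_le_bytes(data.get(pos..pos + 8)?.try_into().ok()?) as usize;
    pos += 8;
    let key = std::str::from_utf8(data.get(pos..pos + key_len)?).ok()?;
    pos += key_len;
    if key != "general.architecture" {
        return None; // a full reader would keep scanning the remaining kv pairs
    }

    // Value type 8 = string, encoded as another u64 length plus bytes.
    if u32::from_le_bytes(data.get(pos..pos + 4)?.try_into().ok()?) != 8 {
        return None;
    }
    pos += 4;
    let val_len = u64::from_le_bytes(data.get(pos..pos + 8)?.try_into().ok()?) as usize;
    pos += 8;
    let val = std::str::from_utf8(data.get(pos..pos + val_len)?).ok()?;
    Some(val.to_owned())
}

// Synthetic header mimicking the start of the qwen3 file from the log above.
fn fake_qwen3_header() -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GGUF");
    buf.extend_from_slice(&3u32.to_le_bytes());   // version
    buf.extend_from_slice(&310u64.to_le_bytes()); // tensor count (113 f32 + 197 q8_0)
    buf.extend_from_slice(&22u64.to_le_bytes());  // metadata kv count
    let key = b"general.architecture";
    buf.extend_from_slice(&(key.len() as u64).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(&8u32.to_le_bytes());   // value type 8 = string
    let val = b"qwen3";
    buf.extend_from_slice(&(val.len() as u64).to_le_bytes());
    buf.extend_from_slice(val);
    buf
}

fn main() {
    println!("{:?}", read_architecture(&fake_qwen3_header())); // prints Some("qwen3")
}
```

In practice only the first few kilobytes of the file need to be read, which makes this a cheap pre-flight check before handing the path to the loader.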
