Reading GGUF metadata with gguf-dump.py does not work for i-quants #5809

@countzero

Description

The gguf-dump.py script in the llama.cpp release b2297 is missing support for i-quants.

Steps to reproduce

  1. Create or download a GGUF file in any IQ* format (e.g., miqu-1-70b-Requant-b2131-iMat-c32_ch400-IQ1_S_v3.gguf)
  2. Copy the file to .\models\miqu-1-70b-sf.IQ1_S.gguf
  3. Execute the following command:
python .\gguf-py\scripts\gguf-dump.py --no-tensors .\models\miqu-1-70b-sf.IQ1_S.gguf
  4. See the error:
ValueError: 19 is not a valid GGMLQuantizationType
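
This points at the GGMLQuantizationType enum in gguf-py/gguf/constants.py, which at b2297 stops at the k-quants, while ggml.h already defines the i-quant type IDs (type 19 is IQ1_S there, matching the file above). A minimal sketch of what the extended enum could look like; the i-quant members and their values are assumptions based on enum ggml_type in ggml.h and should be verified against that header:

    from enum import IntEnum

    class GGMLQuantizationType(IntEnum):
        F32     = 0
        F16     = 1
        Q4_0    = 2
        Q4_1    = 3
        # 4 and 5 belonged to the removed Q4_2 / Q4_3 formats
        Q5_0    = 6
        Q5_1    = 7
        Q8_0    = 8
        Q8_1    = 9
        Q2_K    = 10
        Q3_K    = 11
        Q4_K    = 12
        Q5_K    = 13
        Q6_K    = 14
        Q8_K    = 15
        # i-quants missing at b2297 (assumed to mirror ggml.h;
        # later releases add further IQ types after these)
        IQ2_XXS = 16
        IQ2_XS  = 17
        IQ3_XXS = 18
        IQ1_S   = 19  # the type ID from the ValueError above
        IQ4_NL  = 20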

Expected behaviour

I expect the Python gguf-py library to support every quantization type that a GGUF file can contain, including the i-quants.

Working example for k-quants:

python .\gguf-py\scripts\gguf-dump.py --no-tensors .\models\miqu-1-70b-sf.Q5_K_M.gguf
* Loading: .\models\miqu-1-70b-sf.Q5_K_M.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 26 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 723
      3: UINT64     |        1 | GGUF.kv_count = 23
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'R:\\AI\\LLM\\source'
      6: UINT32     |        1 | llama.context_length = 32764
      7: UINT32     |        1 | llama.embedding_length = 8192
      8: UINT32     |        1 | llama.block_count = 80
      9: UINT32     |        1 | llama.feed_forward_length = 28672
     10: UINT32     |        1 | llama.rope.dimension_count = 128
     11: UINT32     |        1 | llama.attention.head_count = 64
     12: UINT32     |        1 | llama.attention.head_count_kv = 8
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: FLOAT32    |        1 | llama.rope.freq_base = 1000000.0
     15: UINT32     |        1 | general.file_type = 17
     16: STRING     |        1 | tokenizer.ggml.model = 'llama'
     17: [STRING]   |    32000 | tokenizer.ggml.tokens
     18: [FLOAT32]  |    32000 | tokenizer.ggml.scores
     19: [INT32]    |    32000 | tokenizer.ggml.token_type
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     21: UINT32     |        1 | tokenizer.ggml.eos_token_id = 2
     22: UINT32     |        1 | tokenizer.ggml.padding_token_id = 0
     23: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     24: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     25: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"
     26: UINT32     |        1 | general.quantization_version = 2

Use-Case

I am extracting the metadata from a given GGUF model to automatically calculate optimal runtime arguments for the llama.cpp server in the following PowerShell script: https://github.com/countzero/windows_llama.cpp/blob/v1.12.0/examples/server.ps1#L104
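
Once the enum knows the i-quant IDs, the same extraction can be done directly in Python with gguf-py's GGUFReader, which the dump script uses internally. A small sketch; the "last part of a field holds the scalar value" detail follows how gguf-dump.py reads fields and is an assumption worth double-checking against gguf_reader.py:

    # read_runtime_hints.py -- sketch: pull a few scalar metadata fields
    # with gguf.GGUFReader to derive server arguments. Assumes the i-quant
    # enum fix sketched above is in place; otherwise GGUFReader raises the
    # same ValueError while parsing the tensor types of an IQ* file.
    from gguf import GGUFReader

    def read_scalar(reader: GGUFReader, key: str) -> int:
        # For scalar fields, the last part of the field holds the value
        # (the same access pattern gguf-dump.py uses).
        return int(reader.fields[key].parts[-1][0])

    reader = GGUFReader(r'.\models\miqu-1-70b-sf.IQ1_S.gguf')
    n_ctx = read_scalar(reader, 'llama.context_length')
    n_kv  = read_scalar(reader, 'llama.attention.head_count_kv')
    print(f'trained context: {n_ctx}, KV heads: {n_kv}')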

Question

@ggerganov Is there another way to dump only the metadata from a given GGUF model? Perhaps this could be an --inspect option of the gguf binary?
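
For what it's worth, the metadata block sits at the front of a GGUF file, before any tensor data, so it can be dumped without decoding quantization types at all. Below is a minimal standalone sketch that parses only the header and key/value section; the field layout and value-type codes follow the GGUF v2/v3 specification, and the script name is made up:

    # gguf_metadata_only.py -- sketch of a dependency-free metadata dump.
    # Parses the GGUF header and KV section per the GGUF v2/v3 spec and
    # stops before the tensor info, so unknown tensor types cannot break it.
    import struct
    import sys

    def read_str(f):
        (n,) = struct.unpack('<Q', f.read(8))
        return f.read(n).decode('utf-8', errors='replace')

    def read_value(f, vtype):
        # Scalar value types per the GGUF spec: code -> (struct format, size)
        simple = {
            0: ('<B', 1), 1: ('<b', 1),    # UINT8, INT8
            2: ('<H', 2), 3: ('<h', 2),    # UINT16, INT16
            4: ('<I', 4), 5: ('<i', 4),    # UINT32, INT32
            6: ('<f', 4), 7: ('<?', 1),    # FLOAT32, BOOL
            10: ('<Q', 8), 11: ('<q', 8),  # UINT64, INT64
            12: ('<d', 8),                 # FLOAT64
        }
        if vtype in simple:
            fmt, size = simple[vtype]
            return struct.unpack(fmt, f.read(size))[0]
        if vtype == 8:   # STRING
            return read_str(f)
        if vtype == 9:   # ARRAY: item type (uint32), count (uint64), items
            (itype,) = struct.unpack('<I', f.read(4))
            (count,) = struct.unpack('<Q', f.read(8))
            return [read_value(f, itype) for _ in range(count)]
        raise ValueError(f'unknown GGUF value type {vtype}')

    def dump_metadata(path):
        with open(path, 'rb') as f:
            if f.read(4) != b'GGUF':
                sys.exit('not a GGUF file')
            version, n_tensors, n_kv = struct.unpack('<IQQ', f.read(20))
            print(f'GGUF v{version}: {n_tensors} tensor(s), {n_kv} KV pair(s)')
            for _ in range(n_kv):
                key = read_str(f)
                (vtype,) = struct.unpack('<I', f.read(4))
                value = read_value(f, vtype)
                if isinstance(value, list) and len(value) > 8:
                    value = f'[{len(value)} items]'
                print(f'{key} = {value}')

    if __name__ == '__main__':
        dump_metadata(sys.argv[1])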
