imatrix: calculate activation-based statistics for new format (GGUF) imatrices #14891

Draft: wants to merge 40 commits into master
Changes from all commits (40 commits)
09bc7c2
Use activations to calculate the stats
EAddario Jul 26, 2025
2097f03
Refactor variable names
EAddario Jul 31, 2025
78ddb47
Fix problem up when GGUF does not have in_sum
EAddario Aug 2, 2025
9744a4a
Determine calculation mode
EAddario Aug 2, 2025
cce514a
Compute entropy for activations
EAddario Aug 2, 2025
b7fb362
Compute cosine similarity based on activations
EAddario Aug 2, 2025
9b841eb
Compute l2 norm
EAddario Aug 2, 2025
ee2509f
Adjust threshold
EAddario Aug 2, 2025
fc8f925
Update table display
EAddario Aug 2, 2025
4c01f51
Remove inactive
EAddario Aug 2, 2025
a32a2ec
Reformat report layout
EAddario Aug 2, 2025
4d1325e
Refactor variables
EAddario Aug 3, 2025
5324558
Update table layout
EAddario Aug 3, 2025
fce05aa
Refactor lambda into compute_tensor_averages() function
EAddario Aug 3, 2025
be60469
Refactor function names
EAddario Aug 3, 2025
a6155a8
Add compute_layer_statistics() function
EAddario Aug 3, 2025
2117c4e
Update aggregated statistic report layout
EAddario Aug 3, 2025
90cb1be
Minor cosmetic changes
EAddario Aug 3, 2025
f1c2a4c
Fix printing l2 norm when calc_mode = 1
EAddario Aug 3, 2025
c39c4e2
Refactor variable name
EAddario Aug 4, 2025
adbff66
Merge branch 'master' into imatrix
EAddario Aug 4, 2025
5e40cf4
Do not resize if in_sum is null
EAddario Aug 4, 2025
b373934
Compute aggregated (per layer) l2 norm
EAddario Aug 5, 2025
906548a
Update aggregated sum of squared activations per layer
EAddario Aug 5, 2025
aea9b31
Make ZD Score two-tailed
EAddario Aug 5, 2025
49996a1
Refactor variable names
EAddario Aug 5, 2025
4c3fea8
Update report layout
EAddario Aug 5, 2025
88854c9
Refactor legacy mode
EAddario Aug 5, 2025
030ed3c
Merge branch 'master' into imatrix
EAddario Aug 5, 2025
c7959ed
Merge branch 'master' into imatrix
EAddario Aug 7, 2025
3e9d53c
Refactor variable names
EAddario Aug 7, 2025
e0d6471
Reverse conditional logic to match convention
EAddario Aug 7, 2025
dadd90e
Rename report heading
EAddario Aug 7, 2025
5bb2def
Add --activation-statistics parameter
EAddario Aug 7, 2025
c5ecdaa
Add Euclidean–Cosine Score (ECS)
EAddario Aug 7, 2025
59af503
Update README.md
EAddario Aug 9, 2025
9467963
Merge branch 'master' into imatrix
EAddario Aug 9, 2025
6fe51e1
Fix typo in ECS formula
EAddario Aug 9, 2025
dcac206
Add --activation-statistics logic to avoid doubling the imatrix size …
EAddario Aug 9, 2025
89051cd
Update README.md
EAddario Aug 9, 2025
7 changes: 7 additions & 0 deletions common/arg.cpp
@@ -2707,6 +2707,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.show_statistics = true;
}
).set_examples({LLAMA_EXAMPLE_IMATRIX}));
+add_opt(common_arg(
+    {"--activation-statistics"},
+    string_format("generate data to compute activation-based statistics (default: %s)", params.activation_statistics ? "true" : "false"),
+    [](common_params & params) {
+        params.activation_statistics = true;
+    }
+).set_examples({LLAMA_EXAMPLE_IMATRIX}));
add_opt(common_arg(
{"--parse-special"},
string_format("parse special tokens (chat, tool, etc) (default: %s)", params.parse_special ? "true" : "false"),
9 changes: 5 additions & 4 deletions common/common.h
@@ -443,10 +443,11 @@ struct common_params {
int32_t i_chunk = 0; // start processing from this chunk
int8_t imat_dat = 0; // whether the legacy imatrix.dat format should be output (gguf <= 0 < dat)

-bool process_output = false; // collect data for the output tensor
-bool compute_ppl = true; // whether to compute perplexity
-bool show_statistics = false; // show imatrix statistics per tensor
-bool parse_special = false; // whether to parse special tokens during imatrix tokenization
+bool process_output = false; // collect data for the output tensor
+bool compute_ppl = true; // whether to compute perplexity
+bool show_statistics = false; // show imatrix statistics per tensor
+bool activation_statistics = false; // generate data to calculate activation based statistics
+bool parse_special = false; // whether to parse special tokens during imatrix tokenization

// cvector-generator params
int n_pca_batch = 100;
44 changes: 26 additions & 18 deletions tools/imatrix/README.md
@@ -10,7 +10,7 @@ More information is available in <https://github.com/ggml-org/llama.cpp/pull/486
-m model.gguf -f some-text.txt [-o imatrix.gguf] [--output-format {gguf,dat}] [--no-ppl] \
[--process-output] [--chunk 123] [--save-frequency 0] [--output-frequency 10] \
[--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] [--parse-special] \
-[--show-statistics] [...]
+[--activation-statistics] [--show-statistics] [...]
```

Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (such as e.g. `wiki.train.raw`) are mandatory.
@@ -20,19 +20,20 @@ The parameters in square brackets are optional and have the following meaning:
* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing `imatrix.gguf` is used.
* `-ofreq | --output-frequency` specifies how often the so far computed result is saved to disk. Default is 10 (i.e., every 10 chunks)
-* `--output-format` specifies the output format of the generated imatrix file. Either "gguf", or "dat" (the legacy format). Defaults to "gguf".
+* `--output-format` specifies the output format of the generated imatrix file. Either `gguf` or `dat` (the legacy format). Defaults to `gguf`.
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never)
* `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
* `--chunk | --from-chunk` to skip the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
-* `--chunks` maximum number of chunks to process. Default is -1 for all available chunks.
+* `--chunks` maximum number of chunks to process. Default is `-1` for all available chunks.
* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
* `--show-statistics` displays imatrix file's statistics.
* `--activation-statistics` enables the collection of activation statistics for each tensor. If set, the imatrix file size will double, but reported statistics will be more accurate.

For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.

-Recent versions of `llama-imatrix` store data in GGUF format by default. For the legacy format, use an extension other than `.gguf` when saving the output file. More information is available in <https://github.com/ggml-org/llama.cpp/pull/9400>.
+Versions **b5942** and newer of `llama-imatrix` store data in GGUF format by default. For the legacy format, use `--output-format dat` when saving the output file. More information is available in <https://github.com/ggml-org/llama.cpp/pull/9400>.

## Examples

@@ -69,30 +69,37 @@ Recent versions of `llama-imatrix` store data in GGUF format by default. For the
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special
```

```bash
# generate imatrix and enable activation-based statistics
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --activation-statistics -ngl 99
```

```bash
# analyse imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.gguf --show-statistics
```

`--show-statistics` will display the following statistics:
## Statistics

Beginning with version <bwxyz>, `--show-statistics` has two modes. If `--activation-statistics` was used at imatrix creation time and `--output-format` was set to `gguf`, it reports precise statistics. Otherwise, it reports less accurate, albeit still useful, metrics based on average squared activations.

#### Per tensor

-* Σ(Act²): sum of all squared activations (the importance scores)
-* Min & Max: minimum and maximum squared activations values
-* μ & σ: Squared activations' mean and standard deviation
-* % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
-* N: number of squared activations
-* Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy measurement) $S = -\sum_{i=1}^N p_i \log_2 p_i$
-* E (norm): Normalized entropy. $E(norm)=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
-* ZD Score: z-score distribution as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415)
-* CosSim: cosine similarity with respect to the previous layer's tensor. Useful to determine how similar the squared activations of the current layer are to the previous layer's squared activations.
+* **Σ(Act²)** *(legacy mode)* / **L₂ Norm** *(preferred)*: In legacy mode, the raw sum of squared activations (sum of `Act²`). In preferred mode, the Euclidean Distance (L₂ Norm) between this tensor's average activations and those of the previous layer.
+* **Min / Max / μ / σ**: Minimum, maximum, mean, and standard deviation of the tensor's elements.
+* **N**: Number of tensor elements considered.
+* **H Norm**: Shannon Entropy normalized over log₂(N), defined as $H_{norm}=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. Used to determine how well a prompt "exercises" the model's capabilities.
+* **H** *(legacy mode)* / **ECS** *(preferred)*: If legacy, Shannon Entropy, defined as $H = -\sum_{i=1}^N p_i \log_2 p_i$. If preferred, the *Euclidean-Cosine Score* between this tensor's elements and those of the previous layer, defined as $ECS = K \cdot e^{-\alpha a} \cdot |b|^{\gamma}$ with `a = L₂ Norm`, `b = Cosine Similarity`, `α = 0.01`, and `γ = 10`. A higher score indicates greater similarity and less change between layers.
+* **ZD**: % of elements whose Z-score magnitude exceeds 1.0 (an indicator of outliers), as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415).
+* **CosSim**: Cosine Similarity between this tensor's elements and those of the previous layer.
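As a rough illustration of how the entropy and ECS figures relate to the stored values, here is a minimal C++ sketch. This is not the tool's actual implementation: the helper names are hypothetical, and the scaling constant `K = 100` is an assumption (only `α = 0.01` and `γ = 10` come from the description above).

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Shannon entropy (in bits) of the distribution obtained by normalizing
// the squared activations into probabilities p_i = v_i / sum(v).
double entropy_bits(const std::vector<double> & act2) {
    const double total = std::accumulate(act2.begin(), act2.end(), 0.0);
    double h = 0.0;
    for (const double v : act2) {
        const double p = v / total;
        if (p > 0.0) {
            h -= p * std::log2(p);
        }
    }
    return h; // normalized entropy is h / log2(act2.size())
}

// Euclidean-Cosine Score: ECS = K * exp(-alpha * a) * |b|^gamma,
// where a is the L2 distance and b the cosine similarity to the
// previous layer. K = 100 is a hypothetical scaling constant.
double ecs(const double l2_dist, const double cos_sim, const double K = 100.0,
           const double alpha = 0.01, const double gamma = 10.0) {
    return K * std::exp(-alpha * l2_dist) * std::pow(std::fabs(cos_sim), gamma);
}
```

A uniform tensor (all squared activations equal) yields the maximum entropy `log2(N)`, i.e. a normalized entropy of 1, while identical consecutive layers (`a = 0`, `b = 1`) yield the maximum ECS of `K`.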

#### Per layer

-Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated.
+Aggregated metrics per block/layer:

-#### Important note on the computed Statistics
+* **Σ(Act²)** *(legacy mode)* / **L₂ Norm** *(preferred)*: In legacy mode, the sum of squared activations (sum of Act²) over the layer's concatenated tensors. In preferred mode, the Euclidean Distance (L₂ Norm) between this layer's average concatenated tensor activations and those of the previous layer.
+* **ZD**: % of this layer's concatenated tensors' elements with |Z| > 1.
+* **CosSim**: Cosine Similarity between this layer's concatenated tensors' elements and the previous layer's.
+* **ECS** *(preferred only)*: Euclidean-Cosine Score applied to the layer.
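For the ZD figure, a naive sketch of the computation (a hypothetical helper, not the code in `imatrix.cpp`) could look like:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Percentage of elements whose Z-score magnitude exceeds 1.0,
// i.e. elements lying more than one standard deviation from the mean.
double zd_score(const std::vector<double> & x) {
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (const double v : x) mean += v;
    mean /= n;
    double var = 0.0;
    for (const double v : x) var += (v - mean) * (v - mean);
    const double sd = std::sqrt(var / n);
    if (sd == 0.0) return 0.0; // constant tensor: no outliers
    std::size_t outliers = 0;
    for (const double v : x) {
        if (std::fabs(v - mean) > sd) outliers++;
    }
    return 100.0 * outliers / n;
}
```

For example, `{0, 0, 0, 0, 10}` has mean 2 and standard deviation 4, so only the `10` lies more than one standard deviation out, giving a ZD of 20%.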

When using these statistics, please note that they are computed on the squared activations, **not on the actual (raw) activations**.
Whilst the results are still useful, they are less reliable than using the raw values and, in the case of the cosine similarity, could be misleading if the tensor contains opposite vectors.
More information is available in https://github.com/ggml-org/llama.cpp/pull/14891
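The cosine-similarity caveat can be seen with a toy example: squaring discards the sign, so two exactly opposite activation vectors, whose raw cosine similarity is -1, become indistinguishable once squared (a sketch, not the tool's code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain cosine similarity between two equal-length vectors.
double cos_sim(const std::vector<double> & a, const std::vector<double> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Element-wise square, as stored when only squared activations are kept.
std::vector<double> squared(const std::vector<double> & v) {
    std::vector<double> out(v.size());
    for (std::size_t i = 0; i < v.size(); i++) out[i] = v[i] * v[i];
    return out;
}
```

Raw vectors `{1, -2, 3}` and `{-1, 2, -3}` give a cosine similarity of -1, but their squares `{1, 4, 9}` and `{1, 4, 9}` give 1, wrongly suggesting the layers are identical.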