
quantize : configurable neutral imatrix prior #15060

Draft · compilade wants to merge 4 commits into master

Conversation

compilade (Collaborator) commented Aug 3, 2025

Follow-up from #9400 (comment) to allow experimenting with different weights for the neutral prior with GGUF-based imatrix.

This should help avoid some NaNs and/or unstable quantization with MoE models that have rarely-used experts.

Basically, imatrix weights are per-channel averages of squared activations. This new feature inserts 1 (the neutral weight value when no imatrix is provided) into that average with a configurable weight (in the sense of a weighted average).

The default is 1 (on master, this was technically 0), which means the neutral weight is worth as much as a single token from the calibration dataset.
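A minimal sketch of that weighted average (the helper and argument names below are placeholders, not the PR's code), assuming sum_sq is the accumulated sum of squared activations for one channel and count is the number of tokens seen:

static float channel_weight(float sum_sq, float count, float prior_weight) {
    // weighted average of the observed mean squared activation (sum_sq / count)
    // and the neutral value 1.0f, where prior_weight is the number of
    // "neutral tokens" mixed into the average
    return (sum_sq + 1.0f * prior_weight) / (count + prior_weight);
}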

This only works with GGUF-based imatrix files, because they store per-expert activation counts (unlike the imatrix.dat format).

What I don't know is if it would be better to use some different value than 1 token, so I've made it configurable with --prior-weight to make it easier to test different values. ("prior weight" might not be an intuitive name; suggestions welcome. "neutral tokens", maybe?)

Example usage:

$ ./llama-quantize --imatrix imatrix.gguf --prior-weight 128 model-F16.gguf model-Q4_K_M.gguf q4_k_m

When --prior-weight is not specified, --prior-weight 1 is implied.

To get the same behavior as before this PR, --prior-weight 0 can be used.

TODO

  • Make sure to read the contributing guidelines before submitting a PR

On Aug 3, 2025, @compilade added the labels: generation quality (Quality of model output), research 🔬, and need feedback (Testing and feedback with results are needed).
CISC (Collaborator) commented Aug 3, 2025

I remember this one being horribly broken: https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

Edit: Here's an imatrix with NaNs: https://huggingface.co/legraphista/Qwen2-57B-A14B-Instruct-IMat-GGUF

bartowski1182 (Contributor) commented Aug 3, 2025

I have been having a horrible time with the latest Qwen3 30B. I've tried 10 MB diverse datasets, even a custom one I made last time that really forced diversity, and no dice with any of them.

I tried a ton of datasets and they all failed, even on Q5_K.

I'll try to organize them and upload them tomorrow for reference; I'm running another test tonight.

jukofyork (Collaborator) commented Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants); as long as all rows are a multiple of 256, any that get no samples during imatrix creation "should" end up with equal weights.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here:

ikawrakow/ik_llama.cpp#140

weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j])

so using an equal weight here for the MoE tensors with no samples simplifies to:

weight[j] = 1 * sqrtf(sigma2 + xb[j]*xb[j]) = sqrtf(sigma2 + xb[j]*xb[j])

rather than:

weight[j] = xb[j]*xb[j]

where:

const float * xbl = x + QK_K*ibl;                       // start of the current super-block
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];  // sum of squares over the super-block
float sigma2 = 2*sumx2/QK_K;                            // twice the mean squared value of the super-block

If using qw[j] = 1 is still causing problems for MoE models, then the problem can only lie here.
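For reference, a self-contained consolidation of the snippets above (my own framing; the actual k-quants code computes sigma2 per super-block and applies the weights per sub-block):

#include <cmath>
#define QK_K 256

// weights for one super-block xbl[0..QK_K-1]; qw is the imatrix data for this block (or NULL)
static void block_weights(const float * xbl, const float * qw, float * weight) {
    float sumx2 = 0;
    for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
    const float sigma2 = 2*sumx2/QK_K;
    for (int j = 0; j < QK_K; ++j) {
        if (qw) {
            // with qw[j] == 1 this reduces to sqrtf(sigma2 + xbl[j]*xbl[j])
            weight[j] = qw[j] * sqrtf(sigma2 + xbl[j]*xbl[j]);
        } else {
            // unweighted (no imatrix) case
            weight[j] = xbl[j]*xbl[j];
        }
    }
}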

compilade (Collaborator, Author) commented Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants),

@jukofyork
If the row sizes are not multiples of the block sizes, then quantization cannot really happen anyway.

There is only one unique evaluation count per tensor when using MUL_MAT. The counts can only be distinct between 2D matrices when MUL_MAT_ID is used.

(Note that in this context, evaluation counts map to tokens, to avoid depending on a chunk size)

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude.
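To make "similar order of magnitude" concrete: in the weighted mean, the neutral prior carries a fraction prior_weight / (count + prior_weight) of the total weight, so it matters while count is comparable to prior_weight and becomes negligible once count is much larger (tiny sketch, names mine):

static float prior_share(float count, float prior_weight) {
    // ~0.5 when count == prior_weight, ~0 when count >> prior_weight
    return prior_weight / (count + prior_weight);
}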

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement.
For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

To be clear this PR doesn't change what happens when the evaluation count is 0, that was already changed in #9400, and has the behavior you're describing. This also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are minimally 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.

jukofyork (Collaborator) commented Aug 5, 2025

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement. For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

Ah, thanks; I had forgotten about that other PR.

To be clear this PR doesn't change what happens when the evaluation count is 0, that was already changed in #9400, and has the behavior you're describing. This also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are minimally 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.

e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + prior_weight) / (count + prior_weight);

is the expected value of ((const float *) sums->data)[j*ne0 + i] always 1 then, e.g.:

e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + 1.0f * prior_weight) / (count + 1 * prior_weight);

One systematic way to try to set prior_weight would be to take lots of bootstrap samples (of the other tensors with more samples, or using a very large dataset) for different values of count.

Then see if the log-transformed values for each count fit a normal distribution with standard deviation scaled by sqrt(count). If they do, then you should be able to set prior_weight based on a 95% credible interval (or whatever you choose).

Even if it doesn't fit a log-normal or another parameterised distribution, you can still do this with the empirical CDF, provided enough data / bootstrap samples.
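A rough sketch of that bootstrap check, assuming act2 holds the per-token squared activations of one channel from a large calibration run (all names here are made up, not from the PR):

#include <cmath>
#include <random>
#include <vector>

// resample `count` squared activations with replacement n_boot times and return the
// standard deviation of the log of the resulting per-channel means; under the
// log-normal assumption this should shrink roughly like 1/sqrt(count)
static float bootstrap_log_sd(const std::vector<float> & act2, int count, int n_boot, std::mt19937 & rng) {
    std::uniform_int_distribution<size_t> pick(0, act2.size() - 1);
    std::vector<double> log_means;
    log_means.reserve(n_boot);
    for (int b = 0; b < n_boot; ++b) {
        double sum = 0.0;
        for (int t = 0; t < count; ++t) sum += act2[pick(rng)];
        log_means.push_back(std::log(sum / count));
    }
    double mean = 0.0;
    for (double v : log_means) mean += v;
    mean /= n_boot;
    double var = 0.0;
    for (double v : log_means) var += (v - mean) * (v - mean);
    return (float) std::sqrt(var / (n_boot - 1));
}

prior_weight could then be chosen so that the prior dominates whenever the corresponding credible interval is too wide.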

compilade (Collaborator, Author) commented Aug 5, 2025

is the expected value of ((const float *) sums->data)[j*ne0 + i] always 1

@jukofyork
It's not. It gets bigger deeper into the model; see #12718 (comment). If the expected value were 1, then Σ(Bias) (renamed to Σ(act²) since then, which is the row-wise sum of in_sum2 / counts) would be close to the embedding size, but it's not.

The reason why I'm initially making the prior pull the weighted mean towards 1 is to reduce the variance of the imatrix weights, since (at least for q[i] * s quants, not sure about q[i] * s - m) relative amplitude is the only way the quant_weights affect quantization. Based on experiments using https://github.com/compilade/rounding-experiments, scaling the quant_weights by a constant doesn't seem to affect anything, unlike modifying the values in other ways.
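To illustrate the relative-amplitude point: for q[i] * s quants the block scale is a ratio of two weighted sums, so multiplying every quant weight by the same constant cancels out (sketch under that assumption; names are placeholders):

#include <cstdint>

// best scale for fixed integer candidates q[i]: s = Σ w*x*q / Σ w*q*q,
// which is unchanged when all w[i] are scaled by the same constant
static float best_scale(const float * x, const int8_t * q, const float * w, int n) {
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < n; ++i) {
        num += w[i]*x[i]*q[i];
        den += w[i]*q[i]*q[i];
    }
    return den > 0.0f ? num/den : 0.0f;
}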

Writing the above, I realize pulling the weighted mean towards 1 is probably not the ideal approach to achieve that, and I will explore different ways to apply the prior weight.

One systematic way to try to set prior_weight would be to take lots of bootstrap samples

I didn't think about using bootstrap samples, that seems interesting.

Now I'm wondering if the experts in a MoE model handle substantially different activations or not. Intuitively I would guess they probably do, but maybe not that much. If they are similar enough, then using the imatrix data of other experts in the same stacked tensor could be viable.

Ooh, would ffn_gate_inp already contain the expected inputs for each expert? (maybe, but only for ffn_up_exps)

compilade (Collaborator, Author) commented Aug 5, 2025

I got some numbers for Qwen3-30B-A3B-Instruct-2507 at Q2_K (when using calibration_datav3.txt):

imatrix                              PPL (wiki.test.raw)
omit partial (legacy behavior)       8.4238
--prior-weight 0                     8.2969
--prior-weight 1, towards 1          8.3047
--prior-weight 16, towards 1         8.3003
--prior-weight 128, towards 1        8.3205
--prior-weight 0.125, towards mean   8.2773
--prior-weight 1, towards mean       8.3648
--prior-weight 16, towards mean      8.5330
--prior-weight 128, towards mean     8.8913

So it turns out going towards the mean is too strong (or wrong?), and that going towards 1 in the weighted average is weaker (but more stable?).
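Schematically, the two variants in the table differ only in the target value mixed into the weighted average (sketch with a generic target parameter; not the exact code from 46a8601):

static float channel_weight_towards(float sum_sq, float count, float prior_weight, float target) {
    // target == 1.0f                                    -> "towards 1"
    // target == mean of the observed per-channel values -> "towards mean"
    return (sum_sq + target * prior_weight) / (count + prior_weight);
}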

I think I might revert 46a8601, since using the mean doesn't seem beneficial. Although apparently using a very small prior weight towards the mean gives good results, I'm not sure why. Might need to test more small values and/or try to understand why this happens. But maybe equal weights really are neutral weights, and going towards the mean too strongly ignores imatrix weights?

The next thing I'll try will probably be figuring out how to choose a good prior weight based on a confidence level (e.g. 95%), instead of (or maybe still multiplied with) a fixed value (which otherwise makes the relative weight inversely proportional to the counts), although I would need to assume a particular distribution for the imatrix weights.

jukofyork (Collaborator) commented

Maybe try using the geometric mean instead?

jukofyork (Collaborator) commented Aug 7, 2025

I've been thinking about this over the last couple of days and I wonder if MoE tensors should actually use a 2-level prior:

  • No data for any MoE tensors = equal weights.
  • Small amount of data for a specific MoE tensor = use mostly the weighted average of all MoE tensors.
  • Large amount of data for a specific MoE tensor = use mostly the tensor's own data.

Now I'm wondering if the experts in a MoE model handle substantially different activations or not. Intuitively I would guess they probably do, but maybe not that much. If they are similar enough, then using the imatrix data of other experts in the same stacked tensor could be viable.

I don't think they will be all that different:

  • Each gating vector defines a "(half) hypercone" (not sure what the proper term is) in the very high-dimensional hidden_dim space.
  • The area of this space where the given MoE tensor "wins" is then defined by a Voronoi mesh over these "(half) hypercones".
  • I think it would be fairly unlikely that the distributions inside each of these would be wildly different for the remaining hidden_dim - 1 dimensions (especially given all the layer_norm operations).

How to do this in a principled way without adding extra hyper-parameters is another thing to think about though...
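A minimal sketch of that 2-level idea, just to make it concrete (hypothetical, not something in this PR; names are made up): shrink each expert's per-channel statistic towards the average over all experts of the stacked tensor, and let that shared average itself fall back to the neutral value 1 when there is little or no data.

// expert_sum/expert_count: sum of squared activations and token count for one expert
// all_sum/all_count:       the same, accumulated over all experts of the stacked tensor
static float two_level_weight(float expert_sum, float expert_count,
                              float all_sum,    float all_count,
                              float w_expert,   float w_neutral) {
    const float shared = (all_sum + 1.0f * w_neutral) / (all_count + w_neutral);
    return (expert_sum + shared * w_expert) / (expert_count + w_expert);
}

With no data anywhere this gives equal weights, with little expert data it leans on the shared average, and with lots of expert data it converges to the expert's own statistics.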

jukofyork (Collaborator) commented

One other interesting thing to look at would be the rank correlation of the gating mechanism. Are the rarely chosen tensors being rarely chosen because:

  1. They point in some wildly different direction to most of the hidden states (ie: their "(half) hypercone" defines an area of space that rarely catches anything).
  2. Some other MoE tensor is "dominating" them and gets chosen either because it matches the hidden state direction slightly better or, in the case of the newer MoE models, because it has a higher bias term, etc.

jukofyork (Collaborator) commented Aug 7, 2025

Maybe try using the geometric mean instead?

It's definitely worth trying this as:

  1. The squared values are all strictly positive.
  2. The odd-looking result of --prior-weight 0.125 giving the best perplexity may actually move much closer to --prior-weight 1.0, since the geometric mean is never greater than the arithmetic mean.
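A quick sketch of how the geometric mean could slot into the same prior scheme (hypothetical; names are placeholders): accumulate log(x*x) per channel instead of x*x, so the neutral prior contributes log(1) = 0 to the numerator while still counting in the denominator.

#include <cmath>

static float channel_weight_geometric(float log_sum_sq, float count, float prior_weight) {
    // exp of the weighted average of log(x*x), with prior_weight neutral tokens at log(1) = 0
    return expf((log_sum_sq + 0.0f * prior_weight) / (count + prior_weight));
}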
