
quantize : configurable neutral imatrix prior #15060

Draft · compilade wants to merge 4 commits into master

Conversation

compilade (Collaborator) commented Aug 3, 2025

Follow-up from #9400 (comment) to allow experimenting with different weights for the neutral prior with GGUF-based imatrix.

This should help avoid some NaNs and/or unstable quantization with MoE models that have rarely-used experts.

Basically, imatrix weights are per-channel averages of squared activations. This new feature inserts 1 (the neutral weight value when no imatrix is provided) into that average with a configurable weight (in the sense of a weighted average).

The default is 1 (on master, this was technically 0), which means the neutral weight is worth as much as a single token from the calibration dataset.
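A minimal sketch of that weighted average (the helper and argument names below are placeholders, not the PR's code), assuming sum_sq is the accumulated sum of squared activations for one channel and count is the number of tokens seen:

static float channel_weight(float sum_sq, float count, float prior_weight) {
    // weighted average of the observed mean squared activation (sum_sq / count)
    // and the neutral value 1.0f, where prior_weight is the number of
    // "neutral tokens" mixed into the average
    return (sum_sq + 1.0f * prior_weight) / (count + prior_weight);
}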

This only works with GGUF-based imatrix files, because they store per-expert activation counts (unlike the imatrix.dat format).

What I don't know is if it would be better to use some different value than 1 token, so I've made it configurable with --prior-weight to make it easier to test different values. ("prior weight" might not be an intuitive name; suggestions welcome. "neutral tokens", maybe?)

Example usage:

$ ./llama-quantize --imatrix imatrix.gguf --prior-weight 128 model-F16.gguf model-Q4_K_M.gguf q4_k_m

When --prior-weight is not specified, --prior-weight 1 is implied.

To get the same behavior as before this PR, --prior-weight 0 can be used.

TODO

  • Make sure to read the contributing guidelines before submitting a PR

On Aug 3, 2025, @compilade added the labels: generation quality (Quality of model output), research 🔬, and need feedback (Testing and feedback with results are needed).
CISC (Collaborator) commented Aug 3, 2025

I remember this one being horribly broken: https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

Edit: Here's an imatrix with NaNs: https://huggingface.co/legraphista/Qwen2-57B-A14B-Instruct-IMat-GGUF

bartowski1182 (Contributor) commented Aug 3, 2025

I have been having a horrible time with the latest Qwen3 30B. I've tried 10 MB diverse datasets, even a custom one I made last time that really forced diversity, and no dice with any of them.

I tried a ton of datasets and they all failed, even on Q5_K.

I'll try to organize them and upload them tomorrow for reference; I'm running another test tonight.

jukofyork (Collaborator) commented Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants); as long as all rows are a multiple of 256, any that get no samples during imatrix creation "should" end up with equal weights.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here:

ikawrakow/ik_llama.cpp#140

weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j])

so using an equal weight here for the MoE tensors with no samples simplifies to:

weight[j] = 1 * sqrtf(sigma2 + xb[j]*xb[j]) = sqrtf(sigma2 + xb[j]*xb[j])

rather than:

weight[j] = xb[j]*xb[j]

where:

const float * xbl = x + QK_K*ibl;                       // start of the current super-block
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];  // sum of squares over the super-block
float sigma2 = 2*sumx2/QK_K;                            // twice the mean squared value of the super-block

If using qw[j] = 1 is still causing problems for MoE models, then the problem can only lie here.
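For reference, a self-contained consolidation of the snippets above (my own framing; the actual k-quants code computes sigma2 per super-block and applies the weights per sub-block):

#include <cmath>
#define QK_K 256

// weights for one super-block xbl[0..QK_K-1]; qw is the imatrix data for this block (or NULL)
static void block_weights(const float * xbl, const float * qw, float * weight) {
    float sumx2 = 0;
    for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
    const float sigma2 = 2*sumx2/QK_K;
    for (int j = 0; j < QK_K; ++j) {
        if (qw) {
            // with qw[j] == 1 this reduces to sqrtf(sigma2 + xbl[j]*xbl[j])
            weight[j] = qw[j] * sqrtf(sigma2 + xbl[j]*xbl[j]);
        } else {
            // unweighted (no imatrix) case
            weight[j] = xbl[j]*xbl[j];
        }
    }
}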

compilade (Collaborator, Author) commented Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants),

@jukofyork
If the row sizes are not multiples of the block sizes, then quantization cannot really happen anyway.

There is only one unique evaluation count per tensor when using MUL_MAT. The counts can only be distinct between 2D matrices when MUL_MAT_ID is used.

(Note that in this context, evaluation counts map to tokens, to avoid depending on a chunk size)

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude.
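To make "similar order of magnitude" concrete: in the weighted mean, the neutral prior carries a fraction prior_weight / (count + prior_weight) of the total weight, so it matters while count is comparable to prior_weight and becomes negligible once count is much larger (tiny sketch, names mine):

static float prior_share(float count, float prior_weight) {
    // ~0.5 when count == prior_weight, ~0 when count >> prior_weight
    return prior_weight / (count + prior_weight);
}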

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement.
For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

To be clear this PR doesn't change what happens when the evaluation count is 0, that was already changed in #9400, and has the behavior you're describing. This also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are minimally 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.

jukofyork (Collaborator) commented Aug 5, 2025

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement. For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

Ah, thanks; I had forgotten about that other PR.

To be clear this PR doesn't change what happens when the evaluation count is 0, that was already changed in #9400, and has the behavior you're describing. This also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are minimally 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.

e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + prior_weight) / (count + prior_weight);

is the expected value of ((const float *) sums->data)[j*ne0 + i] always 1 then, e.g.:

e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + 1.0f * prior_weight) / (count + 1 * prior_weight);

One systematic way to try to set prior_weight would be to take lots of bootstrap samples (of the other tensors with more samples, or using a very large dataset) for different values of count.

Then see if the log-transformed values for each count fit a normal distribution with standard deviation scaled by sqrt(count). If they do, then you should be able to set prior_weight based on a 95% credible interval (or whatever you choose).

Even if it doesn't fit a log-normal or another parameterised distribution, you can still do this with the empirical CDF, provided enough data / bootstrap samples.
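A rough sketch of that bootstrap check, assuming act2 holds the per-token squared activations of one channel from a large calibration run (all names here are made up, not from the PR):

#include <cmath>
#include <random>
#include <vector>

// resample `count` squared activations with replacement n_boot times and return the
// standard deviation of the log of the resulting per-channel means; under the
// log-normal assumption this should shrink roughly like 1/sqrt(count)
static float bootstrap_log_sd(const std::vector<float> & act2, int count, int n_boot, std::mt19937 & rng) {
    std::uniform_int_distribution<size_t> pick(0, act2.size() - 1);
    std::vector<double> log_means;
    log_means.reserve(n_boot);
    for (int b = 0; b < n_boot; ++b) {
        double sum = 0.0;
        for (int t = 0; t < count; ++t) sum += act2[pick(rng)];
        log_means.push_back(std::log(sum / count));
    }
    double mean = 0.0;
    for (double v : log_means) mean += v;
    mean /= n_boot;
    double var = 0.0;
    for (double v : log_means) var += (v - mean) * (v - mean);
    return (float) std::sqrt(var / (n_boot - 1));
}

prior_weight could then be chosen so that the prior dominates whenever the corresponding credible interval is too wide.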

compilade (Collaborator, Author) commented Aug 5, 2025

is the expected value of ((const float *) sums->data)[j*ne0 + i] always 1

@jukofyork
It's not. It gets bigger deeper into the model; see #12718 (comment). If the expected value were 1, then Σ(Bias) (renamed to Σ(act²) since then, which is the row-wise sum of in_sum2 / counts) would be close to the embedding size, but it's not.

The reason why I'm initially making the prior pull the weighted mean towards 1 is to reduce the variance of the imatrix weights, since (at least for q[i] * s quants, not sure about q[i] * s - m) relative amplitude is the only way the quant_weights affect quantization. Based on experiments using https://github.com/compilade/rounding-experiments, scaling the quant_weights by a constant doesn't seem to affect anything, unlike modifying the values in other ways.
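To illustrate the relative-amplitude point: for q[i] * s quants the block scale is a ratio of two weighted sums, so multiplying every quant weight by the same constant cancels out (sketch under that assumption; names are placeholders):

#include <cstdint>

// best scale for fixed integer candidates q[i]: s = Σ w*x*q / Σ w*q*q,
// which is unchanged when all w[i] are scaled by the same constant
static float best_scale(const float * x, const int8_t * q, const float * w, int n) {
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < n; ++i) {
        num += w[i]*x[i]*q[i];
        den += w[i]*q[i]*q[i];
    }
    return den > 0.0f ? num/den : 0.0f;
}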

Writing the above, I realize pulling the weighted mean towards 1 is probably not the ideal approach to achieve that, and I will explore different ways to apply the prior weight.

One systematic way to try to set prior_weight would be to take lots of bootstrap samples

I didn't think about using bootstrap samples, that seems interesting.

Now I'm wondering if the experts in a MoE model handle substantially different activations or not. Intuitively I would guess they probably do, but maybe not that much. If they are similar enough, then using the imatrix data of other experts in the same stacked tensor could be viable.

Ooh, would ffn_gate_inp already contain the expected inputs for each expert? (maybe, but only for ffn_up_exps)

compilade (Collaborator, Author) commented Aug 5, 2025

I got some numbers for Qwen3-30B-A3B-Instruct-2507 at Q2_K (when using calibration_datav3.txt):

imatrix                              PPL (wiki.test.raw)
omit partial (legacy behavior)       8.4238
--prior-weight 0                     8.2969
--prior-weight 1, towards 1          8.3047
--prior-weight 16, towards 1         8.3003
--prior-weight 128, towards 1        8.3205
--prior-weight 0.125, towards mean   8.2773
--prior-weight 1, towards mean       8.3648
--prior-weight 16, towards mean      8.5330
--prior-weight 128, towards mean     8.8913

So it turns out going towards the mean is too strong (or wrong?), and that going towards 1 in the weighted average is weaker (but more stable?).
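Schematically, the two variants in the table differ only in the target value mixed into the weighted average (sketch with a generic target parameter; not the exact code from 46a8601):

static float channel_weight_towards(float sum_sq, float count, float prior_weight, float target) {
    // target == 1.0f                                    -> "towards 1"
    // target == mean of the observed per-channel values -> "towards mean"
    return (sum_sq + target * prior_weight) / (count + prior_weight);
}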

I think I might revert 46a8601, since using the mean doesn't seem beneficial. Although apparently using a very small prior weight towards the mean gives good results, I'm not sure why. Might need to test more small values and/or try to understand why this happens. But maybe equal weights really are neutral weights, and going towards the mean too strongly ignores imatrix weights?

The next thing I'll try will probably be figuring out how to choose a good prior weight based on a confidence level (e.g. 95%), instead of (or maybe still multiplied with) a fixed value (which otherwise makes the relative weight inversely proportional to the counts), although I would need to assume a particular distribution for the imatrix weights.

jukofyork (Collaborator) commented

Maybe try using the geometric mean instead?

jukofyork (Collaborator) commented Aug 7, 2025

I've been thinking about this over the last couple of days and I wonder if MoE tensors should actually use a 2-level prior:

  • No data for any MoE tensors = equal weights.
  • Small amount of data for a specific MoE tensor = use mostly the weighted average of all MoE tensors.
  • Large amount of data for a specific MoE tensor = use mostly the tensor's own data.

Now I'm wondering if the experts in a MoE model handle substantially different activations or not. Intuitively I would guess they probably do, but maybe not that much. If they are similar enough, then using the imatrix data of other experts in the same stacked tensor could be viable.

I don't think they will be all that different:

  • Each gating vector defines a "(half) hypercone" (not sure what the proper term is) in the very high-dimensional hidden_dim space.
  • The area of this space where the given MoE tensor "wins" is then defined by a Voronoi mesh over these "(half) hypercones".
  • I think it would be fairly unlikely that the distributions inside each of these would be wildly different for the remaining hidden_dim - 1 dimensions (especially given all the layer_norm operations).

How to do this in a principled way without adding extra hyper-parameters is another thing to think about though...
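A minimal sketch of that 2-level idea, just to make it concrete (hypothetical, not something in this PR; names are made up): shrink each expert's per-channel statistic towards the average over all experts of the stacked tensor, and let that shared average itself fall back to the neutral value 1 when there is little or no data.

// expert_sum/expert_count: sum of squared activations and token count for one expert
// all_sum/all_count:       the same, accumulated over all experts of the stacked tensor
static float two_level_weight(float expert_sum, float expert_count,
                              float all_sum,    float all_count,
                              float w_expert,   float w_neutral) {
    const float shared = (all_sum + 1.0f * w_neutral) / (all_count + w_neutral);
    return (expert_sum + shared * w_expert) / (expert_count + w_expert);
}

With no data anywhere this gives equal weights, with little expert data it leans on the shared average, and with lots of expert data it converges to the expert's own statistics.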

jukofyork (Collaborator) commented

One other interesting thing to look at would be the rank correlation of the gating mechanism. Are the rarely chosen tensors being rarely chosen because:

  1. They point in some wildly different direction to most of the hidden states (ie: their "(half) hypercone" defines an area of space that rarely catches anything).
  2. Some other MoE tensor is "dominating" them and gets chosen either because it matches the hidden state direction slightly better or, in the case of the newer MoE models, because it has a higher bias term, etc.

jukofyork (Collaborator) commented Aug 7, 2025

Maybe try using the geometric mean instead?

It's definitely worth trying this as:

  1. The squared values are all strictly positive.
  2. The odd-looking result of --prior-weight 0.125 giving the best perplexity may actually move much closer to --prior-weight 1.0, since the geometric mean is never greater than the arithmetic mean.
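A quick sketch of how the geometric mean could slot into the same prior scheme (hypothetical; names are placeholders): accumulate log(x*x) per channel instead of x*x, so the neutral prior contributes log(1) = 0 to the numerator while still counting in the denominator.

#include <cmath>

static float channel_weight_geometric(float log_sum_sq, float count, float prior_weight) {
    // exp of the weighted average of log(x*x), with prior_weight neutral tokens at log(1) = 0
    return expf((log_sum_sq + 0.0f * prior_weight) / (count + prior_weight));
}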
