Limitations of the current YaRN RoPE implementation and how to fix them #7416
Replies: 2 comments 2 replies
-
Thank you for the detailed information and for taking a look into this.
Maybe. But I guess if there is a viable workaround that does not involve changing…
-
Note the following change that will be merged soon: https://github.com/ggerganov/llama.cpp/pull/7225/files

`cparams.yarn_attn_factor *= hparams.rope_attn_factor;`
-
I'd like to discuss the limitations of the current llama.cpp YaRN RoPE implementation that affect the implementation of DeepSeek-V2 support, and possible ways to fix them. Let me start with an introduction to the problem.
YaRN RoPE modifies the attention equation by adding a scaling factor $t$ in the denominator:
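With $|D|$ the head dimension, the scaled attention weights are roughly:

$$\mathrm{softmax}\!\left(\frac{q_m^{\top} k_n}{t\sqrt{|D|}}\right)$$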
The original paper mentioned a trick that allows leaving the attention implementation unmodified:
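The idea is to scale both $q_m$ and $k_n$ by $\sqrt{1/t}$ up front, which gives the same result without touching the attention code:

$$\mathrm{softmax}\!\left(\frac{\left(\sqrt{1/t}\,q_m\right)^{\top}\left(\sqrt{1/t}\,k_n\right)}{\sqrt{|D|}}\right)=\mathrm{softmax}\!\left(\frac{q_m^{\top} k_n}{t\sqrt{|D|}}\right)$$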
If I understand correctly, most YaRN RoPE implementations use this "trick" in the form of an mscale variable equal to $\sqrt{1/t} = 0.1 \ln(s) + 1$, where $s$ is the RoPE scaling factor. They multiply the calculated RoPE sin/cos values by mscale, which results in query and key vectors scaled by $\sqrt{1/t}$ when the RoPE rotations are applied, and in their dot products scaled by $\sqrt{1/t} \cdot \sqrt{1/t} = 1/t$, as required by the YaRN attention equation.
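For illustration, a minimal sketch of where mscale enters (plain RoPE angles, ignoring the YaRN frequency interpolation; the names are made up):

```python
import math
import numpy as np

def yarn_mscale(s: float) -> float:
    # sqrt(1/t) = 0.1 * ln(s) + 1 from the YaRN paper; no extra scaling when s <= 1
    if s <= 1.0:
        return 1.0
    return 0.1 * math.log(s) + 1.0

def rope_cos_sin(positions: np.ndarray, head_dim: int, base: float = 10000.0, s: float = 1.0):
    # Standard RoPE angles (the YaRN frequency interpolation itself is omitted here).
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(positions, inv_freq)
    # Multiplying cos/sin by mscale makes every rotated q/k vector come out
    # scaled by sqrt(1/t), so q·k ends up scaled by 1/t after the dot product.
    m = yarn_mscale(s)
    return np.cos(angles) * m, np.sin(angles) * m

cos, sin = rope_cos_sin(np.arange(8), head_dim=64, s=4.0)
```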
This "trick" works correctly when RoPE is applied to whole query and key vectors. But there are models like DeepSeek-V2 where RoPE is applied only to a part of the query/key vector. That leaves the remaining parts of they query/key vectors unscaled and results in attention output that is calculated incorrectly.
In the DeepSeek-V2 transformers implementation this problem is solved in a very confusing way, where the mscale calculation is modified so that the resulting mscale value equals 1.0 (note that self.mscale and self.mscale_all_dim are both set to 0.707 in config.json):
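Roughly (a simplified sketch with example values, not a verbatim quote of the modeling code):

```python
import math

def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    # DeepSeek-V2's generalized mscale formula (more on this below)
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

scaling_factor = 40.0   # example value of the YaRN scaling factor s
mscale = 0.707          # rope_scaling["mscale"] in config.json
mscale_all_dim = 0.707  # rope_scaling["mscale_all_dim"] in config.json

# Because mscale == mscale_all_dim, the ratio is exactly 1.0,
# i.e. the RoPE sin/cos values end up not being scaled at all.
_mscale = yarn_get_mscale(scaling_factor, mscale) / yarn_get_mscale(scaling_factor, mscale_all_dim)
print(_mscale)  # 1.0
```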
In this implementation applying the RoPE rotations does not scale the key/query vectors at all. Instead, they take the attention softmax_scale value and multiply it by mscale*mscale (so essentially by 1/t, as required):
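The softmax_scale part looks roughly like this (again simplified, with example values):

```python
import math

q_head_dim = 192        # example head dimension
scaling_factor = 40.0   # rope_scaling["factor"]
mscale_all_dim = 0.707  # rope_scaling["mscale_all_dim"]

# Base attention scale 1/sqrt(d) ...
softmax_scale = q_head_dim ** (-0.5)

# ... multiplied by mscale*mscale, where mscale = yarn_get_mscale(scaling_factor, mscale_all_dim).
# This scales the attention logits by 1/t, as the YaRN attention equation requires.
mscale = 0.1 * mscale_all_dim * math.log(scaling_factor) + 1.0
softmax_scale = softmax_scale * mscale * mscale
```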
They also changed the way mscale is calculated:
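In effect the formula becomes

$$\sqrt{1/t} = 0.1 \cdot \mathrm{mscale} \cdot \ln(s) + 1 \qquad \text{instead of} \qquad \sqrt{1/t} = 0.1 \ln(s) + 1,$$

with mscale taken from the rope_scaling section of config.json.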
This is extra confusing, since the mscale name is used for multiple different values in the configuration and the code.
Their paper is clearer, saying that they simply modified the way $\sqrt{t}$ is calculated to $\sqrt{t} = 0.0707 \ln(s) + 1$. By the way, I think there is an error in the paper, as it should be $\sqrt{1/t}$, not $\sqrt{t}$. So it looks like they changed the 0.1 constant from the original YaRN paper to 0.0707 (which is simply 0.1 multiplied by the 0.707 mscale value from config.json).
This is getting long, so let me sum up the problems I see in two points:
- The $\sqrt{1/t}$ scaling is applied by multiplying the RoPE sin/cos values, which is only correct when RoPE covers the whole query/key vectors; for partial-RoPE models like DeepSeek-V2 the non-rotated parts stay unscaled and the attention output is calculated incorrectly.
- The 0.1 constant used when calculating mscale is fixed, while DeepSeek-V2 effectively uses 0.0707, so there is currently no clean way to express a per-model value.
Note that this is not an urgent or critical problem. I managed to work around it when implementing DeepSeek-V2 in llama.cpp by pre-scaling the attn_factor passed to the ggml_rope_custom() call so that the resulting calculated mscale is 1.0 (like in the transformers implementation), calculating mscale myself, and scaling the kq_scale passed to llm_build_kv() by multiplying it with mscale*mscale. So it's a solution similar to the one the DeepSeek-V2 authors used in the transformers code. But I'm not happy with it; in my opinion it's very ugly and confusing.
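To spell out the arithmetic (assuming the mscale computed inside ggml_rope_custom() has the form $\mathrm{attn\_factor}\cdot(0.1\ln(s)+1)$): passing

$$\mathrm{attn\_factor} = \frac{1}{0.1\ln(s)+1}$$

makes the internally calculated mscale equal to 1.0, and the attention scale becomes

$$\mathrm{kq\_scale} = \frac{1}{\sqrt{d}}\cdot\big(0.1\cdot 0.707\cdot\ln(s)+1\big)^2 = \frac{1}{t\sqrt{d}},$$

where $d$ is the attention head dimension and $s$ is the RoPE scaling factor.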
I guess the future will show whether other models start using different constant values when calculating mscale, like DeepSeek-V2 did. For now, I think we should at least find a name for the 0.1 constant value that DeepSeek-V2 changed to 0.0707. I don't want to call it mscale; any other suggestions (mscale_scale :D)?