Limitations of the current YaRN RoPE implementation and how to fix them #7416
Replies: 2 comments 2 replies
-
Thank you for the detailed information and for taking a look into this.
Maybe. But I guess if there is a viable workaround that does not involve changing…
-
Note the following change that will be merged soon: https://github.com/ggerganov/llama.cpp/pull/7225/files

`cparams.yarn_attn_factor *= hparams.rope_attn_factor;`
-
I'd like to discuss the limitations of the current llama.cpp YaRN RoPE implementation that affect the implementation of DeepSeek-V2 support, and possible ways to fix them. Let me start with an introduction to the problem.
YaRN RoPE modifies the attention equation by adding a scaling factor $t$ in the denominator:
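With $|D|$ the head dimension, the scaled attention weights are roughly:

$$\mathrm{softmax}\!\left(\frac{q_m^{\top} k_n}{t\sqrt{|D|}}\right)$$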
The original paper mentioned a trick that allows leaving the attention implementation unmodified:
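The idea is to scale both $q_m$ and $k_n$ by $\sqrt{1/t}$ up front, which gives the same result without touching the attention code:

$$\mathrm{softmax}\!\left(\frac{\left(\sqrt{1/t}\,q_m\right)^{\top}\left(\sqrt{1/t}\,k_n\right)}{\sqrt{|D|}}\right)=\mathrm{softmax}\!\left(\frac{q_m^{\top} k_n}{t\sqrt{|D|}}\right)$$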
If I understand correctly, most YaRN RoPE implementations use this "trick" in the form of an mscale variable equal to $\sqrt{1/t} = 0.1 \ln(s) + 1$, where $s$ is the RoPE scaling factor. They multiply the calculated RoPE sin/cos values by mscale, which results in query and key vectors scaled by $\sqrt{1/t}$ when the RoPE rotations are applied, and in their dot products scaled by $\sqrt{1/t} \cdot \sqrt{1/t} = 1/t$, as required by the YaRN attention equation.
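For illustration, a minimal sketch of where mscale enters (plain RoPE angles, ignoring the YaRN frequency interpolation; the names are made up):

```python
import math
import numpy as np

def yarn_mscale(s: float) -> float:
    # sqrt(1/t) = 0.1 * ln(s) + 1 from the YaRN paper; no extra scaling when s <= 1
    if s <= 1.0:
        return 1.0
    return 0.1 * math.log(s) + 1.0

def rope_cos_sin(positions: np.ndarray, head_dim: int, base: float = 10000.0, s: float = 1.0):
    # Standard RoPE angles (the YaRN frequency interpolation itself is omitted here).
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(positions, inv_freq)
    # Multiplying cos/sin by mscale makes every rotated q/k vector come out
    # scaled by sqrt(1/t), so q·k ends up scaled by 1/t after the dot product.
    m = yarn_mscale(s)
    return np.cos(angles) * m, np.sin(angles) * m

cos, sin = rope_cos_sin(np.arange(8), head_dim=64, s=4.0)
```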
This "trick" works correctly when RoPE is applied to whole query and key vectors. But there are models like DeepSeek-V2 where RoPE is applied only to a part of the query/key vector. That leaves the remaining parts of they query/key vectors unscaled and results in attention output that is calculated incorrectly.
In the DeepSeek-V2 transformers implementation this problem is solved in a very confusing way, where the mscale calculation is modified so that the resulting mscale value equals 1.0 (note that self.mscale and self.mscale_all_dim are both set to 0.707 in config.json):
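Roughly (a simplified sketch with example values, not a verbatim quote of the modeling code):

```python
import math

def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    # DeepSeek-V2's generalized mscale formula (more on this below)
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

scaling_factor = 40.0   # example value of the YaRN scaling factor s
mscale = 0.707          # rope_scaling["mscale"] in config.json
mscale_all_dim = 0.707  # rope_scaling["mscale_all_dim"] in config.json

# Because mscale == mscale_all_dim, the ratio is exactly 1.0,
# i.e. the RoPE sin/cos values end up not being scaled at all.
_mscale = yarn_get_mscale(scaling_factor, mscale) / yarn_get_mscale(scaling_factor, mscale_all_dim)
print(_mscale)  # 1.0
```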
In this implementation applying the RoPE rotations does not scale the key/query vectors at all. Instead, they take the attention softmax_scale value and multiply it by mscale*mscale (so essentially by 1/t, as required):
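The softmax_scale part looks roughly like this (again simplified, with example values):

```python
import math

q_head_dim = 192        # example head dimension
scaling_factor = 40.0   # rope_scaling["factor"]
mscale_all_dim = 0.707  # rope_scaling["mscale_all_dim"]

# Base attention scale 1/sqrt(d) ...
softmax_scale = q_head_dim ** (-0.5)

# ... multiplied by mscale*mscale, where mscale = yarn_get_mscale(scaling_factor, mscale_all_dim).
# This scales the attention logits by 1/t, as the YaRN attention equation requires.
mscale = 0.1 * mscale_all_dim * math.log(scaling_factor) + 1.0
softmax_scale = softmax_scale * mscale * mscale
```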
They also changed the way mscale is calculated:
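In effect the formula becomes

$$\sqrt{1/t} = 0.1 \cdot \mathrm{mscale} \cdot \ln(s) + 1 \qquad \text{instead of} \qquad \sqrt{1/t} = 0.1 \ln(s) + 1,$$

with mscale taken from the rope_scaling section of config.json.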
This is extra confusing, since the mscale name is used for multiple different values in the configuration and the code.
Their paper is clearer, saying that they simply modified the way $\sqrt{t}$ is calculated to $\sqrt{t} = 0.0707 \ln(s) + 1$. By the way, I think there is an error in the paper, as it should be $\sqrt{1/t}$, not $\sqrt{t}$. So it looks like they changed the 0.1 constant from the original YaRN paper to 0.0707 (which is simply 0.1 multiplied by the 0.707 mscale value from config.json).
This is getting long, so let me sum up the problems I see in two points:
- The $\sqrt{1/t}$ scaling is applied by multiplying the RoPE sin/cos values, which is only correct when RoPE covers the whole query/key vectors; for partial-RoPE models like DeepSeek-V2 the non-rotated parts stay unscaled and the attention output is calculated incorrectly.
- The 0.1 constant used when calculating mscale is fixed, while DeepSeek-V2 effectively uses 0.0707, so there is currently no clean way to express a per-model value.
Note that this is not an urgent or critical problem. I managed to work around it when implementing DeepSeek-V2 in llama.cpp by pre-scaling the attn_factor passed to the ggml_rope_custom() call so that the resulting calculated mscale is 1.0 (like in the transformers implementation), calculating mscale myself, and scaling the kq_scale passed to llm_build_kv() by multiplying it with mscale*mscale. So it's a solution similar to the one the DeepSeek-V2 authors used in the transformers code. But I'm not happy with it; in my opinion it's very ugly and confusing.
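To spell out the arithmetic (assuming the mscale computed inside ggml_rope_custom() has the form $\mathrm{attn\_factor}\cdot(0.1\ln(s)+1)$): passing

$$\mathrm{attn\_factor} = \frac{1}{0.1\ln(s)+1}$$

makes the internally calculated mscale equal to 1.0, and the attention scale becomes

$$\mathrm{kq\_scale} = \frac{1}{\sqrt{d}}\cdot\big(0.1\cdot 0.707\cdot\ln(s)+1\big)^2 = \frac{1}{t\sqrt{d}},$$

where $d$ is the attention head dimension and $s$ is the RoPE scaling factor.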
I guess the future will show whether other models start using different constant values when calculating mscale, like DeepSeek-V2 did. For now, I think we should at least find a name for the 0.1 constant value that DeepSeek-V2 changed to 0.0707. I don't want to call it mscale; any other suggestions (mscale_scale :D)?