DesmonDay
Contributor

PR types

New features

PR changes

Others

Description

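In the current model implementation, `self_attn.kv_a_proj_with_mqa` and `self_attn.q_a_proj` are plain `nn.Linear` layers within a TP group, but their gradients are not synchronized across that group, so the parameters themselves drift apart. In effect, these parameters are independent on each TP rank, and the values saved from different ranks naturally differ. Because uc can save each parameter only once, the loss after a warm start does not match the loss before the checkpoint.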
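The divergence described above can be reproduced with a small simulation. This is a minimal sketch in pure Python and does not use any Paddle API: the per-rank gradients, the learning rate, and the helper names (`sgd_step`, `all_reduce_mean`, `train`) are all hypothetical, and averaging gradients across ranks is shown only as one common way to keep replicated parameters identical, not necessarily the change made in this PR.

```python
# Toy simulation of a replicated weight on two TP ranks.
# All names and numbers here are illustrative assumptions, not Paddle APIs.

def sgd_step(weight, grad, lr=0.1):
    """Plain SGD update on a flat list of floats."""
    return [w - lr * g for w, g in zip(weight, grad)]

def all_reduce_mean(grads_per_rank):
    """Average gradients element-wise and hand every rank the same result."""
    n = len(grads_per_rank)
    mean = [sum(vals) / n for vals in zip(*grads_per_rank)]
    return [list(mean) for _ in grads_per_rank]

def train(local_grads_per_step, init_weight, sync):
    """Run SGD on each rank's copy of the weight, optionally syncing grads."""
    num_ranks = len(local_grads_per_step[0])
    weights = [list(init_weight) for _ in range(num_ranks)]
    for grads in local_grads_per_step:
        if sync:
            grads = all_reduce_mean(grads)
        weights = [sgd_step(w, g) for w, g in zip(weights, grads)]
    return weights

# Hypothetical local gradients: 2 steps, 2 ranks, 2 weight elements.
steps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
init = [1.0, 1.0]

unsynced = train(steps, init, sync=False)
synced = train(steps, init, sync=True)

print(unsynced[0] == unsynced[1])  # False: the two copies have diverged
print(synced[0] == synced[1])      # True: the two copies stay identical
```

In the unsynchronized case, a checkpoint that keeps only one rank's copy restores a value that the other rank never held, which is consistent with the loss mismatch after a warm start.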

paddle-bot bot commented Sep 6, 2025

Thanks for your contribution!
