Hi, MiniMax team,
Congratulations on your great work! We have been following your recently published results with great interest — it is an exciting and impactful contribution to the field of large reasoning models.
We would like to bring to your attention a related paper from our team, which shares similar concepts and ideas with the CISPO approach you proposed: CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models (https://arxiv.org/abs/2505.12504).
In our work, specifically in Section 6.1 "Importance Sampling" of the Discussion, we introduced a stop-gradient version of the importance sampling ratio into the policy gradient loss, together with a clipping mechanism on that ratio, which is conceptually aligned with the core ideas of CISPO. Our code is also open-sourced at: https://github.com/ModalMinds/MM-EUREKA.
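For readers unfamiliar with the idea, the shared ingredient can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the implementation from either paper: the function name, the REINFORCE-style objective, and the default clipping thresholds are our own assumptions here.

```python
import torch

def stopgrad_clipped_pg_loss(logp_new, logp_old, advantages,
                             eps_low=0.2, eps_high=0.2):
    """Policy-gradient loss weighted by a clipped, stop-gradient IS ratio.

    Illustrative sketch only; names and epsilons are placeholders.
    """
    # Importance sampling ratio r = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(logp_new - logp_old)
    # Clip the ratio, then detach it (stop-gradient): the ratio only
    # reweights the update and is not itself differentiated.
    weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # REINFORCE-style term: gradients flow only through logp_new.
    return -(weight * advantages * logp_new).mean()
```

Because the clipped ratio is detached, every sampled token still contributes a gradient signal (through `logp_new`), rather than being zeroed out as in PPO-style clipping.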
Given the conceptual overlap and complementary insights, we believe it may be of interest and relevance to your work. If you find it appropriate, we would greatly appreciate it if you could consider citing our paper in a future revision or publication.
We look forward to seeing more insightful work from your team!