How is the REINFORCE formula in Equation 3 derived from the basic cumulative-reward form?

![Image](https://github.com/user-attachments/assets/0eecee1e-777e-4427-ae6a-10c7c99896c8)

The image above shows the commonly seen form of the formula.

How is the form below obtained from it? A detailed derivation would be appreciated.

![Image](https://github.com/user-attachments/assets/e18c9e23-b1b7-4c05-96c8-55c0403288f4)
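For reference, here is a sketch of the usual route from the expected return to an importance-weighted REINFORCE surrogate. It assumes that Equation 3 has the stop-gradient-ratio form; the symbols are my own assumptions and may not match the paper's notation: $q$ is the prompt, $o = (o_1, \dots, o_T)$ the sampled response, $R(q, o)$ the reward, $\hat{A}_t$ the advantage, $r_t(\theta)$ the token-level importance ratio, and $\mathrm{sg}(\cdot)$ the stop-gradient operator. Any length normalization is omitted.

```latex
\begin{align}
J(\theta)
  &= \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\big[ R(q, o) \big]
   = \sum_{o} \pi_\theta(o \mid q)\, R(q, o) \\
\nabla_\theta J(\theta)
  &= \sum_{o} \pi_\theta(o \mid q)\, \nabla_\theta \log \pi_\theta(o \mid q)\, R(q, o)
  && \text{using } \nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta \\
  &= \mathbb{E}_{o \sim \pi_\theta}\Big[ R(q, o) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big]
  && \text{autoregressive factorization} \\
  &= \mathbb{E}_{o \sim \pi_\theta}\Big[ \sum_{t=1}^{T} \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big]
  && \text{baseline independent of } o_t \\
  &\approx \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}}\Big[ \sum_{t=1}^{T} r_t(\theta)\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big],
  && r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} \\
  &= \nabla_\theta\, \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}}\Big[ \sum_{t=1}^{T} \mathrm{sg}\big(r_t(\theta)\big)\, \hat{A}_t\, \log \pi_\theta(o_t \mid q, o_{<t}) \Big]
\end{align}
```

The step that introduces $r_t(\theta)$ is an approximation: exact importance sampling would weight by the ratio of whole-sequence probabilities, while the token-level ratio is the usual PPO-style simplification. Since $\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta$, the last line has the same gradient as the unclipped surrogate $r_t(\theta)\hat{A}_t$.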

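Also, the paper says: "these tokens were clipped out after the first on-policy update, preventing them from contributing to subsequent off-policy gradient updates." Why are low-probability tokens with large changes clipped out after the first update? After clipping, the ratio becomes $1-\epsilon$, but doesn't the advantage computation that follows involve parameters that receive gradients? If so, wouldn't these tokens still affect the overall gradient?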
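To make the clipping question concrete, here is a minimal numerical sketch. It assumes the standard PPO/GRPO-style per-token clipped surrogate `min(r * A, clip(r, 1 - eps, 1 + eps) * A)` and treats the advantage as a constant computed from rewards, as is usual in those objectives; the paper's exact objective may differ. The function names and the numbers are hypothetical.

```python
import math

def ppo_clip_token_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def grad_wrt_logp_new(logp_new, logp_old, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the objective w.r.t. the new log-probability."""
    up = ppo_clip_token_objective(logp_new + h, logp_old, advantage, eps)
    down = ppo_clip_token_objective(logp_new - h, logp_old, advantage, eps)
    return (up - down) / (2 * h)

# Hypothetical rare token: old probability 0.01, positive advantage.
logp_old = math.log(0.01)
advantage = 1.0  # assumed constant: it does not depend on the policy parameters

# Ratio 1.1 lies inside [1 - eps, 1 + eps]: the token still receives a gradient.
g_inside = grad_wrt_logp_new(math.log(0.011), logp_old, advantage)

# Ratio 1.5 exceeds 1 + eps with A > 0: the min selects the constant clipped
# term, so the objective is flat in logp_new and the gradient vanishes.
g_clipped = grad_wrt_logp_new(math.log(0.015), logp_old, advantage)

print(g_inside, g_clipped)
```

Under these assumptions the first gradient is close to `ratio * advantage` and the second is zero: once the clipped branch is selected, nothing in that token's term depends on the parameters, because neither the clipped constant nor the advantage carries a gradient. A low-probability token can reach this regime quickly, since a small absolute change in its probability is a large change in its ratio.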

How is the REINFORCE formula in Equation 3 derived from the basic cumulative-reward form? #6