Description
The paper mentions that over 20,000 instructions categorized as “reasoning” were ultimately selected. However, the number of “reasoning”-related instructions I obtained from WildChat far exceeds 20,000. How were the final 20,000+ instructions used in the study selected from this larger set?

Separately, following the settings described in the paper, I trained the model for 2 epochs. In my experiments, performance on RewardBench peaks after about one epoch and then begins to decline.