Inquiry about the Self-Taught Evaluator #22

@sunrise224

The paper mentions that over 20,000 instructions categorized as “reasoning” were ultimately selected. However, the number of “reasoning” instructions obtained from WildChat far exceeds 20,000. How were the final 20,000 instructions used in the study selected from this larger set?

Also, following the settings described in the paper, I trained the model for 2 epochs, but in my experiments performance on RewardBench peaks after about one epoch and then declines.
