Inquiry about the Self-Taught Evaluator

The paper mentions that over 20,000 instructions categorized as "reasoning" were ultimately selected. However, the number of reasoning-related instructions I obtained from WildChat far exceeds 20,000. How were the final 20,000 instructions used in the study selected from this larger set?

Also, following the settings described in the paper, I trained the model for 2 epochs, but in my experiments performance on RewardBench peaks after about one epoch and then declines.

Inquiry about the Self-Taught Evaluator #22
