## Result

Here is our final training result; 50% of the SFT data comes from GraphGen.
| Domain | Dataset | our-7B-model | Qwen2.5-7B-Instruct |
|---|---|---|---|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Logic | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |
## Garbage in, garbage out

First, it is essential to ensure the input chunks are of high quality; a heuristic sketch follows the examples below.
- Positive example: a complete, self-contained story segment
- Negative example: a fragment of a paper citation that contains only the title and lacks context
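
As a rough illustration, a pre-filter for chunk quality might look like the sketch below. The threshold, the sentence-ending heuristic, and the `raw_chunks` input are all assumptions for illustration, not part of GraphGen itself; tune them for your corpus.

```python
import re

MIN_CHARS = 200  # illustrative threshold, not a GraphGen default

def is_good_chunk(chunk: str) -> bool:
    """Heuristically accept chunks that look like self-contained passages
    rather than truncated fragments such as bare citation titles."""
    text = chunk.strip()
    if len(text) < MIN_CHARS:
        return False
    # A complete passage should contain at least one sentence-ending mark.
    return bool(re.search(r"[.!?。！？]", text))

# Hypothetical input; replace with your own chunked corpus.
raw_chunks = ["Once upon a time, a farmer planted a seed...", "Smith et al., 2021"]
good_chunks = [c for c in raw_chunks if is_good_chunk(c)]
```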
Second, filter the QA pairs according to business needs. The synthetic QA data contains entity words, but not every entity is worth keeping (see the sketch after these examples).
- Positive example: the notable deeds of the company's boss (a business-relevant entity)
- Negative example: meaningless coreference targets, e.g. "fig 5.1" or "it"
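
One way to apply such a filter, sketched under the assumption that each synthetic pair carries the entity it was generated from (the field names and stop list here are hypothetical):

```python
import re

# Illustrative stop list of coreference targets that carry no meaning alone.
BAD_ENTITIES = {"it", "this", "that", "they"}
# Matches bare figure/table/equation references such as "fig 5.1".
REF_PATTERN = re.compile(r"^(fig|figure|table|eq)\.?\s*[\d.]+$", re.IGNORECASE)

def keep_pair(pair: dict) -> bool:
    entity = pair["entity"].strip().lower()  # hypothetical field name
    return entity not in BAD_ENTITIES and not REF_PATTERN.match(entity)

qa_pairs = [
    {"entity": "Acme Corp", "question": "...", "answer": "..."},
    {"entity": "fig 5.1", "question": "...", "answer": "..."},
]
filtered = [p for p in qa_pairs if keep_pair(p)]  # drops the "fig 5.1" pair
```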
## API usage
- Make sure the LLM API supports `logprobs` (e.g. `vllm serve` with `v0.6.6.post1`) and enable the Trainee Model for hard-case mining; a quick probe is sketched after this list. SiliconCloud on the OpenXLab web page is only a free trial; production use will not be free.
- Use a larger synthesizer model, and ensure the synthesizer and the trainee come from the same model family.
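
A minimal probe to confirm the endpoint actually returns token log-probabilities, using the OpenAI-compatible client; the base URL and model name are placeholders for your own deployment:

```python
from openai import OpenAI

# Point at your vLLM server, e.g. started with:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=1,
)

# If the endpoint supports logprobs, each token carries a log-probability;
# an endpoint without support will error or return logprobs=None here.
token = resp.choices[0].logprobs.content[0]
print(token.token, token.logprob)
```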