
Commit 7ec24d5

[Update] Doc update for Sep 25 refactor of DataFlow
1 parent 7dbd55f commit 7ec24d5

File tree: 14 files changed, +1479 −993 lines

docs/.vuepress/navbars/en/index.ts (4 additions, 4 deletions)

```diff
@@ -84,14 +84,14 @@ export const enNavbar = defineNavbarConfig([
     activeMatch: '^/guide/'
   },
   {
-    text: "Agentic RAG Pipeline-Alpha",
-    link: "/en/notes/guide/pipelines/AgenticRAGPipeline.md",
+    text: "Doc-to-QA Pipeline",
+    link: "/en/notes/guide/pipelines/Doc2QAPipeline.md",
     icon: "solar:palette-round-linear",
     activeMatch: '^/guide/'
   },
   {
-    text: "Agentic RAG Pipeline-Beta",
-    link: "/en/notes/guide/pipelines/AgenticRAGPipeline2.md",
+    text: "Agentic RAG Pipeline",
+    link: "/en/notes/guide/pipelines/AgenticRAGPipeline.md",
     icon: "solar:palette-round-linear",
     activeMatch: '^/guide/'
   },
```

docs/.vuepress/navbars/zh/index.ts (4 additions, 4 deletions)

```diff
@@ -86,14 +86,14 @@ export const zhNavbar = defineNavbarConfig([
     activeMatch: '^/guide/'
   },
   {
-    text: "Agentic RAG数据合成流水线-Alpha",
-    link: "/zh/notes/guide/pipelines/AgenticRAGPipeline.md",
+    text: "Doc-to-QA数据合成流水线",
+    link: "/zh/notes/guide/pipelines/Doc2QAPipeline.md",
     icon: "solar:palette-round-linear",
     activeMatch: '^/guide/'
   },
   {
-    text: "Agentic RAG数据合成流水线-Beta",
-    link: "/zh/notes/guide/pipelines/AgenticRAGPipeline2.md",
+    text: "Agentic RAG数据合成流水线",
+    link: "/zh/notes/guide/pipelines/AgenticRAGPipeline.md",
     icon: "solar:palette-round-linear",
     activeMatch: '^/guide/'
   },
```

docs/.vuepress/notes/en/guide.ts (1 addition, 1 deletion)

```diff
@@ -52,8 +52,8 @@ export const Guide: ThemeNote = defineNoteConfig({
     "TextPipeline",
     "ReasoningPipeline",
     "Text2SqlPipeline",
+    "Doc2QAPipeline",
     "AgenticRAGPipeline",
-    "AgenticRAGPipeline2",
     "RAREPipeline",
     "KnowledgeBaseCleaningPipeline",
     "FuncCallPipeline",
```

docs/.vuepress/notes/zh/guide.ts (1 addition, 1 deletion)

```diff
@@ -52,8 +52,8 @@ export const Guide: ThemeNote = defineNoteConfig({
     "TextPipeline",
     "ReasoningPipeline",
     "Text2SqlPipeline",
+    "Doc2QAPipeline",
     "AgenticRAGPipeline",
-    "AgenticRAGPipeline2",
     "RAREPipeline",
     "KnowledgeBaseCleaningPipeline",
     "FuncCallPipeline",
```

docs/en/notes/guide/domain_specific_operators/agenticrag_operators.md (33 additions, 202 deletions)

````diff
@@ -8,7 +8,7 @@ permalink: /en/guide/agenticrag_operators/
 
 ## Overview
 
-AgenticRAG Operators are a specialized suite of tools designed for agentic RAG (Retrieval-Augmented Generation) tasks, with a particular focus on generating question-and-answer (QA) samples from provided text to support RL-based agentic RAG training. These operators are primarily categorized into two groups: **Data Generation Operators (Generators)** and **Processing Operators (Processors)**.
+AgenticRAG Operators are a specialized suite of tools designed for agentic RAG (Retrieval-Augmented Generation) tasks, with a particular focus on generating question-and-answer (QA) samples from provided text to support RL-based agentic RAG training. These operators are primarily categorized into two groups: **Data Generation Operators (Generators)** and **Evaluating Operators (Evaluators)**.
 
 - 🚀 **Independent Innovation**: Core algorithms developed from scratch, filling existing algorithmic gaps or further improving performance, breaking through current performance bottlenecks.
 - **Open Source First**: First integration of this operator into mainstream community frameworks, facilitating use by more developers and achieving open-source sharing.
````
````diff
@@ -28,31 +28,19 @@ Data Generation Operators are responsible for producing RAG-related RL training
   </thead>
   <tbody>
   <tr>
-    <td class="tg-0pky">AutoPromptGenerator🚀</td>
-    <td class="tg-0pky">Prompt Synthesis</td>
-    <td class="tg-0pky">Generates prompts for question and answer creation tailored to specific content by leveraging large language models.</td>
-    <td class="tg-0pky">-</td>
-  </tr>
-  <tr>
-    <td class="tg-0pky">AtomicTaskGenerator✨</td>
+    <td class="tg-0pky">AgenticRAGAtomicTaskGenerator✨</td>
     <td class="tg-0pky">Atomic Task Generation</td>
     <td class="tg-0pky">Generates high-quality questions and verifiable answers based on the given text content.</td>
     <td class="tg-0pky">Refined and improved from <a href="https://github.com/OPPO-PersonalAI/TaskCraft" target="_blank">https://github.com/OPPO-PersonalAI/TaskCraft</a></td>
   </tr>
   <tr>
-    <td class="tg-0pky">QAGenerator✨</td>
-    <td class="tg-0pky">Question and Answer Generation</td>
-    <td class="tg-0pky">Produces questions and answers for given text content using large language models and generated prompts.</td>
-    <td class="tg-0pky">-</td>
-  </tr>
-  <tr>
-    <td class="tg-0pky">WidthQAGenerator✨</td>
+    <td class="tg-0pky">AgenticRAGWidthQAGenerator✨</td>
     <td class="tg-0pky">QA Breadth Expansion</td>
     <td class="tg-0pky">Combines multiple QA pairs to generate new, more difficult QA pairs.</td>
     <td class="tg-0pky">Refined and improved from <a href="https://github.com/OPPO-PersonalAI/TaskCraft" target="_blank">https://github.com/OPPO-PersonalAI/TaskCraft</a></td>
   </tr>
   <tr>
-    <td class="tg-0pky">DepthQAGenerator✨</td>
+    <td class="tg-0pky">AgenticRAGDepthQAGenerator✨</td>
     <td class="tg-0pky">QA Depth Expansion</td>
     <td class="tg-0pky">Expands individual QA pairs into new, more challenging QA pairs.</td>
     <td class="tg-0pky">Refined and improved from <a href="https://github.com/OPPO-PersonalAI/TaskCraft" target="_blank">https://github.com/OPPO-PersonalAI/TaskCraft</a></td>
````
````diff
@@ -75,52 +63,22 @@ Data evaluation operators are responsible for assessing reinforcement learning t
   </thead>
   <tbody>
   <tr>
-    <td class="tg-0pky">QAScorer✨</td>
-    <td class="tg-0pky">QA Scoring</td>
-    <td class="tg-0pky">Evaluates the quality of questions, answer consistency, answer verifiability, and downstream utility for QA pairs and their related content.</td>
-    <td class="tg-0pky">-</td>
-  </tr>
-  <tr>
-    <td class="tg-0pky">F1Scorer🚀</td>
+    <td class="tg-0pky">AgenticRAGQAF1SampleEvaluator🚀</td>
     <td class="tg-0pky">QA Scoring</td>
     <td class="tg-0pky">Assesses the verifiability of answers with and without the presence of gold documents in QA tasks.</td>
     <td class="tg-0pky">-</td>
   </tr>
   </tbody>
 </table>
 
-
-## Processing Operators
-
-Processing Operators are mainly tasked with choosing suitable data.
-
-<table class="tg">
-  <thead>
-  <tr>
-    <th class="tg-0pky">Name</th>
-    <th class="tg-0pky">Application Type</th>
-    <th class="tg-0pky">Description</th>
-    <th class="tg-0pky">Official Repository or Paper</th>
-  </tr>
-  </thead>
-  <tbody>
-  <tr>
-    <td class="tg-0pky">ContentChooser🚀</td>
-    <td class="tg-0pky">Content chooser</td>
-    <td class="tg-0pky">Selects a subset of content from a larger collection for further processing within the pipeline.</td>
-    <td class="tg-0pky">-</td>
-  </tr>
-  </tbody>
-</table>
-
 ## Operator Interface Usage Instructions
 
 Specifically, for operators that specify storage paths or call models, we provide encapsulated **model interfaces** and **storage object interfaces**. You can predefine model API parameters for operators in the following way:
 
 ```python
 from dataflow.llmserving import APILLMServing_request
 
-api_llm_serving = APILLMServing_request(
+llm_serving = APILLMServing_request(
     api_url="your_api_url",
     model_name="model_name",
     max_workers=5
````
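The hunk above renames the example's serving object from `api_llm_serving` to `llm_serving`. The `max_workers=5` argument suggests the serving object fans prompts out to the API concurrently. A rough, self-contained sketch of that dispatch pattern (the class and method names below are illustrative mocks, not DataFlow's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

class MockAPIServing:
    """Hypothetical stand-in for an API-backed LLM serving object.

    Shows the fan-out pattern a `max_workers` parameter implies: prompts
    are dispatched concurrently and results come back in input order.
    """

    def __init__(self, max_workers: int = 5):
        self.max_workers = max_workers

    def _call_api(self, prompt: str) -> str:
        # A real serving object would POST `prompt` to its api_url here.
        return f"response to: {prompt}"

    def generate_from_input(self, prompts: list[str]) -> list[str]:
        # One call per prompt; Executor.map preserves input order.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(self._call_api, prompts))

serving = MockAPIServing(max_workers=5)
outputs = serving.generate_from_input(["q1", "q2", "q3"])
```

Concurrency here only helps because each request is IO-bound; the ordering guarantee is what lets results be written back to the same rows they came from.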
````diff
@@ -140,44 +98,15 @@ from dataflow.utils.storage import FileStorage
 )
 ```
 
-The `api_llm_serving` and `self.storage` used in the following text are the interface objects defined here. Complete usage examples can be found in `test/test_agentic_rag.py`.
+The `llm_serving` and `self.storage` used in the following text are the interface objects defined here. Complete usage examples can be found in `DataFlow/dataflow/statics/pipelines/api_pipelines/agentic_rag_pipeline.py`.
 
 For parameter passing, the constructor of operator objects mainly passes information related to operator configuration, which can be configured once and called multiple times; while the `X.run()` function passes `key` information related to IO. Details can be seen in the operator description examples below.
 
 ## Detailed Operator Descriptions
 
 ### Data Generation Operators
 
-#### 1. AutoPromptGenerator
-
-**Function Description:** This operator is specifically designed to generate specialized prompts for creating question-and-answer pairs based on given text content.
-
-**Input Parameters:**
-
-- `__init__()`
-  - `llm_serving`: Large language model interface object to use (default: predefined value above)
-- `run()`
-  - `storage`: Storage interface object (default: predefined value above)
-  - `input_key`: Input text content field name (default: "text")
-  - `output_key`: Output generated prompt field name (default: "generated_prompt")
-
-**Key Features:**
-
-- Supports multiple types of text contents
-- Automatically generates suitable prompts
-
-**Usage Example:**
-
-```python
-prompt_generator = AutoPromptGenerator(api_llm_serving)
-result = prompt_generator.run(
-    storage = self.storage.step(),
-    input_key = "text",
-    output_key = "generated_prompt"
-)
-```
-
-#### 2. AtomicTaskGenerator
+#### 1. AgenticRAGAtomicTaskGenerator
 
 **Function Description:** This operator is used to generate appropriate high-quality questions and verifiable answers for the provided text content.
 
````
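The sentence retained in this hunk states DataFlow's calling convention: operator constructors receive configuration once, while `run()` receives only storage plus the IO field names. A minimal self-contained sketch of that convention (mock classes, not DataFlow's real ones):

```python
class MockStorage:
    """Holds rows as dicts; stands in for the object storage.step() returns."""
    def __init__(self, rows):
        self.rows = rows

class MockOperator:
    def __init__(self, suffix: str):
        # Configuration is fixed once at construction time...
        self.suffix = suffix

    def run(self, storage: MockStorage, input_key: str, output_key: str):
        # ...while run() only decides which fields to read and write.
        for row in storage.rows:
            row[output_key] = row[input_key] + self.suffix
        return storage.rows

op = MockOperator(suffix="!")  # configure once, call run() many times
storage = MockStorage([{"contents": "hello"}])
result = op.run(storage, input_key="contents", output_key="question")
```

Separating configuration from IO keys is what lets the same operator instance be reused across pipeline steps that store their text under different column names.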
````diff
@@ -203,47 +132,17 @@
 **Usage Example:**
 
 ```python
-atomic_task_gen = AtomicTaskGenerator(llm_serving=api_llm_serving)
-result = atomic_task_gen.run(
-    storage = self.storage.step(),
-    input_key = "text",
-)
-```
-
-#### 3. QAGenerator
-
-**Function Description:** This operator generates a pair of question and answer for a special content.
-
-**Input Parameters:**
-
-- `__init__()`
-  - `llm_serving`: Large language model interface object to use (default: predefined value above)
-- `run()`
-  - `storage`: Storage interface object (default: predefined value above)
-  - `input_key`: Input text content field name (default: "text")
-  - `prompt_key`: Output answer field name (default: "generated_prompt")
-  - `output_quesion_key`: Output answer field name (default: "generated_question")
-  - `output_answer_key`: Output answer field name (default: "generated_answer")
-
-**Key Features:**
-
-- Supports multiple types of text contents
-- Generates suitable pairs of questions and answers
-
-**Usage Example:**
+atomic_task_generator = AgenticRAGAtomicTaskGenerator(
+    llm_serving=self.llm_serving
+)
 
-```python
-qa_gen = QAGenerator(llm_serving=api_llm_serving)
-result = qa_gen.run(
+result = atomic_task_generator.run(
     storage = self.storage.step(),
-    input_key="text",
-    prompt_key="generated_prompt",
-    output_quesion_key="generated_question",
-    output_answer_key="generated_answer"
-)
+    input_key = "contents",
+)
 ```
 
-#### 4. WidthQAGenerator
+#### 2. AgenticRAGWidthQAGenerator
 
 **Function Description:** This operator is used to combine two QA pairs and generate a new question.
 
````
````diff
@@ -265,16 +164,19 @@ result = qa_gen.run(
 **Usage Example:**
 
 ```python
-width_qa_gen = WidthQAGenerator(llm_serving=api_llm_serving)
-result = width_qa_gen.run(
+width_qa_generator = AgenticRAGWidthQAGenerator(
+    llm_serving=self.llm_serving
+)
+
+result = width_qa_generator.run(
     storage = self.storage.step(),
     input_question_key = "question",
     input_identifier_key = "identifier",
     input_answer_key = "refined_answer"
 )
 ```
 
-#### 5. DepthQAGenerator
+#### 3. AgenticRAGDepthQAGenerator
 
 **Function Description:** This operator is used to generate deeper questions based on existing QA pairs.
 
````
````diff
@@ -294,8 +196,11 @@ result = width_qa_gen.run(
 **Usage Example:**
 
 ```python
-depth_qa_gen = DepthQAGenerator(llm_serving=api_llm_serving)
-result = depth_qa_gen.run(
+depth_qa_generator = AgenticRAGDepthQAGenerator(
+    llm_serving=self.llm_serving
+)
+
+result = depth_qa_generator.run(
     storage = self.storage.step(),
     input_key = "question",
     output_key = "depth_question"
````
````diff
@@ -304,48 +209,7 @@ result = depth_qa_gen.run(
 
 ### Data Evaluation Operators
 
-#### 1. QAScorer
-
-**Function Description:** This operator generates multiple evaluation scores for the produced question-and-answer pairs.
-
-**Input Parameters:**
-
-- `__init__()`
-  - `llm_serving`: Large language model interface object to use (default: predefined value above)
-- `run()`
-  - `storage`: Storage interface object (default: predefined value above)
-  - `input_question_key`: Input text content field name containing the generated questions (default: "generated_question")
-  - `input_answer_key`: Input text content field name containing the generated answers (default: "generated_answer")
-  - `output_question_quality_key`: Output field name for question quality grades (default: "question_quality_grades")
-  - `output_question_quality_feedback_key`: Output field name for detailed feedback on question quality (default: "question_quality_feedbacks")
-  - `output_answer_alignment_key`: Output field name for answer alignment grades (default: "answer_alignment_grades")
-  - `output_answer_alignment_feedback_key`: Output field name for detailed feedback on answer alignment (default: "answer_alignment_feedbacks")
-  - `output_answer_verifiability_key`: Output field name for answer verifiability grades (default: "answer_verifiability_grades")
-  - `output_answer_verifiability_feedback_key`: Output field name for detailed feedback on answer verifiability (default: "answer_verifiability_feedbacks")
-  - `output_downstream_value_key`: Output field name for downstream value grades (default: "downstream_value_grades")
-  - `output_downstream_value_feedback_key`: Output field name for detailed feedback on downstream value (default: "downstream_value_feedbacks")
-
-**Key Features:**
-
-- Generates multiple useful scores for further filtering
-
-**Usage Example:**
-
-```python
-qa_scorer = QAScorer(llm_serving=api_llm_serving)
-result = qa_scorer.run(
-    storage = self.storage.step(),
-    input_question_key="generated_question",
-    input_answer_key="generated_answer",
-    output_question_quality_key="question_quality_grades",
-    output_question_quality_feedback_key="question_quality_feedbacks",
-    output_answer_alignment_key="answer_alignment_grades",
-    output_answer_alignment_feedback_key="answer_alignment_feedbacks",
-    output_answer_verifiability_key="answer_verifiability_grades",
-)
-```
-
-#### 2. F1Scorer
+#### 1. AgenticRAGQAF1SampleEvaluator
 
 **Function Description:** This operator is used to evaluate the verifiability of QA tasks with and without support from gold documents.
 
````
````diff
@@ -366,44 +230,11 @@ result = qa_scorer.run(
 **Usage Example:**
 
 ```python
-f1_scorer = F1Scorer(llm_serving=api_llm_serving)
+f1_scorer = AgenticRAGQAF1SampleEvaluator()
 result = f1_scorer.run(
-    storage = self.storage.step(),
-    prediction_key = "refined_answer",
-    ground_truth_key = "golden_doc_answer",
-    output_key = "F1Score",
-)
-```
-
-### Processing Operators
-
-#### 1. ContentChooser
-
-**Function Description:** This operator identifies and selects representative text content from a set of text contexts.
-
-**Input Parameters:**
-
-- `init()`
-  - `num_samples`: Numuber of choosen samples
-  - `method`: The method used to select from the original text contents (default: 'random')
-  - `embedding_server`: The server to generate embeddings for text contents
-- `run()`
-  - `storage`: Storage interface object (default: predefined value above)
-  - `input_key`: Input text content field name (default: "text")
-
-**Key Features:**
-
-- Supports random choose and kmean choose
-- Supports a lot of embedding models
-
-**Usage Example:**
-
-```python
-embedding_serving = LocalModelLLMServing_vllm(hf_model_name_or_path="your_embedding_model_path", vllm_max_tokens=8192)
-
-content_chooser = ContentChooser(num_samples = 5, method = "kcenter", embedding_serving=embedding_serving)
-result = content_chooser.run(
-    storage = self.storage.step(),
-    input_key = "text",
-)
+    storage=self.storage.step(),
+    output_key="F1Score",
+    input_prediction_key="refined_answer",
+    input_ground_truth_key="golden_doc_answer"
+)
 ```
````
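The renamed `AgenticRAGQAF1SampleEvaluator` writes an F1 score under `output_key`. Assuming it is the token-level F1 commonly used for QA answer matching (an assumption; the commit does not show the metric's internals), the computation looks roughly like:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between two answer strings (SQuAD-style sketch;
    assumed metric, not DataFlow's exact code)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a perfect match; one empty scores zero.
        return float(pred_tokens == gold_tokens)
    # Tokens shared by prediction and gold answer, with multiplicity.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the Eiffel Tower", "Eiffel Tower")  # 0.8
```

Comparing the score computed with and without the gold document in context is what the operator's description calls assessing answer verifiability.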
