add deep research agent benchmark script. #2135

Merged: 12 commits into main from deep_research_agent_benchmark, Aug 20, 2025

Conversation

lkk12014402 (Collaborator)

Description

Add a benchmark script for evaluating the accuracy of the Deep Research Agent.

@Copilot review requested due to automatic review settings, July 11, 2025 09:05

github-actions bot commented Jul 11, 2025

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

  • DeepResearchAgent/requirements.txt

Copilot AI (Contributor) left a comment

Pull Request Overview

Adds a benchmarking harness for measuring the accuracy of the Deep Research Agent by integrating dataset loaders, an LLM-based scoring function, and CLI support.

  • Wraps the agent’s raw output in a JSON field ("answer") in research_agent.py.
  • Introduces eval.py to load questions, invoke the agent, score answers via an LLM judge, and save results.
  • Updates documentation under benchmark/accuracy and the main README to cover setup and usage.
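
For orientation, the pieces above fit together roughly as follows. This is a minimal sketch, not the actual eval.py: only the {"answer": ...} payload and the function names process_single_question and run_benchmark come from the PR and the review comments below; the endpoint shape, the llm_judge placeholder, and the result layout are assumptions.

```python
# Minimal sketch of the flow described above: load questions, query the agent
# service, grade each answer, save results. The endpoint shape and judge logic
# are assumptions; only the {"answer": ...} payload and the function names are
# taken from this PR and its review comments.
import json

import requests


def llm_judge(question: str, answer: str, reference: str) -> bool:
    # Placeholder for the LLM-based scoring the PR describes; the real judge
    # prompts a model to grade the answer against the reference.
    return reference.strip().lower() in answer.strip().lower()


def process_single_question(question: dict[str, str], service_url: str) -> dict:
    # research_agent.py now wraps the agent's raw output as {"answer": ...}.
    resp = requests.post(service_url, json={"question": question["question"]}, timeout=600)
    answer = resp.json()["answer"]
    return {
        "question": question["question"],
        "answer": answer,
        "correct": llm_judge(question["question"], answer, question["reference_answer"]),
    }


def run_benchmark(questions: list[dict[str, str]], service_url: str, out_path: str) -> None:
    results = [process_single_question(q, service_url) for q in questions]
    accuracy = sum(r["correct"] for r in results) / len(results)
    with open(out_path, "w") as f:
        json.dump({"accuracy": accuracy, "results": results}, f, indent=2)
```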

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| DeepResearchAgent/research_agent.py | Return payload now wrapped as `{"answer": ...}` |
| DeepResearchAgent/benchmark/accuracy/eval.py | New benchmark script for accuracy evaluation |
| DeepResearchAgent/benchmark/accuracy/README.md | Instructions for running the accuracy benchmark |
| DeepResearchAgent/README.md | Added note on configuring deep_researcher.yaml in setup |

Comments suppressed due to low confidence (2)

DeepResearchAgent/benchmark/accuracy/eval.py:93

  • [nitpick] There are no unit tests covering key functions like load_questions, process_single_question, or run_benchmark. Adding tests will help ensure correctness when loading datasets and computing accuracy.
def load_questions(dataset_names: list[str] | None = None) -> list[dict[str, str]]:
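
A hedged sketch of what such tests could look like, assuming eval.py is importable and that dataset names narrow the returned questions; the assertions are illustrative, based only on the signature shown above, not on the repository's actual test suite.

```python
# Illustrative pytest sketch for eval.py, not the repository's test suite.
# Both the importability of eval.py and the filtering behavior of
# dataset_names are assumptions.
from eval import load_questions


def test_load_questions_returns_list_of_dicts():
    questions = load_questions()
    assert isinstance(questions, list)
    assert all(isinstance(q, dict) for q in questions)


def test_dataset_name_filters_questions():
    # A named dataset should yield no more questions than loading everything.
    subset = load_questions(["togethercomputer/together-search-bench"])
    assert len(subset) <= len(load_questions())
```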

DeepResearchAgent/benchmark/accuracy/eval.py:334

  • The args.agent_config field is referenced in metadata but no --agent-config argument is defined in the parser. Consider adding a corresponding parser argument or renaming this field to match --service-url or another existing flag.
            "agent_config": args.agent_config,

@joshuayao added this to the v1.4 milestone, Jul 14, 2025
minmin-intel (Collaborator) left a comment

Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

lkk12014402 (Collaborator, Author)

> Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

No problem, I will.

lkk12014402 (Collaborator, Author)

> Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

Added togethercomputer/together-search-bench benchmark accuracy data.
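
For context, a run against that dataset might look like the following; the flag names are hypothetical, since the real CLI is defined in eval.py and documented in benchmark/accuracy/README.md.

```bash
# Hypothetical invocation; consult benchmark/accuracy/README.md for the real flags.
python DeepResearchAgent/benchmark/accuracy/eval.py \
    --service-url http://localhost:8022 \
    --datasets togethercomputer/together-search-bench
```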

@joshuayao added this to OPEA, Aug 13, 2025
@joshuayao moved this to In review in OPEA, Aug 13, 2025
joshuayao (Collaborator)

Hi @lkk12014402 could you please help check the CI failures? Thanks.

@joshuayao moved this from In review to In progress in OPEA, Aug 14, 2025
@chensuyue merged commit fe1527f into main, Aug 20, 2025
19 checks passed
@chensuyue deleted the deep_research_agent_benchmark branch, August 20, 2025 03:37
@github-project-automation bot moved this from In progress to Done in OPEA, Aug 20, 2025
Labels: None yet
Projects: OPEA (Status: Done)
7 participants