Authors: Xiaojun Guo*, Ang Li*, Yifei Wang*, Stefanie Jegelka, Yisen Wang‡
*Equal Contribution. ‡Correspondence.
Our dataset contains 50 graph-theoretic tasks (100k train, 5k test) categorized by difficulty. The table below compares the base model (Qwen2.5-7B-Instruct) with our RL-trained G1-7B model. G1-7B consistently outperforms the base model, with a +37.07% average accuracy gain.
Difficulty | Tasks | Task Ratio | Base Model Acc | G1-7B Acc |
---|---|---|---|---|
Easy | Node Number, Dominating Set, Common Neighbor, Edge Number, Neighbor, BFS, Has Cycle, DFS, Minimum Spanning Tree, Edge Existence, Is Regular, Degree, Is Tournament, Density | 29.16% | 57.16% | 95.07% |
Medium | Adamic Adar Index, Clustering Coefficient, Connected Component Number, Bipartite Maximum Matching, Local Connectivity, Jaccard Coefficient, Min Edge Covering, Is Eularian, Degree Centrality, Is Bipartite, Resource Allocation Index | 22.91% | 42.55% | 88.91% |
Hard | Max Weight Matching, Closeness Centrality, Traveling Salesman Problem, Strongly Connected Number, Shortest Path, Center, Diameter, Barycenter, Radius, Topological Sort, Periphery, Betweenness Centrality, Triangles, Average Neighbor Degree, Harmonic Centrality, Bridges | 33.33% | 18.87% | 50.44% |
Challenging | Isomorphic Mapping, Global Efficiency, Maximal Independent Set, Maximum Flow, Wiener Index, Hamiltonian Path, Min Vertex Cover | 14.58% | 3.29% | 23.57% |
Abstract
Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training samples, and 5k test samples, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x its size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.

1️⃣ Erdős: The Largest Graph-Theoretic Benchmark
We introduce Erdős, the largest and most comprehensive graph-theoretic dataset to date, featuring 50 diverse graph tasks of varying complexity built on real-world graphs, with a total of 100,000 samples. Designed for robust evaluation, Erdős provides a reliable platform for training and benchmarking graph reasoning models.
2️⃣ First RL-Driven Graph Reasoning Model
We pioneer the application of reinforcement learning (RL) to enhance LLMs on graph reasoning. Our G1 models achieve up to 46% improvement over baselines on Erdős, with the 7B variant matching OpenAI’s o3-mini and the 3B model surpassing Qwen2.5-72B-Instruct by notable margins.
3️⃣ Strong Generalization Without Compromise
G1 exhibits zero-shot generalization on unseen graph tasks, improving performance on other graph reasoning benchmarks (GraphWiz, GraphArena) and real-world graphs (Cora, PubMed). Crucially, it preserves general reasoning ability (GSM8K, MATH, MMLU-Pro), proving its versatility.
Our models and datasets are released in the Hugging Face collection PKU-ML/G1. In detail:
We release our dataset Erdős on Hugging Face at PKU-ML/Erdos.
We release our G1-series models on Hugging Face:
- G1-3B: PKU-ML/G1-3B
- G1-7B: PKU-ML/G1-7B
- G1-Zero-3B: PKU-ML/G1-Zero-3B
- G1-Zero-7B: PKU-ML/G1-Zero-7B
- G1-Direct-SFT-3B: PKU-ML/G1-Direct-SFT-3B
- G1-Direct-SFT-7B: PKU-ML/G1-Direct-SFT-7B
- G1-CoT-SFT-3B: PKU-ML/G1-CoT-SFT-3B
- G1-CoT-SFT-7B: PKU-ML/G1-CoT-SFT-7B
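For instance, the dataset and any of the G1 checkpoints can be loaded with the Hugging Face datasets and transformers libraries. This is a minimal sketch; the "train" split name is an assumption:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Erdős dataset (the "train" split name is an assumption).
erdos = load_dataset("PKU-ML/Erdos", split="train")
print(erdos[0])

# Any of the G1 checkpoints listed above can be loaded the same way.
tokenizer = AutoTokenizer.from_pretrained("PKU-ML/G1-7B")
model = AutoModelForCausalLM.from_pretrained("PKU-ML/G1-7B", torch_dtype="auto")
```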
We provide example code for graph reasoning with our G1 models. Run the following:
python example.py
In this example, the task is to determine the degree centrality of a node in a given graph:
"The task is to determine the degree centrality of a node in the graph. Degree centrality for a node is the fraction of nodes it is connected to.
Here is an undirected graph containing nodes from 1 to 15. The edges are: (1, 15), (15, 11), (2, 3), (2, 6), (3, 6), (3, 7), (6, 7), (6, 8), (7, 8), (7, 14), (4, 10), (10, 5), (10, 12), (8, 14), (8, 9), (12, 11), (12, 13).
Question: What is the degree centrality of node 2 in the graph? You need to format your answer as a float number."
The response of G1-7B gives a correct answer of 0.1429 with clear steps:
"To determine the degree centrality of node 2 in the given undirected graph, we first need to count the number of edges connected to node 2. The degree centrality of a node is defined as the fraction of nodes it is connected to, which can be calculated by dividing the number of edges connected to the node by the total number of nodes minus one (since a node does not count itself).
Given the edges: - (1, 15) - (15, 11) - (2, 3) - (2, 6) - (3, 6) - (3, 7) - (6, 7) - (6, 8) - (7, 8) - (7, 14) - (4, 10) - (10, 5) - (10, 12) - (8, 14) - (8, 9) - (12, 11) - (12, 13). We identify the edges connected to node 2: - (2, 3) - (2, 6). This gives node 2 a degree of 2, since it is connected by two edges.
Next, we calculate the degree centrality by dividing the degree of node 2 by the total number of nodes minus one. The total number of nodes in the graph is 15.
Degree Centrality = $\frac{\text{Degree of node 2}}{15 - 1} = \frac{2}{14} = \frac{1}{7} \approx 0.1429$. Therefore, the final answer is: $\boxed{0.1429}$"
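As an independent check (not part of the G1 pipeline), the same quantity can be recomputed with networkx, which defines degree centrality exactly as degree / (n - 1):

```python
import networkx as nx

# Rebuild the graph from the prompt and recompute the answer independently.
edges = [(1, 15), (15, 11), (2, 3), (2, 6), (3, 6), (3, 7), (6, 7), (6, 8),
         (7, 8), (7, 14), (4, 10), (10, 5), (10, 12), (8, 14), (8, 9),
         (12, 11), (12, 13)]
G = nx.Graph()
G.add_nodes_from(range(1, 16))  # nodes 1..15
G.add_edges_from(edges)

# networkx defines degree centrality as degree / (n - 1), matching the prompt.
print(round(nx.degree_centrality(G)[2], 4))  # 0.1429
```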
Our implementation is based on the verl library developed by the ByteDance Seed team.
- Requirements:
- Python: Version >= 3.9
- CUDA: Version >= 12.1
verl supports various backends. Currently, the following configurations are available: FSDP and Megatron-LM (optional) for training, and SGLang, vLLM, and TGI for rollout generation. For more information, please check the verl documentation.
- To install the dependencies, we recommend using a fresh conda environment:
conda create -n verl python==3.10
conda activate verl
- Then, execute the install.sh script provided in the repository:
# Make sure you have activated verl conda env
# If you need to run with megatron
bash install.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash install.sh
If you encounter any issues during installation, please refer to the Installation Guide provided by verl. If problems persist, don't hesitate to report them to us.
For rejection sampling, we randomly sample 100 examples per task from Erdős and use Qwen2.5-32B-Instruct to generate k = 8 responses per example with a temperature of 1.0. We keep only responses with correct answers, and if necessary, resample from Erdős with a different random seed and repeat the process.
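Conceptually, the filtering step keeps only query-response pairs whose final answer is correct and queues the rest for resampling. The sketch below illustrates that loop; generate_fn and check_fn are hypothetical hooks, and the "question"/"answer" field names are assumptions, not the actual API of rejection_sampling.py:

```python
# Sketch of the rejection-sampling filter: keep only query-response pairs whose
# final answer is judged correct; everything else is queued for resampling with
# a different seed. generate_fn and check_fn are hypothetical hooks.
def rejection_sample(examples, generate_fn, check_fn, k=8, temperature=1.0):
    kept, to_resample = [], []
    for ex in examples:
        responses = generate_fn(ex["question"], n=k, temperature=temperature)
        correct = [r for r in responses if check_fn(r, ex["answer"])]
        if correct:
            # Keep one correct chain-of-thought per example as SFT data.
            kept.append({"question": ex["question"], "response": correct[0]})
        else:
            to_resample.append(ex)
    return kept, to_resample
```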
To run rejection sampling, execute the following command:
cd evaluation/Erdos
python rejection_sampling.py \
--n_sample 8 \
--model Qwen/Qwen2.5-32B-Instruct \
--save_path data/rft_data \
--temperature 1.0 \
--num_gpu 4 \
--seed 42
- n_sample: Number of responses to generate per example (k = 8)
- model: Path to the generation model
- save_path: Output directory for valid query-response pairs
- temperature: Sampling temperature (fixed at 1.0 for diversity)
Note: We have uploaded the rejection sampled dataset to PKU-ML/Erdos-CoT for direct usage.
Convert the data into the format required for SFT training:
python preprocess_sft_data.py \
--data_source PKU-ML/Erdos-CoT \
--save_dir datasets/graph_cot_sft \
--tokenizer_path Qwen/Qwen2.5-3B-Instruct
- data_source: The original data directory
- save_dir: The output directory for the converted data
- tokenizer_path: The path to the tokenizer
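As a rough illustration of what this conversion does, the tokenizer's chat template can be applied to each query-response pair so that SFT sees the same prompt format used at inference time. This is only a sketch; the "question"/"response" field names are assumptions, and the actual preprocess_sft_data.py may work differently:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

def to_sft_text(example):
    # Field names "question" / "response" are assumptions for illustration.
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # Render the conversation with the model's chat template so SFT sees the
    # same prompt format used at inference time.
    return tokenizer.apply_chat_template(messages, tokenize=False)
```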
Run supervised finetuning with the following script:
bash run_sft.sh
Key customizable parameters for the training script:
- save_path: Output directory for the trained model
- data.train_files and data.val_files: Paths to the processed dataset
- model.partial_pretrain: The path to the base model
Follow these steps to reproduce our G1 implementation:
- Download the dataset.
  Download the dataset Erdős from PKU-ML/Erdos and save it locally (e.g., data/erdos).
- Preprocess the dataset for RL training.
  Run the preprocessing script to convert the dataset format:
python preprocess_graph.py --data_source PKU-ML/Erdos --local_dir data/erdos
- local_dir: Output directory for the processed dataset.
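Before launching training, it can help to sanity-check the processed output. The sketch below assumes the script writes parquet files (e.g., train.parquet) under local_dir; adjust the path to the actual output:

```python
import pandas as pd

# Assumed output filename under local_dir; adjust to the script's actual output.
df = pd.read_parquet("data/erdos/train.parquet")
print(df.columns.tolist())  # fields consumed by the RL trainer (e.g., the prompt)
print(df.iloc[0])           # one processed example
```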
- Launch RL Training.
  Execute the training script (3B model as an example):
bash run_3B.sh
Configuration notes for the training script:
- Key customizable parameters:
  - SAVE_DIR: Output directory for the trained model.
  - graph_train_path and graph_test_path: Paths to the processed dataset.
  - Logging: Defaults to TensorBoard. To use Weights & Biases, set trainer.logger = ['console','wandb'].
- GPU requirements:
  - Our paper used 8×A800 GPUs. With limited GPU resources, reduce the following parameters (may affect performance); see the Performance Tuning Guide for more instructions:
    - actor_rollout_ref.actor.ppo_mini_batch_size
    - actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
    - actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu
    - actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu
  - Set trainer.n_gpus_per_node to your actual GPU count.
Follow these steps to evaluate G1 models and baseline methods:
- Convert RL-trained Checkpoints to HuggingFace Format
  Merge the model checkpoints into a HuggingFace-compatible format:
python model_merger.py \
    --backend fsdp \
    --hf_model_path Qwen/Qwen2.5-3B-Instruct \
    --local_dir models/G1_3B/global_step_300/actor \
    --target_dir our_models/G1_3B
- hf_model_path: Path to the original base model.
- local_dir: Directory containing RL-trained checkpoints.
- target_dir: Output directory for the converted model.
- Configure Tokenizer Files
  Copy required tokenizer files from the base model:
cp models/Qwen2.5-3B-Instruct/tokenizer_config.json our_models/G1_3B/tokenizer_config.json
cp models/Qwen2.5-3B-Instruct/tokenizer.json our_models/G1_3B/tokenizer.json
cp models/Qwen2.5-3B-Instruct/vocab.json our_models/G1_3B/vocab.json
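At this point the converted directory should load like any Hugging Face causal LM. A quick, optional sanity check (a sketch using the example paths above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Paths follow the example above; adjust to wherever you merged the checkpoint.
model = AutoModelForCausalLM.from_pretrained("our_models/G1_3B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("our_models/G1_3B")
print(model.config.model_type, tokenizer.__class__.__name__)
```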
- Run Comprehensive Evaluation
  We provide evaluation scripts for multiple benchmarks:

Benchmark Type | Included Tests |
---|---|
Graph Reasoning | Erdős, GraphWiz, GraphArena |
Graph Real-World Tasks | Node Classification, Link Prediction |
General Reasoning | GSM8K, MATH, MMLU-Pro |

For detailed instructions, see the evaluation README.md.
To adapt G1 for your needs, we recommend modifying these key files from our verl-based implementation:
trainer/ # Core training components
- main_ppo.py # Main training entry point (handles PPO loop)
- ppo/reward.py # Reward computation and shaping logic
workers/ # Parallel processing components
- reward_manager/__init__.py # Reward calculation interface
- reward_manager/graph.py # Graph-specific reward functions
utils/ # Supporting utilities
- reward_score/__init__.py # Reward normalization/scaling
- reward_score/graph.py # Graph-specific scoring metrics
graph_meta_data.json # Dataset metadata and configuration
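For example, a custom reward for a new graph task typically extracts the model's final boxed answer and compares it with the ground truth. The sketch below is illustrative only and not the exact logic in utils/reward_score/graph.py:

```python
import re

def graph_reward(response: str, ground_truth: str, tol: float = 1e-4) -> float:
    """Score 1.0 if the final \\boxed{...} answer matches the ground truth, else 0.0.

    Illustrative only; the real scoring in utils/reward_score/graph.py may handle
    more answer formats (node lists, paths, rounded floats, etc.).
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    pred = matches[-1].strip()
    try:
        # Numeric answers (e.g., degree centrality) are compared with a tolerance.
        return float(abs(float(pred) - float(ground_truth)) <= tol)
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return float(pred == ground_truth.strip())
```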
If you find this work useful, please cite our paper:
@article{guo2025g1,
title={G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning},
author={Guo, Xiaojun and Li, Ang and Wang, Yifei and Jegelka, Stefanie and Wang, Yisen},
journal={arXiv preprint arXiv:2505.18499},
year={2025}
}