GAIR-NLP/ToRL
ToRL: Scaling Tool-Integrated RL

📄 Paper   |   🌐 Dataset   |   📘 Model


Performance comparison of ToRL versus baseline models (16-step moving average). Both plots show AIME24 accuracy (%) against training steps for the 1.5B and 7B models. In both cases, ToRL (ours) significantly outperforms both the baseline without tool use and Qwen-2.5-Math-Instruct-TIR, achieving up to 12% (1.5B) and 14% (7B) higher accuracy.


Emergent cognitive behavior during training. ToRL first cross-validates the tool's output against its own reasoning results. Upon detecting inconsistencies, it engages in reflection and further verification through additional tool calls.

Releases

[2025/03/28] We're releasing the following components:

  • 🚀 Training: Complete implementation of our training pipeline
  • 🔥 ToRL Dataset: Our curated dataset of 28k mathematical questions
  • 🤖 ToRL Model: Models trained with ToRL

Overview

This repository presents ToRL (Tool-Integrated Reinforcement Learning), a framework that challenges traditional approaches to tool integration in language models by enabling LLMs to autonomously discover and refine tool usage strategies through reinforcement learning. Unlike prior methods constrained by supervised fine-tuning or predefined tool patterns, ToRL demonstrates that exploration-driven learning with computational tools can unlock emergent cognitive behaviors and achieve state-of-the-art performance on complex reasoning tasks. Notably, our approach operates directly from base models without imitation learning, achieving 43.3% accuracy on AIME2024 with a 7B model—matching the performance of larger 32B models trained with RL.

Key Findings

  • Autonomous Tool Integration: Models learn when and how to invoke tools (e.g., code interpreters) through RL-driven exploration, eliminating dependency on human-curated tool usage patterns.
  • Emergent Cognitive Abilities:
    • Self-correction by cross-validating code execution results with reasoning steps
    • Adaptive strategy selection between tool-based and pure-reasoning approaches
    • Self-regulation of ineffective tool calls without explicit supervision
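The tool-call loop these behaviors emerge from can be pictured as follows. This is a minimal, hypothetical sketch, not ToRL's actual implementation: the model emits text that may contain a fenced Python block, the framework executes it, and the result (or error message) is appended as an observation so the model can verify or self-correct. All names, the output-tag format, and the local `exec`-based execution (a real system would dispatch to the sandbox) are illustrative assumptions.

```python
import contextlib
import io
import re

# Matches the first fenced Python block in a model response.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_tool_call(model_output: str) -> str:
    """Execute the first code block, if any, and append its result."""
    match = CODE_BLOCK.search(model_output)
    if match is None:
        return model_output  # pure-reasoning step, no tool call
    code = match.group(1)
    # Illustration only: exec locally and capture stdout. A real
    # rollout would send `code` to an isolated sandbox instead.
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        result = buf.getvalue().strip()
    except Exception as exc:
        # Errors are fed back verbatim so the model can self-correct.
        result = f"Error: {exc}"
    return model_output + f"\n```output\n{result}\n```\n"

step = "Compute 12*7.\n```python\nprint(12 * 7)\n```"
print(run_tool_call(step))
```

Feeding execution errors back as observations, rather than discarding failed calls, is what gives the policy the signal needed to learn the self-correction behavior described above.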

ToRL Performance

1.5B Model Performance across challenging mathematical benchmarks:

| Model | SFT/RL | Tool | AIME24 | AIME25 | MATH500 | Olympiad | AMC23 | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B-Instruct | RL | ✗ | 10.0 | 10.0 | 66.0 | 31.0 | 62.5 | 35.9 |
| Qwen2.5-Math-1.5B-Instruct-TIR | RL | ✓ | 13.3 | 13.3 | 73.8 | 41.3 | 55.0 | 41.3 |
| ToRL-1.5B (ours) | RL | ✓ | 26.7 (+13.3) | 26.7 (+13.3) | 77.8 (+3.0) | 44.0 (+2.7) | 67.5 (+5.0) | 48.5 (+7.2) |

7B Model Performance across challenging mathematical benchmarks:

| Model | SFT/RL | Tool | AIME24 | AIME25 | MATH500 | Olympiad | AMC23 | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Instruct | RL | ✗ | 10.0 | 16.7 | 74.8 | 32.4 | 65.0 | 39.8 |
| Qwen2.5-Math-7B-Instruct-TIR | RL | ✓ | 26.7 | 16.7 | 78.8 | 45.0 | 70.0 | 47.4 |
| SimpleRL-Zero | RL | ✗ | 33.3 | 6.7 | 77.2 | 37.6 | 62.5 | 43.5 |
| rStar-Math-7B | SFT | ✓ | 26.7 | - | 78.4 | 47.1 | 47.5 | - |
| Eurus-2-7B-PRIME | RL | ✗ | 26.7 | 13.3 | 79.2 | 42.1 | 57.4 | 43.1 |
| ToRL-7B (ours) | RL | ✓ | 43.3 (+10.0) | 30.0 (+13.3) | 82.2 (+3.0) | 49.9 (+2.8) | 75.0 (+5.0) | 62.1 (+14.7) |

Cognitive Behavior via RL Scaling

In the left figure, the code initially generated by the model hits an execution error; the model then corrects it, and the revised code executes successfully.

In the right figure, the model first derives an incorrect result through natural-language reasoning, then discovers the error during code verification and corrects it.


Quick Start

Preparing the Sandbox Environment

Install and launch SandboxFusion following the instructions at https://github.com/bytedance/SandboxFusion.

```bash
# Install the sandbox in its own conda environment
# to avoid dependency conflicts.
# The environment must be named "sandbox-runtime".
conda create -n sandbox-runtime python==3.11
conda activate sandbox-runtime
pip install -r runtime/python/requirement.txt

pip install poetry
poetry install
mkdir -p docs/build
make run-online
```

Replace the `sandbox_url` on line 109 of `verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py` with the URL of your sandbox instance.
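Once the sandbox is running, it can be queried over HTTP. Below is a minimal sketch of building a request to SandboxFusion's code-execution endpoint and reading back stdout. The endpoint path (`/run_code`), the request fields (`code`, `language`), and the response shape (`run_result.stdout`) are assumptions based on SandboxFusion's API; verify them against the version you installed.

```python
import json
from urllib import request

# Assumed endpoint; adjust host/port to your deployment.
SANDBOX_URL = "http://localhost:8080/run_code"

def build_payload(code: str, language: str = "python") -> bytes:
    """Serialize a run_code request body (field names assumed)."""
    return json.dumps({"code": code, "language": language}).encode()

def extract_stdout(response_json: dict) -> str:
    """Pull stdout out of the (assumed) run_result response field."""
    return (response_json.get("run_result") or {}).get("stdout", "")

def run_code(code: str) -> str:
    """Send code to the sandbox and return its captured stdout."""
    req = request.Request(
        SANDBOX_URL,
        data=build_payload(code),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return extract_stdout(json.load(resp))

if __name__ == "__main__":
    # Requires a running sandbox at SANDBOX_URL.
    print(run_code("print(1 + 1)"))
```

During rollout, the same request is issued for every tool call the model emits, so keeping the sandbox on a low-latency local endpoint matters for training throughput.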

Environment setup

```bash
pip install -r requirements.txt
pip install wandb jsonlines math-verify hydra-core==1.4.0.dev1 sortedcontainers "qwen-agent[code_interpreter]" "qwen-agent[python_executor]"
```

Training

Execute `bash scripts/torl_1.5b` to run ToRL training.

Acknowledgements

Our work builds upon the insightful technical reports from the DeepSeek-R1 and Kimi k1.5 teams. We thank the Qwen-Math team for their open-source models; the creators of VeRL and vLLM for the reinforcement learning framework and inference infrastructure, respectively, that enabled this research; and the Qwen-Agent and SandboxFusion teams for providing the tools our work relies on.

Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{li2025torlscalingtoolintegratedrl,
      title={ToRL: Scaling Tool-Integrated RL},
      author={Xuefeng Li and Haoyang Zou and Pengfei Liu},
      year={2025},
      eprint={2503.23383},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.23383},
}
```
