# ToolGRPO: Teaching LLMs to Use Tools Through Reinforcement Learning
Most LLMs are trained to generate text. But what if you want them to do things—run code, call APIs, search the web? That's where tool use comes in.
The challenge: teaching a model when and how to use tools, not just that tools exist.
## The Problem
Standard fine-tuning can teach an LLM the syntax of tool calls. But it doesn't teach judgment—knowing when a calculator is more reliable than mental math, or when to verify an answer by running code.
Reinforcement learning (RL) can help here. Instead of showing the model "here's the right answer," you let it explore and reward it when tools actually help solve the problem.
## Our Approach: GRPO + Tool Execution
We built ToolGRPO, a modified version of Unsloth's implementation of GRPO (Group Relative Policy Optimization) that supports tool execution during training.
### Why GRPO?
GRPO is efficient: unlike PPO, it needs no separate value (critic) network, and with a programmatic correctness check there is no learned reward model either. Instead, it samples a group of outputs per prompt and learns from their relative quality. This makes it practical to run on modest hardware.
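That group-relative scoring is simple to write down. A minimal sketch, using only the standard library (the function name and the epsilon are mine, not from the repo):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: score each completion relative to
    the other completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions of one prompt, two of them correct:
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantage, incorrect ones negative.
```

Because every advantage is relative to the group mean, the advantages of a group always sum to zero: the model is pushed toward its better samples and away from its worse ones.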
### The Modification
Standard GRPO generates text and scores it. We extended this to:
- Generate a response (which may include Python code)
- Execute any code blocks in a sandboxed interpreter
- Check if the execution result solves the problem
- Reward based on correctness
Input: "What is the 10th Fibonacci number?"

Model output:

```python
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))
```

Execution: `55` | Ground truth: `55` | Reward: `+1`
The model learns that writing and running code leads to correct answers more reliably than trying to compute in-context.
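The loop above can be sketched as a single reward function. `extract_code`, `correctness_reward`, and the bare `exec` here are my own simplifications, not ToolGRPO's actual code; real training runs the code through the sandbox discussed later:

```python
import contextlib
import io
import re

FENCE = "`" * 3  # three backticks, built here so this example's own fence stays intact

def extract_code(response: str):
    """Return the body of the first fenced python block, or None."""
    m = re.search(FENCE + r"python\n(.*?)" + FENCE, response, re.DOTALL)
    return m.group(1) if m else None

def correctness_reward(response: str, ground_truth: str) -> float:
    """+1 if the extracted code runs and prints the ground truth, else 0."""
    code = extract_code(response)
    if code is None:
        return 0.0  # no tool call at all
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # illustration only; never exec untrusted code unsandboxed
    except Exception:
        return 0.0  # crashing code earns nothing
    return 1.0 if buf.getvalue().strip() == ground_truth.strip() else 0.0
```

This reward is computed per completion, and the group-relative comparison then turns those raw scores into training signal.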
## Technical Setup
**Base model:** Qwen2.5-0.5B-Instruct
We chose a small model deliberately. Tool use is a capability that should transfer—if we can teach a 0.5B model to use tools effectively, the same approach should work (better) on larger models.
**Fine-tuning:** LoRA with rank 128
Full fine-tuning would be expensive and risks catastrophic forgetting. LoRA lets us add tool-use capability without destroying the model's existing knowledge.
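For reference, a rank-128 adapter roughly like the one described can be declared with Hugging Face's `peft` library. The target modules, alpha, and dropout below are typical choices, not the repo's exact settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
lora = LoraConfig(
    r=128,            # adapter rank from the post
    lora_alpha=128,   # illustrative; often set equal to r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights are trainable
```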
**Inference:** vLLM
GRPO requires generating many samples per training step. vLLM's batched inference makes this tractable.
**Tracking:** Weights & Biases
Essential for debugging RL training. We track reward curves, code execution success rates, and sample outputs.
## Dataset: AIME Problems
We train on AIME (American Invitational Mathematics Examination) problems. These are hard enough that the model can't reliably solve them through pure reasoning, but tractable if you write code.
Example problem:
> Find the number of positive integers n ≤ 1000 such that n² - n is divisible by some but not all of the integers 2, 3, 4, 5.
The model needs to:
1. Recognize this is easier to brute-force than prove analytically
2. Write correct Python to check each n
3. Count and return the result
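For the quoted problem, the brute force the model is expected to write is only a few lines. This is my own reference solution, not a training sample (note that `n = 1` gives `n² - n = 0`, which is divisible by all four, so it is excluded):

```python
count = 0
for n in range(1, 1001):
    v = n * n - n
    divisible = [v % k == 0 for k in (2, 3, 4, 5)]
    # "some but not all" of 2, 3, 4, 5 must divide n^2 - n
    if any(divisible) and not all(divisible):
        count += 1
print(count)  # → 866
```

Since `n² - n = n(n - 1)` is always even, the `any` condition is always satisfied; the count really hinges on excluding the n for which all four divisors work.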
## What We Learned
### 1. Reward shaping matters more than you'd think
Binary rewards (correct/incorrect) work but converge slowly. Adding partial credit for "code runs without errors" and "output is a number" helped significantly.
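A minimal sketch of that shaping, with weights that are purely illustrative rather than tuned values:

```python
def shaped_reward(ran_ok: bool, output: str, ground_truth: str) -> float:
    """Partial credit on top of the binary correctness signal."""
    score = 0.0
    if ran_ok:
        score += 0.2  # code executed without errors
        if output.strip().lstrip("+-").isdigit():
            score += 0.2  # output is at least an integer
        if output.strip() == ground_truth.strip():
            score += 1.0  # exactly correct
    return score
```

The intermediate signals give the policy a gradient toward "write runnable code that prints a number" long before it reliably produces correct answers.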
### 2. The model learns to verify
An unexpected behavior: after training, the model sometimes generates code to *check* its analytical answer rather than compute from scratch. It's learning that tools are for verification, not just computation.
### 3. Sandbox security is non-negotiable
Early versions had the model discover `os.system()`. A proper sandbox (restricted imports, execution timeouts, memory limits) is essential.
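A minimal version of that isolation is a separate interpreter process with a hard timeout; assume memory limits and an import allowlist are layered on top. This skeleton is my own sketch, not the repo's sandbox:

```python
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0):
    """Run untrusted code in a fresh interpreter with a hard timeout.
    Only the skeleton: a real sandbox also needs memory limits, an
    import allowlist, and ideally OS-level isolation (container/seccomp).
    Returns (succeeded, stdout)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores user site/env
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, ""
```

Running in a subprocess (rather than `exec` in the trainer's own process) means an infinite loop or a crash in generated code can't take down the training run.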
### 4. Small models can learn tool use
The 0.5B model genuinely learns when to use code vs. when to answer directly. It's not just cargo-culting—it makes reasonable decisions about tool applicability.
## Current Limitations
- **Single-turn only:** The model generates one response. Multi-turn tool use (run code, see output, iterate) is on the roadmap.
- **Python only:** We'd like to add web search, API calls, and other tools.
- **Math-focused:** AIME is a specific domain. Generalization to other tool-use scenarios needs more work.
## What's Next
The roadmap includes:
- **Multi-turn interactions:** Let the model see execution results and iterate
- **Multiple tool types:** Code execution + web search + calculators
- **Tool selection:** Teach the model to choose the right tool, not just use tools
## Try It Yourself
The code is Apache 2.0 licensed: [github.com/okaybroda/ToolGRPO](https://github.com/okaybroda/ToolGRPO)
You'll need:
- A GPU with ~16GB VRAM (for the 0.5B model with LoRA)
- Python 3.10+
- Patience for RL training
---
Tool use is one of the most promising directions for making LLMs genuinely useful. If you're working on similar problems or want to discuss approaches, [reach out](/contact).