Recursive Agents · Reinforcement Learning

Recursive Agent Optimization (RAO)

Training LLM agents to use recursion as a learned inference-time primitive.

Apurva Gandhi1 · Satyaki Chakraborty2 · Xiangjun Wang2 · Aviral Kumar1 · Graham Neubig1
1Carnegie Mellon University  ·  2Amazon AGI Labs
Recursive agent inference. An agent can decide to spawn and delegate sub-problems to sub-agents, each of which can recursively delegate further. This induces a dynamically structured execution tree where each node corresponds to one agent instance attempting an assigned task. RAO optimizes a single policy to act across all levels of this tree, teaching it when and how to delegate.

TL;DR

The idea

How do we train a model to best take advantage of sub-agents at inference time? RAO is an end-to-end reinforcement learning approach for training a single LLM to spawn, delegate to, and coordinate with recursive copies of itself — turning recursive inference into a learned capability.

Why train for recursion?

01

Larger effective memory

Each sub-agent receives a fresh context window, expanding total usable working memory beyond the model's native limit.

02

Divide & conquer

Hard problems can be broken into easier sub-problems, with each sub-agent specializing on a piece of the task.

03

Parallelism

Independent sub-tasks can run concurrently, reducing wall-clock latency for many real-world tasks.

Abstract

We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves, recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, scale to tasks that exceed the model's context window, generalize to tasks much harder than those seen during training, and achieve lower wall-clock time than single-agent systems.

How RAO Works

Recursive agent inference

A recursive agent is implemented as an extension of an agent that interleaves natural-language reasoning with code execution in a Python REPL. We extend the action space with an asynchronous primitive:

async launch_subagent(goal, ...) -> Any

which launches a new instance of the same policy on a delegated sub-task. Because the return type is unrestricted, parents can request structured outputs in whatever format is most useful. Sub-agents can be launched sequentially (when later steps depend on earlier results) or concurrently via standard asyncio when sub-tasks are independent.
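To make this concrete, the snippet below sketches how a parent agent's REPL code might mix concurrent and sequential delegation. Only the launch_subagent primitive comes from the description above; the stub implementation, goal strings, and the solve helper are illustrative assumptions, not the actual system.

import asyncio
from typing import Any

# Minimal, self-contained sketch (not the authors' implementation). In the real
# system, launch_subagent is provided by the agent harness and runs a fresh copy
# of the same policy on the delegated goal; here it is stubbed so the example
# runs on its own.
async def launch_subagent(goal: str, **kwargs: Any) -> Any:
    await asyncio.sleep(0)  # yield control, as a real asynchronous call would
    return f"<sub-agent result for: {goal[:30]}...>"

async def solve(chapters: list[str], question: str) -> Any:
    # Independent sub-tasks: launch one sub-agent per chapter concurrently.
    summaries = await asyncio.gather(
        *(launch_subagent(goal=f"Summarize this chapter:\n{c}") for c in chapters)
    )
    # Sequentially dependent step: the final delegation needs the summaries.
    return await launch_subagent(
        goal=f"Answer '{question}' using these summaries: {summaries}"
    )

print(asyncio.run(solve(["Chapter 1 ...", "Chapter 2 ..."], "Who found the map?")))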

Local node reward with delegation bonus

Each node in the execution tree receives a local reward based on its own task signal, plus an optional delegation bonus tied to the success rate of its immediate children:

$$ R(X, \tau_X) \;=\; \underbrace{\tilde{s}(X,\tau_X)}_{\text{success}(X)\text{ / proxy}} \;+\; \underbrace{\lambda \cdot \frac{1}{|C(X)|} \sum_{c \in C(X)} \tilde{s}(c,\tau_c)}_{\text{delegation bonus}} $$

The first term rewards solving the assigned task; the second rewards delegating to children that successfully solve their sub-tasks. Using the children's success rate rather than a raw count of successes avoids rewarding the agent for spawning extra children just to inflate the bonus. Setting $\lambda=0$ recovers a reward based purely on the node's own success.

RAO reward design diagram
Reward design. Each node combines its own success with a bonus from its children's success rate (here λ = 0.4).
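As a small worked example, the function below computes this per-node reward. It assumes the success scores are already available as floats in [0, 1]; that data layout is an assumption of this sketch, not a detail from the paper.

def node_reward(own_success: float, child_successes: list[float], lam: float = 0.4) -> float:
    """R(X, tau_X) = success(X) + lam * mean success of the immediate children."""
    bonus = lam * sum(child_successes) / len(child_successes) if child_successes else 0.0
    return own_success + bonus

# A node that solved its own task (1.0) and delegated to three children,
# two of which succeeded, with lambda = 0.4 as in the figure above:
print(node_reward(1.0, [1.0, 1.0, 0.0], lam=0.4))  # 1.0 + 0.4 * 2/3 ≈ 1.27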

Optimization: a self-induced curriculum

Recursive execution induces a family of related task distributions across depths. Let $\mathcal{D}_0$ denote the root task distribution and $\mathcal{D}_d(\theta)$ the distribution of depth-$d$ sub-tasks generated by recursively applying the current policy. RAO optimizes a single shared policy across all of them:

$$ J(\theta) \;=\; \sum_{d=0}^{D} \mathbb{E}_{X \sim \mathcal{D}_d(\theta)} \left[ \mathbb{E}_{\tau_X \sim \pi_\theta(\cdot \mid X)} \big[ R(X,\tau_X) \big] \right] $$

This can be viewed as a multi-task objective where deeper nodes tend to be easier, structured sub-problems of their parents. In effect, the agent generates its own natural curriculum during training.

Advantages are computed using a leave-one-out baseline over root-rollout rewards, applied to all trajectories within a rollout tree: $A(\tau^{(g)}) = R(\tau^{(g)}) - b_{-g},\;\; b_{-g} = \tfrac{1}{G-1} \sum_{g' \neq g} R^{(g')}_{\mathrm{root}}$. To prevent deeper levels (which can vastly outnumber root trajectories) from dominating updates, we apply depth-level inverse-frequency weighting, downweighting trajectories from over-represented depths while preserving the overall update scale.
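The sketch below spells out this advantage computation against an assumed rollout layout; the dictionary fields, helper names, and NumPy usage are our own illustration rather than the released training code.

import numpy as np

def loo_baselines(root_rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out baseline b_{-g}: mean of the other rollouts' root rewards."""
    G = len(root_rewards)
    return (root_rewards.sum() - root_rewards) / (G - 1)  # assumes G > 1

def advantages_and_weights(rollouts):
    """rollouts: list of dicts with a scalar 'root_reward' and a list of
    (depth, reward) pairs under 'trajectories', one pair per node in the tree."""
    root = np.array([r["root_reward"] for r in rollouts], dtype=float)
    baselines = loo_baselines(root)

    # Every trajectory in rollout g is compared against that rollout's baseline.
    flat = [(d, rew, baselines[g])
            for g, r in enumerate(rollouts)
            for d, rew in r["trajectories"]]
    depths = np.array([d for d, _, _ in flat])
    advs = np.array([rew - b for _, rew, b in flat])

    # Depth-level inverse-frequency weights, renormalized to mean 1 so the
    # overall update scale is preserved.
    counts = np.bincount(depths)
    weights = 1.0 / counts[depths]
    weights *= len(weights) / weights.sum()
    return advs, weights

# Toy usage: three root rollouts, each contributing nodes at depths 0-2.
rollouts = [
    {"root_reward": 1.0, "trajectories": [(0, 1.0), (1, 1.4), (1, 0.6), (2, 1.0)]},
    {"root_reward": 0.0, "trajectories": [(0, 0.0), (1, 0.4)]},
    {"root_reward": 1.0, "trajectories": [(0, 1.0), (1, 1.0)]},
]
advs, weights = advantages_and_weights(rollouts)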

Summary: RAO at a glance

  • Trains a single shared policy over recursive rollouts with dynamically generated execution trees.
  • Provides dense credit assignment via a local reward at each node.
  • Computes advantages with a leave-one-out baseline on root rewards.
  • Optimizes a weighted, multi-task objective across depths, yielding a self-induced curriculum.

Results

We evaluate RAO on three benchmarks: TextCraft-Synth (a controllable Minecraft-style crafting benchmark we introduce), Oolong-Real (long-context QA over very long Dungeons & Dragons transcripts), and DeepDive (deep research). Across all three, recursive agents trained with RAO beat RL-trained single-agent baselines.

24% → 95%
TextCraft-Synth (8K context): single-agent vs. recursive-agent overall success rate.
0.20 → 0.32
Oolong-Real average reward: recursive agents process 55K–175K-token inputs despite a 32K training context limit.
+16 pp
DeepDive: improvement in held-out success rate over the RL-trained single-agent baseline.
2.5×
Faster wall-clock time on hard TextCraft-Synth tasks via concurrent sub-agent execution.

TextCraft-Synth

Qwen-3-4B-Instruct · controlled crafting tasks · trained on medium difficulty

TextCraft-Synth is a synthetic benchmark we introduce, inspired by TextCraft. The agent is given an initial inventory and must craft a target item using procedurally generated recipes. Crafting is naturally compositional and recursive: an item required for the target may itself need to be crafted from sub-components. Tasks come in three difficulty levels determined by the depth of the underlying crafting tree (Easy: 2–3, Medium: 4–6, Hard: 7–9). All training uses only medium-difficulty tasks; we evaluate generalization to easy and hard tasks at test time.
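To make the recursive structure concrete, here is a toy crafting tree with invented recipes (a hypothetical illustration, not the benchmark's actual recipe data): resolving the target item expands, post-order, into the sub-crafts it depends on.

RECIPES = {            # item -> ingredients it is crafted from (invented example data)
    "beehive": ["planks", "honeycomb"],
    "planks": ["log"],
    "honeycomb": [],   # already obtainable from the initial inventory
    "log": [],
}

def craft_plan(item: str) -> list[str]:
    """Post-order walk of the crafting tree: craft sub-components before their parent."""
    plan: list[str] = []
    for ingredient in RECIPES.get(item, []):
        plan += craft_plan(ingredient)
    plan.append(item)
    return plan

print(craft_plan("beehive"))  # ['log', 'planks', 'honeycomb', 'beehive']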

TextCraft crafting tree visualization
A TextCraft crafting tree: to craft a beehive, the agent must first craft its sub-components, recursively. Figure adapted from Prasad et al. (2024).
TextCraft training curves
Training curves on TextCraft-Synth in both constrained (8K) and unconstrained (40K) context settings. Recursive agents learn faster and reach higher final performance.

Success rate (SR) across evaluation difficulties. Steps and wall-clock time are computed over the intersection of tasks solved by both methods at each difficulty.

(a) Context Window: 8K train, 8K eval

Difficulty   Method      SR     Steps   Time (s)
All          Single      0.24    16       7.1
             Recursive   0.95    33       9.9
Easy         Single      0.55    12       5.2
             Recursive   1.00    23       8.0
Medium       Single      0.17    25      11.1
             Recursive   0.96    52      13.6
Hard         Single      0.00     —        —
             Recursive   0.88     —        —

(b) Context Window: 40K train, 256K eval

Difficulty   Method      SR     Steps   Time (s)
All          Single      0.73    54      35.7
             Recursive   0.96   115      19.8
Easy         Single      0.97    11       6.6
             Recursive   1.00    21       8.8
Medium       Single      0.87    60      38.1
             Recursive   0.98   109      20.9
Hard         Single      0.20   252     180.0
             Recursive   0.88   694      73.3

Oolong-Real

Qwen3-VL-30B-A3B-Instruct · long-context D&D transcript QA · 32K training context

Oolong-Real requires aggregating information from very long Dungeons & Dragons transcripts. Training is constrained to 32K tokens (a Tinker training-API limit), but instances require processing at least ~55K tokens. A single agent cannot fit the full input and must rely on heuristics like regex or selective printing. A recursive agent can chunk the input across sub-agents, each with a fresh 32K context.

Oolong-Real training curves
Training curves on Oolong-Real. The recursive agent (orange) substantially outperforms the single agent (green) despite the same 32K training context limit.

Average reward across context-length buckets (650-sample evaluation, bucketed 55K–175K). Steps and time on the common non-zero-score intersection.

Method      Avg.    55K     118K    175K    Steps   Time (s)
Single      0.203   0.351   0.183   0.129    7.1     12.6
Recursive   0.320   0.454   0.315   0.249   61.5    175.4

Notably, our 30B recursive model approaches the Oolong performance of much larger frontier models — Claude-Sonnet-4 (0.37), o3 (0.37), and GPT-5-mini (0.35).

DeepDive

Qwen-3-4B-Instruct · multi-hop deep-research QA · 75 training steps

DeepDive contains challenging QA pairs constructed by performing controlled walks over knowledge graphs, requiring multi-hop, iterative web searches and synthesis over information scattered across the web. Sub-tasks here are sequentially dependent rather than parallelizable: only 1.6% of delegations were concurrent (vs. 83.9% on TextCraft-Synth).

DeepDive training curves
DeepDive training curves. The recursive agent learns substantially faster than the single agent; the gap continues to widen with more training.
DeepDive evaluation on 50 held-out tasks.
Method      SR      Steps   Time (s)
Single      0.24      5.2    13.3
Recursive   0.40    121.2   233.0

On tasks solved by both methods the average max depth was 2.9; on tasks uniquely solved by the recursive agent it was 4 — the agent learns to allocate more test-time compute to harder questions.

Recursive agents adapt delegation depth to task difficulty

Maximum delegation depth on TextCraft
Maximum delegation depth reached by successful rollouts on TextCraft-Synth. Although training caps depth at 6, the agent learns to scale to greater depths when solving harder problems.

RAO teaches the agent when and how much to delegate. On TextCraft-Synth, the depth distribution of successful rollouts tracks task difficulty closely. On Oolong-Real, almost all successful rollouts use depth 1 — intuitive, since long-context aggregation is best handled by chunking once. On DeepDive, harder tasks elicit deeper delegation. Across all three, the agent learns task-appropriate delegation rather than applying recursion uniformly.

Ablations

Ablation: reward design and depth weighting
Ablation of RAO design choices on TextCraft-Synth medium. Both dense rewards (root + sub-agent) and depth-level inverse-frequency weighting are important for training efficiency.

We ablate the two key design choices in RAO. Dense rewards (using both root and sub-agent task-specific rewards) substantially outperform sparse propagation of root rewards alone. Independently, depth-level inverse-frequency weighting improves over the unweighted variant by preventing deeper, more populated levels of the tree from dominating the gradient.

Key takeaways

Beyond the context window

Recursive agents solve tasks whose inputs exceed the model's training context — by chunking and delegating to sub-agents.

Better training efficiency

Dense, structured sub-agent rewards plus a self-induced curriculum yield faster and stronger learning even when context is unconstrained.

Generalizes to harder tasks

Trained on medium difficulty, recursive agents reach 88% SR on hard tasks — vs. 20% for the single-agent baseline.

Adaptive test-time compute

Recursion depth scales with task difficulty: easy problems stay shallow, hard problems unlock deeper trees.

Faster on parallel tasks

Up to 2.5× wall-clock speedup on hard TextCraft tasks via concurrent asyncio sub-agent launches.

A general principle

Inference-time scaffolds shouldn't just be designed around models — models should be trained to use them.

BibTeX

@article{gandhi2026rao,
  title   = {Recursive Agent Optimization},
  author  = {Gandhi, Apurva and Chakraborty, Satyaki and Wang, Xiangjun
             and Kumar, Aviral and Neubig, Graham},
  journal = {arXiv preprint arXiv:2605.06639},
  year    = {2026}
}