Recursive Agents · Reinforcement Learning

Recursive Agent Optimization (RAO)

Training LLM agents to use recursion as a learned inference-time primitive.

Apurva Gandhi1 · Satyaki Chakraborty2 · Xiangjun Wang2 · Aviral Kumar1 · Graham Neubig1
1Carnegie Mellon University  ·  2Amazon AGI Labs
Recursive agent inference. An agent can decide to spawn and delegate sub-problems to sub-agents, each of which can recursively delegate further. This induces a dynamically structured execution tree where each node corresponds to one agent instance attempting an assigned task. RAO optimizes a single policy to act across all levels of this tree, teaching it when and how to delegate.

TL;DR

The idea

How do we train a model to best take advantage of sub-agents at inference time? RAO is an end-to-end reinforcement learning approach for training a single LLM to spawn, delegate to, and coordinate with recursive copies of itself — turning recursive inference into a learned capability.

Why train for recursion?

01

Larger effective memory

Each sub-agent receives a fresh context window, expanding total usable working memory beyond the model's native limit.

02

Divide & conquer

Hard problems can be broken into easier sub-problems, with each sub-agent specializing on a piece of the task.

03

Parallelism

Independent sub-tasks can run concurrently, reducing wall-clock latency for many real-world tasks.

Abstract

We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves, recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, scale to tasks that exceed the model's context window, generalize to tasks much harder than those seen during training, and achieve lower wall-clock time than single-agent systems.

How RAO Works

Recursive agent inference

A recursive agent is implemented as an extension of an agent that interleaves natural-language reasoning with code execution in a Python REPL. We extend the action space with an asynchronous primitive:

async launch_subagent(goal, ...) -> Any

which launches a new instance of the same policy on a delegated sub-task. Because the return type is unrestricted, parents can request structured outputs in whatever format is most useful. Sub-agents can be launched sequentially (when later steps depend on earlier results) or concurrently via standard asyncio when sub-tasks are independent.
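To make this concrete, the snippet below sketches how a parent agent's REPL code might mix concurrent and sequential delegation. Only the launch_subagent primitive comes from the description above; the stub implementation, goal strings, and the solve helper are illustrative assumptions, not the actual system.

import asyncio
from typing import Any

# Minimal, self-contained sketch (not the authors' implementation). In the real
# system, launch_subagent is provided by the agent harness and runs a fresh copy
# of the same policy on the delegated goal; here it is stubbed so the example
# runs on its own.
async def launch_subagent(goal: str, **kwargs: Any) -> Any:
    await asyncio.sleep(0)  # yield control, as a real asynchronous call would
    return f"<sub-agent result for: {goal[:30]}...>"

async def solve(chapters: list[str], question: str) -> Any:
    # Independent sub-tasks: launch one sub-agent per chapter concurrently.
    summaries = await asyncio.gather(
        *(launch_subagent(goal=f"Summarize this chapter:\n{c}") for c in chapters)
    )
    # Sequentially dependent step: the final delegation needs the summaries.
    return await launch_subagent(
        goal=f"Answer '{question}' using these summaries: {summaries}"
    )

print(asyncio.run(solve(["Chapter 1 ...", "Chapter 2 ..."], "Who found the map?")))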

Local node reward with delegation bonus

Each node in the execution tree receives a local reward based on its own task signal, plus an optional delegation bonus tied to the success rate of its immediate children:

$$ R(X, \tau_X) \;=\; \underbrace{\tilde{s}(X,\tau_X)}_{\text{success}(X)\text{ / proxy}} \;+\; \underbrace{\lambda \cdot \frac{1}{|C(X)|} \sum_{c \in C(X)} \tilde{s}(c,\tau_c)}_{\text{delegation bonus}} $$

The first term rewards solving the assigned task; the second rewards delegating to children that successfully solve their sub-tasks. Using the children's success rate rather than a raw count of successes avoids rewarding the agent for spawning extra children just to inflate the bonus. Setting $\lambda=0$ recovers a reward based purely on the node's own success.

RAO reward design diagram
Reward design. Each node combines its own success with a bonus from its children's success rate (here λ = 0.4).
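As a small worked example, the function below computes this per-node reward. It assumes the success scores are already available as floats in [0, 1]; that data layout is an assumption of this sketch, not a detail from the paper.

def node_reward(own_success: float, child_successes: list[float], lam: float = 0.4) -> float:
    """R(X, tau_X) = success(X) + lam * mean success of the immediate children."""
    bonus = lam * sum(child_successes) / len(child_successes) if child_successes else 0.0
    return own_success + bonus

# A node that solved its own task (1.0) and delegated to three children,
# two of which succeeded, with lambda = 0.4 as in the figure above:
print(node_reward(1.0, [1.0, 1.0, 0.0], lam=0.4))  # 1.0 + 0.4 * 2/3 ≈ 1.27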

Optimization: a self-induced curriculum

Recursive execution induces a family of related task distributions across depths. Let $\mathcal{D}_0$ denote the root task distribution and $\mathcal{D}_d(\theta)$ the distribution of depth-$d$ sub-tasks generated by recursively applying the current policy. RAO optimizes a single shared policy across all of them:

$$ J(\theta) \;=\; \sum_{d=0}^{D} \mathbb{E}_{X \sim \mathcal{D}_d(\theta)} \left[ \mathbb{E}_{\tau_X \sim \pi_\theta(\cdot \mid X)} \big[ R(X,\tau_X) \big] \right] $$

This can be viewed as a multi-task objective where deeper nodes tend to be easier, structured sub-problems of their parents. In effect, the agent generates its own natural curriculum during training.

Advantages are computed using a leave-one-out baseline over root-rollout rewards, applied to all trajectories within a rollout tree: $A(\tau^{(g)}) = R(\tau^{(g)}) - b_{-g},\;\; b_{-g} = \tfrac{1}{G-1} \sum_{g' \neq g} R^{(g')}_{\mathrm{root}}$. To prevent deeper levels (which can vastly outnumber root trajectories) from dominating updates, we apply depth-level inverse-frequency weighting, downweighting trajectories from over-represented depths while preserving the overall update scale.
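The sketch below spells out this advantage computation against an assumed rollout layout; the dictionary fields, helper names, and NumPy usage are our own illustration rather than the released training code.

import numpy as np

def loo_baselines(root_rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out baseline b_{-g}: mean of the other rollouts' root rewards."""
    G = len(root_rewards)
    return (root_rewards.sum() - root_rewards) / (G - 1)  # assumes G > 1

def advantages_and_weights(rollouts):
    """rollouts: list of dicts with a scalar 'root_reward' and a list of
    (depth, reward) pairs under 'trajectories', one pair per node in the tree."""
    root = np.array([r["root_reward"] for r in rollouts], dtype=float)
    baselines = loo_baselines(root)

    # Every trajectory in rollout g is compared against that rollout's baseline.
    flat = [(d, rew, baselines[g])
            for g, r in enumerate(rollouts)
            for d, rew in r["trajectories"]]
    depths = np.array([d for d, _, _ in flat])
    advs = np.array([rew - b for _, rew, b in flat])

    # Depth-level inverse-frequency weights, renormalized to mean 1 so the
    # overall update scale is preserved.
    counts = np.bincount(depths)
    weights = 1.0 / counts[depths]
    weights *= len(weights) / weights.sum()
    return advs, weights

# Toy usage: three root rollouts, each contributing nodes at depths 0-2.
rollouts = [
    {"root_reward": 1.0, "trajectories": [(0, 1.0), (1, 1.4), (1, 0.6), (2, 1.0)]},
    {"root_reward": 0.0, "trajectories": [(0, 0.0), (1, 0.4)]},
    {"root_reward": 1.0, "trajectories": [(0, 1.0), (1, 1.0)]},
]
advs, weights = advantages_and_weights(rollouts)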

Summary: RAO at a glance

  • Trains a single shared policy over recursive rollouts with dynamically generated execution trees.
  • Provides dense credit assignment via a local reward at each node.
  • Computes advantages with a leave-one-out baseline on root rewards.
  • Optimizes a weighted, multi-task objective across depths, yielding a self-induced curriculum.

Results

We evaluate RAO on three benchmarks: TextCraft-Synth (a controllable Minecraft-style crafting benchmark we introduce), Oolong-Real (long-context QA over very long Dungeons & Dragons transcripts), and DeepDive (deep research). Across all three, recursive agents trained with RAO beat RL-trained single-agent baselines.

24% → 95%
TextCraft-Synth (8K context): single-agent vs. recursive-agent overall success rate.
0.20 → 0.32
Oolong-Real average reward: recursive agents process 55K–175K-token inputs despite a 32K training context limit.
+16 pp
DeepDive: improvement in held-out success rate over the RL-trained single-agent baseline.
2.5×
Faster wall-clock time on hard TextCraft-Synth tasks via concurrent sub-agent execution.

TextCraft-Synth

Qwen-3-4B-Instruct · controlled crafting tasks · trained on medium difficulty

TextCraft-Synth is a synthetic benchmark we introduce, inspired by TextCraft. The agent is given an initial inventory and must craft a target item using procedurally generated recipes. Crafting is naturally compositional and recursive: an item required for the target may itself need to be crafted from sub-components. Tasks come in three difficulty levels determined by the depth of the underlying crafting tree (Easy: 2–3, Medium: 4–6, Hard: 7–9). All training uses only medium-difficulty tasks; we evaluate generalization to easy and hard tasks at test time.
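To make the recursive structure concrete, here is a toy crafting tree with invented recipes (a hypothetical illustration, not the benchmark's actual recipe data): resolving the target item expands, post-order, into the sub-crafts it depends on.

RECIPES = {            # item -> ingredients it is crafted from (invented example data)
    "beehive": ["planks", "honeycomb"],
    "planks": ["log"],
    "honeycomb": [],   # already obtainable from the initial inventory
    "log": [],
}

def craft_plan(item: str) -> list[str]:
    """Post-order walk of the crafting tree: craft sub-components before their parent."""
    plan: list[str] = []
    for ingredient in RECIPES.get(item, []):
        plan += craft_plan(ingredient)
    plan.append(item)
    return plan

print(craft_plan("beehive"))  # ['log', 'planks', 'honeycomb', 'beehive']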

TextCraft crafting tree visualization
A TextCraft crafting tree: to craft a beehive, the agent must first craft its sub-components, recursively. Figure adapted from Prasad et al. (2024).
TextCraft training curves
Training curves on TextCraft-Synth in both constrained (8K) and unconstrained (40K) context settings. Recursive agents learn faster and reach higher final performance.

Success rate (SR) across evaluation difficulties. Steps and wall-clock time are computed over the intersection of tasks solved by both methods at each difficulty.

(a) Context Window: 8K train, 8K eval

Difficulty   Method      SR     Steps   Time (s)
All          Single      0.24    16       7.1
             Recursive   0.95    33       9.9
Easy         Single      0.55    12       5.2
             Recursive   1.00    23       8.0
Medium       Single      0.17    25      11.1
             Recursive   0.96    52      13.6
Hard         Single      0.00     —        —
             Recursive   0.88     —        —

(b) Context Window: 40K train, 256K eval

Difficulty   Method      SR     Steps   Time (s)
All          Single      0.73    54      35.7
             Recursive   0.96   115      19.8
Easy         Single      0.97    11       6.6
             Recursive   1.00    21       8.8
Medium       Single      0.87    60      38.1
             Recursive   0.98   109      20.9
Hard         Single      0.20   252     180.0
             Recursive   0.88   694      73.3

Oolong-Real

Qwen3-VL-30B-A3B-Instruct · long-context D&D transcript QA · 32K training context

Oolong-Real requires aggregating information from very long Dungeons & Dragons transcripts. Training is constrained to 32K tokens (a Tinker training-API limit), but instances require processing at least ~55K tokens. A single agent cannot fit the full input and must rely on heuristics like regex or selective printing. A recursive agent can chunk the input across sub-agents, each with a fresh 32K context.

Oolong-Real training curves
Training curves on Oolong-Real. The recursive agent (orange) substantially outperforms the single agent (green) despite the same 32K training context limit.

Average reward across context-length buckets (650-sample evaluation, bucketed 55K–175K). Steps and time on the common non-zero-score intersection.

Method      Avg.    55K     118K    175K    Steps   Time (s)
Single      0.203   0.351   0.183   0.129    7.1     12.6
Recursive   0.320   0.454   0.315   0.249   61.5    175.4

Notably, our 30B recursive model approaches the Oolong performance of much larger frontier models — Claude-Sonnet-4 (0.37), o3 (0.37), and GPT-5-mini (0.35).

DeepDive

Qwen-3-4B-Instruct · multi-hop deep-research QA · 75 training steps

DeepDive contains challenging QA pairs constructed by performing controlled walks over knowledge graphs, requiring multi-hop, iterative web searches and synthesis over information scattered across the web. Sub-tasks here are sequentially dependent rather than parallelizable: only 1.6% of delegations were concurrent (vs. 83.9% on TextCraft-Synth).

DeepDive training curves
DeepDive training curves. The recursive agent learns substantially faster than the single agent; the gap continues to widen with more training.
DeepDive evaluation on 50 held-out tasks.
Method      SR      Steps   Time (s)
Single      0.24      5.2    13.3
Recursive   0.40    121.2   233.0

On tasks solved by both methods the average max depth was 2.9; on tasks uniquely solved by the recursive agent it was 4 — the agent learns to allocate more test-time compute to harder questions.

Recursive agents adapt delegation depth to task difficulty

Maximum delegation depth on TextCraft
Maximum delegation depth reached by successful rollouts on TextCraft-Synth. Although training caps depth at 6, the agent learns to scale to greater depths when solving harder problems.

RAO teaches the agent when and how much to delegate. On TextCraft-Synth, the depth distribution of successful rollouts tracks task difficulty closely. On Oolong-Real, almost all successful rollouts use depth 1 — intuitive, since long-context aggregation is best handled by chunking once. On DeepDive, harder tasks elicit deeper delegation. Across all three, the agent learns task-appropriate delegation rather than applying recursion uniformly.

Ablations

Ablation: reward design and depth weighting
Ablation of RAO design choices on TextCraft-Synth medium. Both dense rewards (root + sub-agent) and depth-level inverse-frequency weighting are important for training efficiency.

We ablate the two key design choices in RAO. Dense rewards (using both root and sub-agent task-specific rewards) substantially outperform sparse propagation of root rewards alone. Independently, depth-level inverse-frequency weighting improves over the unweighted variant by preventing deeper, more populated levels of the tree from dominating the gradient.

Key takeaways

Beyond the context window

Recursive agents solve tasks whose inputs exceed the model's training context — by chunking and delegating to sub-agents.

Better training efficiency

Dense, structured sub-agent rewards plus a self-induced curriculum yield faster and stronger learning even when context is unconstrained.

Generalizes to harder tasks

Trained on medium difficulty, recursive agents reach 88% SR on hard tasks — vs. 20% for the single-agent baseline.

Adaptive test-time compute

Recursion depth scales with task difficulty: easy problems stay shallow, hard problems unlock deeper trees.

Faster on parallel tasks

Up to 2.5× wall-clock speedup on hard TextCraft tasks via concurrent asyncio sub-agent launches.

A general principle

Inference-time scaffolds shouldn't just be designed around models — models should be trained to use them.

BibTeX

@article{gandhi2026rao,
  title   = {Recursive Agent Optimization},
  author  = {Gandhi, Apurva and Chakraborty, Satyaki and Wang, Xiangjun
             and Kumar, Aviral and Neubig, Graham},
  journal = {arXiv preprint arXiv:2605.06639},
  year    = {2026}
}