TL;DR: Tiny Recursive Models are effectively running a truncated form of policy iteration on each puzzle. Their “mysterious” reasoning power is just good old Bellman math in disguise.
TRM as Implicit Reinforcement Learning: One Coupled Operator to Rule Them All
This post is about a particular way to look at Tiny Recursive Models (TRM) [1] through the lens of reinforcement learning.
- You can cast TRM as doing implicit policy iteration on each puzzle instance.
- The two latent states \(z_H\) and \(z_L\) line up very naturally with policy and value.
- The single shared update operator \(f_\theta\) behaves like a coupled Bellman operator in latent space, which explains why sharing it works better than splitting it into two networks.
Along the way this gives a much cleaner story for things like
- why there are exactly two levels of recursion
- why TRM style sharing helps while the HRM style split [2] hurts
- why this whole thing is closer to an active world model than a static feedforward network
1. Casting TRM as an MDP
We start by treating each puzzle instance (for example Sudoku or Maze) as a Markov Decision Process (MDP)
\[ \mathcal M = (\mathcal S, \mathcal A, P, R, \gamma), \]where the agent is the recursive model and the environment is the fixed puzzle specification.
For a given puzzle \(x\) we define the pieces as follows.
State
At outer step \(t\), high level inner step \(h\), and low level inner step \(\ell\), the state is
\[ s^{t,h,\ell} = \bigl(x,\; y^{t},\; z_H^{t,h},\; z_L^{t,h,\ell}\bigr) \in \mathcal{S}. \]Here
- \(x\): the puzzle input
- \(y^{t}\): the current prediction or editable workspace
- \(z_H^{t,h}\): high level latent state
- \(z_L^{t,h,\ell}\): low level latent state
Both latents live in the same ambient space
\[ z_H^{t,h} \in \mathbb{R}^{n}, \qquad z_L^{t,h,\ell} \in \mathbb{R}^{n}, \]where \(n = (\text{seq\_len} + \text{puzzle\_emb\_len}) \cdot \text{hidden\_size}\) is the flattened dimension over tokens and channels.
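As a quick sanity check on the shapes, here is a tiny sketch (sizes are made up for illustration; only the formula for \(n\) comes from above):

```python
import torch

# Illustrative sizes only; real values depend on the task and model config.
seq_len, puzzle_emb_len, hidden_size = 81, 16, 512    # e.g. a 9x9 Sudoku grid
n = (seq_len + puzzle_emb_len) * hidden_size          # flattened latent dimension

# Both latents live in the same ambient space R^n.
z_H = torch.zeros(n)   # high level latent
z_L = torch.zeros(n)   # low level latent
```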
Action
The model chooses local edits through a policy head on \(z_H^{t,h}\):
\[ \pi_\theta(a \mid s^{t,h,\ell}) = \mathrm{softmax}\bigl(\mathrm{head}(z_H^{t,h})\bigr). \]So the action space \(\mathcal A\) is the set of local updates to the workspace \(y\), for example “write digit 7 into this Sudoku cell”, “paint this ARC cell blue”, and so on.
Sampling \(a \sim \pi_\theta(\cdot \mid s^{t,h,\ell})\) gives the next tentative edit.
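A minimal sketch of that policy head, assuming a hypothetical `head` module that emits one logit per candidate edit (not the actual TRM code):

```python
import torch
import torch.nn.functional as F

def action_distribution(z_H: torch.Tensor, head) -> torch.Tensor:
    """pi_theta(a | s) = softmax(head(z_H)): a distribution over local edits."""
    logits = head(z_H)                # e.g. one logit per (cell, symbol) pair
    return F.softmax(logits, dim=-1)

# Sampling a ~ pi_theta(. | s) gives the next tentative edit:
# probs = action_distribution(z_H, head)
# a = torch.multinomial(probs, num_samples=1)
```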
Transition
The dynamics are entirely given by the recursive update operators.
Given state \(s^{t,h,\ell}\), the next latent states are
\[ z_L^{t,h,\ell+1} = f_\theta\bigl(z_L^{t,h,\ell},\, z_H^{t,h},\, x\bigr), \]\[ z_H^{t,h+1} = f_\theta\bigl(z_H^{t,h},\, z_L^{t,h,\ell+1},\, x\bigr), \]and the workspace is updated by applying the chosen action
\[ y^{t+1} = \mathrm{Apply}(y^{t}, a). \]All action information flows through \(z_H^{t,h}\):
the policy head reads from \(z_H\), and that same \(z_H\) is fed into the shared update operator \(f_\theta\) which moves both latents forward.
When the low level loop over \(\ell\) finishes, control returns to the high level loop over \(h\).
When the high level loop finishes, the outer ACT step increments \(t\).
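Putting the pieces together, one outer ACT step looks roughly like the sketch below. `f_theta`, `head`, and `apply_edit` are placeholders rather than the actual TRM code, and `H`, `L` are the numbers of high and low level inner steps.

```python
import torch

def outer_step(x, y, z_H, z_L, f_theta, head, apply_edit, H=3, L=6):
    """One outer step t of the recursion, read as an MDP transition (sketch)."""
    for h in range(H):                       # high level inner loop
        for l in range(L):                   # low level inner loop
            z_L = f_theta(z_L, z_H, x)       # z_L^{t,h,l+1} = f_theta(z_L, z_H, x)
        z_H = f_theta(z_H, z_L, x)           # z_H^{t,h+1}  = f_theta(z_H, z_L, x)

    # All action information flows through z_H: sample an edit and apply it.
    probs = torch.softmax(head(z_H), dim=-1)      # pi_theta(a | s)
    a = torch.multinomial(probs, num_samples=1)   # tentative local edit
    y = apply_edit(y, a)                          # y^{t+1} = Apply(y^t, a)
    return y, z_H, z_L
```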
Reward and discount
We reinterpret the supervised training signal as a reward
\[ R(s^{t}) = - \mathrm{loss}(y^{t}, y^*), \]where \(y^*\) is the ground truth and \(\mathrm{loss}(\cdot,\cdot)\) is task specific, for example cross entropy over the full grid.
TRM uses a finite number of reasoning steps, so it is natural to set
\[ \gamma = 1. \]TRM itself is not an RL algorithm during training, but this MDP view is very useful for understanding its latent dynamics.
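In code the reward is literally the negated supervised loss. A sketch with hypothetical tensors `y_logits` (per-cell logits for the current workspace) and `y_star` (ground truth indices):

```python
import torch.nn.functional as F

def reward(y_logits, y_star):
    """R(s^t) = -loss(y^t, y*), here with cross entropy over the full grid."""
    # y_logits: (num_cells, num_symbols), y_star: (num_cells,)
    return -F.cross_entropy(y_logits, y_star)
```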
2. Latent policy and improvement operator
Now recall classical policy iteration on an MDP with transition \(T\) and reward \(R\).
Classical policy iteration
At iteration \(k\) we have a policy \(\pi_k\). Policy evaluation solves
\[ V^{\pi_k}(s) = \mathbb{E}[R(s,\pi_k(s))] + \gamma \sum_{s' \in \mathcal S} T(s,\pi_k(s),s')\, V^{\pi_k}(s') \]for all states \(s\).
Policy improvement then defines a new policy
\[ \pi_{k+1}(s) \in \arg\max_{a \in \mathcal A} \Bigl\{ \mathbb{E}[R(s,a)] + \gamma \sum_{s' \in \mathcal S} T(s,a,s')\, V^{\pi_k}(s') \Bigr\}. \]The important part is structural: both evaluation and improvement are derived from the same environment \((T, R)\). They are two faces of one Bellman operator.
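For reference, a minimal tabular version of this loop on a generic known MDP (nothing TRM specific; `T` and `R` are given arrays):

```python
import numpy as np

def policy_iteration(T, R, gamma=0.99):
    """Tabular policy iteration. T: (S, A, S) transitions, R: (S, A) rewards."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve V = R_pi + gamma * T_pi V exactly.
        T_pi = T[np.arange(S), pi]                       # (S, S)
        R_pi = R[np.arange(S), pi]                       # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: greedy one-step lookahead against V.
        Q = R + gamma * T @ V                            # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```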
TRM as implicit policy iteration
Compare this with TRM.
The policy at inner step \((t,h,\ell)\) is
\[ \pi_\theta(a \mid s^{t,h,\ell}) = \mathrm{softmax}\bigl(\mathrm{head}(z_H^{t,h})\bigr), \]so \(z_H^{t,h}\) is a compact latent representation of the current policy.
The low level update is
\[ z_L^{t,h,\ell+1} = f_\theta\bigl(z_L^{t,h,\ell}, z_H^{t,h}, x\bigr), \]which acts like a Bellman style update to a value representation:
for fixed \(z_H\) and puzzle \(x\), repeated updates on \(z_L\) behave like an instance specific policy evaluation process.
The high level update
\[ z_H^{t,h+1} = f_\theta\bigl(z_H^{t,h}, z_L^{t,h,\ell+1}, x\bigr) \]then uses the refined value signal in \(z_L\) to adjust the policy representation \(z_H\).
This is exactly what policy improvement does with \(V^{\pi_k}\).
The correspondence is:
- \(z_H\) plays the role of the current policy \(\pi\). Through the head and softmax, \(z_H\) deterministically produces a distribution over actions.
- \(z_L\) plays the role of a value function \(V^\pi\). For a fixed policy representation \(z_H\) and puzzle \(x\), the recursion
\[ z_L^{(k+1)} = f_\theta\bigl(z_L^{(k)}, z_H, x\bigr) \]mimics repeated Bellman backups
\[ V^{k+1} = \mathcal T^\pi V^{k}, \]with \(f_\theta\) implicitly learning the Bellman operator \(\mathcal T^\pi\) from data.
The update of \(z_H\) from
\[ z_H^{t,h+1} = f_\theta(z_H^{t,h}, z_L^{t,h,\ell+1}, x) \]plays the role of policy improvement: it takes the value like latent \(z_L\) and produces a policy representation whose logits tend to give lower loss.
So even though TRM is trained purely with supervised objectives, its inner dynamics look like a truncated, puzzle specific policy iteration loop in latent space.
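Seen this way, the inner recursion is a truncated policy iteration written in latent coordinates. A sketch with the same hypothetical `f_theta` placeholder as before:

```python
def latent_policy_iteration(x, z_H, z_L, f_theta, H=3, L=6):
    """The inner recursion, annotated with its policy iteration reading (sketch)."""
    for h in range(H):
        # "Policy evaluation": z_H is held fixed while z_L is refined,
        # mimicking repeated Bellman backups V^{k+1} = T^pi V^k.
        for l in range(L):
            z_L = f_theta(z_L, z_H, x)
        # "Policy improvement": the refined value signal in z_L is used
        # to update the policy representation z_H.
        z_H = f_theta(z_H, z_L, x)
    return z_H, z_L
```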
3. How TRM differs from classical policy and value iteration
The analogy is useful, but there are important differences.
Instance local evaluation
Classical policy evaluation tries to compute \(V(s)\) for all \(s \in \mathcal S\), or at least sweep over the whole state space. TRM only ever evaluates the value like latent for the current puzzle and its trajectory \(\{s^{t,h,\ell}\}\). There is no explicit table over \(\mathcal S\); value information is recomputed from scratch for each instance.
Parametric Bellman operator
In an MDP we define a Bellman operator \(\mathcal T^\pi\) exactly in terms of known \(T\) and \(R\). In TRM this operator is replaced by a learned network \(f_\theta\). It is trained from supervised loss on final predictions, so it absorbs both dynamics and reward patterns implicitly and is only guaranteed to be Bellman like on the distribution of training puzzles and trajectories.
Truncated evaluation
Policy iteration typically evaluates \(V^{\pi_k}\) until convergence, or at least for several sweeps. TRM runs only a fixed and small number of low level updates on \(z_L\) (and high level updates on \(z_H\)). The value like representation is therefore a truncated approximation rather than a converged fixed point (see the small numerical sketch after these four points).
Supervised training signal
The reward \(R(s^{t}) = -\mathrm{loss}(y^{t},y^*)\) is a reinterpretation of the loss, not the signal used to train the model. TRM does not optimize expected return in the RL sense; it minimizes supervised loss. The RL view is therefore a way to interpret what the inner loop is doing, not a description of the outer optimization.
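To see what truncation costs, here is a small numerical sketch comparing exact policy evaluation with a few Bellman backups (a generic tabular MDP, not TRM itself):

```python
import numpy as np

def evaluate_policy(T_pi, R_pi, gamma=0.99, sweeps=None):
    """Exact vs truncated policy evaluation for a fixed policy.

    T_pi: (S, S) transitions under the policy, R_pi: (S,) rewards under it.
    sweeps=None solves for the fixed point; a small integer mimics TRM's
    fixed, small number of low level updates.
    """
    S = T_pi.shape[0]
    if sweeps is None:
        return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
    V = np.zeros(S)
    for _ in range(sweeps):
        V = R_pi + gamma * T_pi @ V    # one Bellman backup
    return V
```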
Despite these mismatches, the structural similarity is strong enough to give a very clean explanation of TRM design choices.
4. Why two levels of recursion?
HRM and TRM both use two nested latent loops.
The original motivation for HRM is partly biological: “two levels of reasoning in the brain”. TRM simplifies the interpretation and suggests
- \(z_H\) is the current solution
- \(z_L\) is a latent embedding of the problem
This is more down to earth than “brains” but still mostly intuitive and does not pin the design down to a precise algorithm.
From the RL perspective, the two levels suddenly look very canonical:
- The low level recursion on \(z_L\) behaves like policy evaluation for the policy represented by \(z_H\). It repeatedly refines a value like representation for the current puzzle and latent state.
- The high level recursion on \(z_H\) behaves like policy improvement. It updates the policy representation using the refined value context from \(z_L\).
In other words, the two levels correspond exactly to the two main steps of policy iteration. That gives you a concrete, mathematically grounded reason for having two levels and not three or seven.
Adding a third latent level would require a clear interpretation in terms of some known iterative procedure. Without that, the additional depth is just uncontrolled complexity.
By something very close to Occam's razor, the two level structure is the simplest one that still matches a well studied algorithmic template.
5. Why sharing the high level and low level update operators helps
A key empirical fact is that TRM shares a single update operator \(f_\theta\) between the high and low level recursions, while HRM uses separate modules for each, and TRM tends to perform better.
Given the RL view, this is exactly what you would expect.
In classical policy iteration you do not have two arbitrary operators. Both policy evaluation and policy improvement are derived from the same environment dynamics \((T,R)\).
There is one underlying Bellman structure.
TRM mirrors that:
- low level: \(z_L^{\prime} = f_\theta(z_L, z_H, x)\) acts like a value refinement step
- high level: \(z_H^{\prime} = f_\theta(z_H, z_L', x)\) acts like a policy refinement step
They are two different slices of a single latent operator \(f_\theta\).
They live in the same geometry, share the same Jacobian structure, and are jointly trained to reduce final loss.
If you split them into two separate networks
- one for updating \(z_L\)
- one for updating \(z_H\)
you break that coupling. The “evaluation” dynamics and the “improvement” dynamics can drift apart into incompatible coordinate systems.
The space of possible joint dynamics becomes much larger, which makes it easier for optimization to land in unstable or misaligned behaviors.
The shared \(f_\theta\) therefore acts as a strong inductive bias:
- you do not learn two arbitrary hacks for “value” and “policy”
- you learn a single latent dynamical law whose two projections look like value refinement and policy refinement
This explains why a single operator can outperform two specialised ones. It is not about capacity, it is about enforcing the right coupling.
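A toy sketch of the two designs (hypothetical modules, much simpler than the real transformer blocks), just to make the weight sharing explicit:

```python
import torch
import torch.nn as nn

class SharedCore(nn.Module):
    """TRM style: one operator f_theta, sliced two ways."""
    def __init__(self, n):
        super().__init__()
        self.f_theta = nn.Sequential(nn.Linear(3 * n, n), nn.GELU(), nn.Linear(n, n))

    def step(self, primary, context, x):
        # z_L' = step(z_L, z_H, x) is "value refinement",
        # z_H' = step(z_H, z_L', x) is "policy refinement";
        # both pass through the same weights.
        return primary + self.f_theta(torch.cat([primary, context, x], dim=-1))

class SplitCore(nn.Module):
    """HRM style: two separate operators whose dynamics can drift apart."""
    def __init__(self, n):
        super().__init__()
        self.f_L = nn.Sequential(nn.Linear(3 * n, n), nn.GELU(), nn.Linear(n, n))
        self.f_H = nn.Sequential(nn.Linear(3 * n, n), nn.GELU(), nn.Linear(n, n))

    def step_low(self, z_L, z_H, x):
        return z_L + self.f_L(torch.cat([z_L, z_H, x], dim=-1))

    def step_high(self, z_H, z_L, x):
        return z_H + self.f_H(torch.cat([z_H, z_L, x], dim=-1))
```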
6. TRM as an active world model
Viewed this way, TRM is very close to an active world model.
- The outer supervised training loop teaches \(f_\theta\) how puzzles respond to actions in a latent space that is useful for solving them.
- At test time, the inner recursion over \((z_H,z_L)\) is effectively a small “agent” that learns, on the fly, a good policy for this particular puzzle by repeatedly applying the learned dynamics.
So you can summarise it like this:
TRM is a world model that is trained offline, but actually learns its policy online at test time by running a latent policy iteration loop over each puzzle.
The RL lens makes that behavior explicit and gives you language to reason about new modifications
for example adding explicit exploration, learning an explicit reward head, changing the number of evaluation steps versus improvement steps, and so on.
7. Where this perspective leads
Once you accept TRM as a latent policy iteration engine, a lot of knobs start to look very natural:
- more low level steps \(L\) ≈ more accurate policy evaluation
- more high level steps \(H\) ≈ more aggressive policy improvement
- stochasticity in initialization or in actions ≈ exploration in local policy space
- regularizers on convergence of \(z_L\) and \(z_H\) ≈ stability constraints on the learned Bellman operator
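Read as a configuration surface, these knobs might look something like the sketch below (hypothetical names, not the released hyperparameters):

```python
from dataclasses import dataclass

@dataclass
class LatentPolicyIterationConfig:
    L: int = 6                    # low level steps  ~ accuracy of policy evaluation
    H: int = 3                    # high level steps ~ aggressiveness of policy improvement
    init_noise: float = 0.0       # stochastic latent init ~ exploration in local policy space
    convergence_reg: float = 0.0  # penalty on z_L / z_H drift ~ stability of the
                                  # learned Bellman operator
```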
And crucially, the design choices in TRM versus HRM are no longer just “what the authors found worked in practice”, but instances of a more general question:
What is the right latent approximation of a Bellman operator for this class of problems?
That is a much nicer place to be in than hand waving about “two levels because brains”.
References
[1] A. Jolicoeur-Martineau. “Less is More: Recursive Reasoning with Tiny Networks.” 2025.
[2] G. Wang, J. Li, Y. Sun, X. Chen, C. Liu, Y. Wu, M. Lu, S. Song, Y. Abbasi Yadkori. “Hierarchical Reasoning Model.” 2025.
Cited as:
Benhao Huang. (Dec 2025). Tiny Recursive Model is Secretly Doing Policy Iterations. Husky's Log. posts/recursive_models/
@article{ benhao2025tiny,
title = "Tiny Recursive Model is Secretly Doing Policy Iterations",
author = "Benhao Huang",
journal = "Husky's Log",
year = "2025",
month = "Dec",
url = "posts/recursive_models/"
}