Latent Reasoning · Language Models

Tyler: Typed Latent Reasoning
for Language Models

When to think, what to compute, and how much to allocate.

Hanyu Lin¹ · Min Cai² · Jiawei Wen¹ · Haodi Zhang¹
¹Shenzhen University  ·  ²University of Alberta
+14.5 points over CoT
+4.3 over best baseline
3 typed operators
2.3 forgetting score
Comparison of latent reasoning paradigms
Prior methods rely on fixed-position latent tokens or external triggers. Tyler interleaves visible decoding with typed latent-operator calls during generation.
The idea

Reasoning that stays in latent space — on demand.

Chain-of-thought makes a model write out every intermediate step as text. That's redundant and slow, and it forces a rigid interface: the model must spell out its reasoning before it can answer, instead of computing internally when needed.

Tyler relaxes that interface. At each decoding step the model decides whether to emit a visible token or quietly switch to a typed latent operator that carries part of the computation in continuous representations. Prior latent-reasoning methods mostly focus on how to build latent tokens; Tyler instead reframes the problem as one of online control — when to think, what kind of computation to run, and how much budget to spend.

No fixed positions, no external trigger model. The choice between speaking and silently thinking competes inside a single next-action distribution, so the LLM itself stays in control.
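One way to picture the single next-action distribution is a softmax taken over the vocabulary logits and the operator logits together. The sketch below is a minimal illustration under that assumption, not the released implementation: the action names `<O_g>`, `<O_s>`, `<O_p>`, the function name, and the toy logits are all hypothetical.

```python
import math

# Hypothetical action names for the three typed operators.
OPERATORS = ["<O_g>", "<O_s>", "<O_p>"]

def next_action_distribution(token_logits, operator_logits):
    """Softmax over vocabulary logits and operator logits concatenated,
    so visible tokens and latent operators compete in one distribution."""
    names = list(token_logits.keys()) + OPERATORS
    logits = list(token_logits.values()) + list(operator_logits)
    peak = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}

# Toy logits, illustrative only.
token_logits = {"the": 1.0, "answer": 0.5, "42": 0.2}
operator_logits = [0.1, 2.0, -1.0]            # scores for O_g, O_s, O_p

dist = next_action_distribution(token_logits, operator_logits)
best = max(dist, key=dist.get)                # the state-update operator wins here
```

Because the operators sit in the same softmax as ordinary tokens, raising an operator's logit directly lowers the probability of every visible token, which is what lets a single policy arbitrate between speaking and thinking.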

“Can an LLM learn to interleave visible decoding with silent latent computation, while deciding when to think, what type of computation to perform, and how much budget to spend?”
— The research question motivating Tyler
Three typed operators

Different steps do different jobs.

Reasoning chains are functionally heterogeneous: some steps orient the problem, some update the evolving state, others reuse a procedure. Collapsing all of this into one undifferentiated latent vector throws that structure away. Tyler keeps it, with three operators whose type supplies an inductive bias while their behavior is learned from chain-of-thought data.

O_g

Global orientation

Frames the problem and plans the approach — invoked at the start of generation.

O_s

State update

Maintains local intermediate state, most active in early and middle reasoning stages.

O_p

Procedural abstraction

Reuses solution patterns — appears more often as reasoning approaches the answer.

Tyler architecture and decoding flow
At each step the policy emits a token or fires an operator, which synthesizes operator-conditioned latent tokens that guide subsequent decoding.

When an operator fires, a shared LoRA-augmented synthesizer turns the current context into latent tokens — the backbone weights stay frozen, so this is parameter-efficient. The operators share that synthesizer but specialize through their own query tokens and projection heads, which is what lets them transfer across reasoning domains rather than memorizing a task-specific output format.
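The sharing pattern described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the hidden size, the number of latent tokens per call, the rank-1 form of the LoRA update, and the way queries are mixed with the context are all invented for illustration, and `TypedSynthesizer` is a hypothetical name.

```python
import random

DIM = 4        # toy hidden size (assumption)
N_LATENT = 2   # latent tokens produced per operator call (assumption)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_vec(rng, n):
    return [rng.uniform(-1, 1) for _ in range(n)]

def lora_apply(W, a, b, x):
    """(W + b a^T) x: frozen weight W plus a rank-1 trainable update."""
    scale = sum(ai * xi for ai, xi in zip(a, x))
    return [h + bi * scale for h, bi in zip(matvec(W, x), b)]

class TypedSynthesizer:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Shared across all operators: frozen weight and one LoRA adapter.
        self.W = [rand_vec(rng, DIM) for _ in range(DIM)]
        self.lora_a = rand_vec(rng, DIM)
        self.lora_b = rand_vec(rng, DIM)
        # Operator-specific: query tokens and a projection head per type.
        self.queries = {op: [rand_vec(rng, DIM) for _ in range(N_LATENT)]
                        for op in ("g", "s", "p")}
        self.heads = {op: [rand_vec(rng, DIM) for _ in range(DIM)]
                      for op in ("g", "s", "p")}

    def synthesize(self, op, context):
        latents = []
        for q in self.queries[op]:
            mixed = [c + qi for c, qi in zip(context, q)]   # condition query on context
            hidden = lora_apply(self.W, self.lora_a, self.lora_b, mixed)
            latents.append(matvec(self.heads[op], hidden))  # operator-specific head
        return latents

syn = TypedSynthesizer(seed=0)
ctx = [0.5, -0.2, 0.1, 0.9]
out_s = syn.synthesize("s", ctx)   # state-update latents
out_p = syn.synthesize("p", ctx)   # procedural-abstraction latents
```

The same context yields different latent tokens depending on which operator fired, even though the heavy computation runs through one shared path. Only the adapter, queries, and heads would be trainable in this picture.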

Two-stage training

First teach the operators, then learn when to use them.

Separating what each operator computes from when it should be called keeps training stable: Stage 1 gives the operators a clear functional identity, and Stage 2 only has to learn a sparse routing policy on top of a fixed latent interface. Both stages train on just 15K math problem–solution traces from OpenR1-Math.

Stage 1

Align the operators

The latent operators and synthesis module are trained with heuristic position supervision while the backbone LLM stays frozen, anchoring each operator to its intended role.

Stage 2

Learn the policy

A budget-aware invocation policy is optimized with GRPO. An operator-anchored auxiliary loss stabilizes learning at sparse invocation points.
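To make the Stage 2 signal concrete, here is a minimal sketch of GRPO-style group-relative advantages combined with a budget penalty. The advantage normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO recipe; the reward shape, the `penalty` weight, and the default `budget` of 5 calls are assumptions made for illustration, not values taken from the paper's training setup.

```python
import statistics

def budget_aware_reward(correct, n_calls, budget=5, penalty=0.1):
    """Hypothetical reward: 1 for a correct answer, minus a linear
    penalty for each latent call beyond the budget."""
    over_budget = max(0, n_calls - budget)
    return (1.0 if correct else 0.0) - penalty * over_budget

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group
    of rollouts sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four toy rollouts for one prompt: (answer correct?, latent calls used).
rollouts = [(True, 3), (True, 7), (False, 2), (False, 9)]
rewards = [budget_aware_reward(c, n) for c, n in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, the correct rollout that stays within budget gets the largest advantage, the correct but over-budget rollout gets a smaller one, and the wrong, over-budget rollout is pushed down hardest.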

Results

Consistent gains that transfer across domains.

Trained only on math traces, Tyler still improves scientific and theorem-oriented reasoning — and tops every baseline on average across two backbones.

Method                   GSM8K   MATH-500   GPQA-D   TheoremQA   Average
CoT                      58.9    68.0       19.7     19.5        41.5
MemGen (best baseline)   84.2    70.2       25.3     27.3        51.7
Tyler                    85.2    76.4       33.5     29.0        56.0 ↑14.5

Pass@1 accuracy (%) on SmolLM3-3B. Tyler also reaches a 65.8 average on Qwen3-4B, +9.2 over CoT.
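The headline deltas follow directly from the SmolLM3-3B rows. The short check below recomputes the averages from the four per-benchmark scores; the variable names are ours, and the numbers are copied from the table.

```python
# Per-benchmark Pass@1 scores on SmolLM3-3B, in table order:
# GSM8K, MATH-500, GPQA-D, TheoremQA.
scores = {
    "CoT":    [58.9, 68.0, 19.7, 19.5],
    "MemGen": [84.2, 70.2, 25.3, 27.3],
    "Tyler":  [85.2, 76.4, 33.5, 29.0],
}

# Unweighted mean over the four benchmarks.
avg = {method: sum(vals) / len(vals) for method, vals in scores.items()}

gain_over_cot = avg["Tyler"] - avg["CoT"]        # about 14.5 points
gain_over_memgen = avg["Tyler"] - avg["MemGen"]  # about 4.3 points
```

The unrounded MemGen mean is 51.75, which the table reports as 51.7. The gain over the best baseline rounds to 4.3 whether it is computed from the rounded or the unrounded averages.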

The gains hold up where baselines wobble. SFT helps the smaller backbone but slightly hurts the larger one; GRPO and MemGen each win on one model and not the other. Tyler is the only method that lands the best average on both backbones — and the largest jumps come on the out-of-domain benchmarks it was never trained on, improving GPQA-Diamond over CoT by up to 27 points.

Tyler also holds up under continual learning. Trained sequentially across code, science, math, and theorem-oriented tasks, it finishes with the highest macro-average accuracy (39.8%, about 3.5 points above the strongest baseline) while forgetting the least: a forgetting score of just 2.3, roughly 1.4 points lower than MemGen.
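This page does not spell out how the forgetting score is computed. A common definition in the continual-learning literature, assumed here, is the average drop from each earlier task's best accuracy to its accuracy after the final training stage. The sketch below implements that definition on invented toy numbers; they are not results from the paper.

```python
def forgetting_score(acc_history):
    """acc_history[t][k] is the accuracy on task k after training stage t.
    Returns the mean drop from each earlier task's best accuracy
    (before the final stage) to its final accuracy."""
    final = acc_history[-1]
    drops = []
    for k in range(len(final) - 1):   # the last task has had no chance to be forgotten
        best_earlier = max(stage[k] for stage in acc_history[:-1])
        drops.append(best_earlier - final[k])
    return sum(drops) / len(drops)

# Toy accuracies for three tasks trained in sequence (illustrative only).
toy_history = [
    [40.0, 10.0, 12.0],   # after training on task 0
    [37.0, 30.0, 15.0],   # after training on task 1
    [36.0, 28.0, 50.0],   # after training on task 2
]
toy_forgetting = forgetting_score(toy_history)
```

Under this definition lower is better: a score of 0 would mean no earlier task lost any accuracy by the end of the sequence.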

Continual learning results
Highest final accuracy with the lowest forgetting under sequential domain adaptation.
Latent budget scaling
Accuracy climbs sharply up to a budget of 5 latent calls, then plateaus.

The operators really specialize

Their latent representations form separate clusters, and removing each one produces a different failure mode: drop O_s and arithmetic/state errors rise; drop O_p and procedural errors appear.

Routing is task-adaptive

Harder problems trigger latent operators more often, and the policy follows a stage-aware rhythm: state updates dominate early, and procedural abstractions take over near the answer.

Operator specialization diagnostics
The three operators form distinct latent clusters, cause distinct error types when removed, and are invoked at distinct reasoning stages.