Think-at-Hard

Selective latent iterations to improve reasoning language models – iterate only on hard tokens, skip the easy ones.

+5.3%
Avg. Accuracy Gain
1.07×
Avg. Iteration Depth

Overview

Iterate smarter, not more

Looped transformers refine token predictions through multiple latent iterations – but iterating on every token wastes compute and risks latent overthinking, where correct predictions get flipped to errors.

Always-Iterate applies uniform iterations to all tokens – overthinking easy tokens while still underthinking hard ones
Think-at-Hard selectively iterates only on hard tokens – improving accuracy while letting 93% of tokens skip the extra iteration
TaH Overview - Latent iterations can fix wrong predictions but also overthink correct ones

The Discovery

Most tokens don't need to think twice.

We identify latent overthinking – where extra iterations hurt rather than help – and show that selective iteration unlocks significant untapped potential.

8.7%
Tokens Corrected
The second iteration fixes 8.7% of first-pass mispredictions, showing latent iterations can genuinely help on hard tokens.
2.1%
Tokens Overthought
But it also flips 2.1% of correct predictions into errors – a latent overthinking phenomenon that mirrors explicit CoT overthinking.
+7.3%
Oracle Potential
An oracle policy that iterates only on mispredicted tokens achieves up to 7.3% improvement – and up to 32% with TaH's optimized architecture.

Oracle Iteration Policy

We establish an oracle policy $\pi$ that triggers additional iterations only when the reference LLM mispredicts the target token at the first pass. Using top-1 mismatch as the discrepancy metric, this oracle iterates on only 12–19% of tokens.
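The oracle decision rule is simple enough to sketch directly. The snippet below is a minimal illustration (the token IDs are made up, and `oracle_iteration_mask` is a hypothetical helper name, not code from the paper): a second iteration is requested only where the first-pass top-1 prediction disagrees with the ground-truth target.

```python
def oracle_iteration_mask(first_pass_top1, targets):
    """Oracle policy: request another latent iteration only where the
    first-pass top-1 prediction mismatches the ground-truth target token."""
    return [pred != tgt for pred, tgt in zip(first_pass_top1, targets)]

# Toy example: 2 of 8 tokens are mispredicted at depth 1.
preds   = [5, 12, 9, 33, 7, 41, 2, 8]
targets = [5, 12, 4, 33, 7, 41, 2, 19]
mask = oracle_iteration_mask(preds, targets)
iter_fraction = sum(mask) / len(mask)  # only this fraction iterates again
```

Note the rule needs `targets`, i.e. ground truth, which is exactly why the oracle is unusable at inference time and must be approximated by a learned decider.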

| Policy | NTP | AMC23 | MMLU100 | HE++ |
|---|---|---|---|---|
| Always-1 (no iteration) | 73.1 | 38.1 | 56.0 | 39.6 |
| Always-2 (iterate all) | 79.7 | 40.6 | 60.0 | 40.9 |
| Oracle (selective) | 81.8 (+2.1) | 47.9 (+7.3) | 62.0 (+2.0) | 43.3 (+2.4) |
| Oracle w. TaH | 89.3 (+9.6) | 68.8 (+28.2) | 85.0 (+25.0) | 72.9 (+32.0) |

Key insight: TaH's architecture better utilizes the oracle policy, achieving >25% improvement and surpassing even Qwen3-4B. The oracle requires ground-truth tokens unavailable at inference – TaH approximates it with a neural decider.

Figure: token prediction accuracy landscape across iterations

Because Ouro trains all iterations to predict all tokens, predictable tokens across depths largely overlap, leaving more tokens unpredictable by any iteration. TaH instead specializes deeper iterations for hard tokens, reducing overlap and improving coverage under selective iteration.
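The coverage argument can be made concrete with a toy set computation. Everything below is illustrative (the sets of "predictable" token indices are invented, not measured): when depth 2 mostly re-covers what depth 1 already predicts, the union grows little; when depth 2 specializes on depth-1 failures, the union grows much more.

```python
# Hypothetical sets of token indices (out of 10) that each iteration
# depth predicts correctly.
depth1 = {0, 1, 2, 3, 4, 5, 6}

# Ouro-style training: depth 2 is trained on all tokens, so its
# predictable set largely overlaps depth 1's.
overlapping_depth2 = {0, 1, 2, 3, 4, 7}

# TaH-style training: depth 2 specializes on depth-1 misses.
specialized_depth2 = {7, 8, 9}

coverage_overlapping = len(depth1 | overlapping_depth2)   # tokens covered
coverage_specialized = len(depth1 | specialized_depth2)   # tokens covered
```

Under these made-up numbers, specialization covers all 10 tokens while the overlapping depths cover only 8 – the same mechanism the paragraph describes for selective iteration.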

Architecture

TaH Design

Three architectural innovations enable efficient selective latent iteration.

TaH Architecture Overview

TaH Overview. (a) Regular causal attention. (b) Duo-causal attention extends causality to two dimensions. (c) TaH selectively iterates or verbalizes tokens using LoRA adapters and a neural iteration decider.

Duo-Causal Attention

2D causality across token positions and iteration depths – compatible with FlashAttention, no custom CUDA kernels needed

Depth-aware LoRA

LoRA adapters at $d > 1$ shift the objective from next-token prediction to hard-token refinement with <3% extra parameters

Neural Iteration Decider

Lightweight MLP that predicts which tokens need deeper thinking – trained to imitate the oracle policy in a stable two-stage scheme

1. Input & Forward Pass
2. Duo-Causal Attention
3. Depth Adapter (LoRA)
4. Iteration Decider

Standard Forward Pass (Depth 1)

Token embeddings enter the LLM backbone for a standard forward pass at depth $d=1$. The model uses its original pretrained weights $\theta$ without any LoRA adaptation. This first iteration produces standard next-token predictions – correct for ~93% of tokens.

Key: The first pass preserves the pretrained model's strong next-token prediction ability. Only hard tokens proceed to deeper iterations.

Duo-Causal Attention

Unlike standard causal attention (1D: attend to previous positions), duo-causal attention extends causality to two dimensions: tokens attend to both previous positions and shallower iteration depths. Formally: $X_{\le i}^{(\le d)} = \{x_j^{(k)} \mid j \le i, k \le d\}$.

Key: This enables cross-depth information flow – deeper tokens can access shallower representations of all previous tokens – while maintaining full parallel training via FlashAttention compatibility.
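The visibility rule $j \le i, k \le d$ can be sketched as a mask predicate over (position, depth) pairs. This is a minimal illustration of the rule itself, not the FlashAttention integration; `duo_causal_mask` is a hypothetical helper name.

```python
def duo_causal_mask(num_tokens, num_depths):
    """Boolean attention mask over (position, depth) cells.

    A query at (i, d) may attend to a key at (j, k) iff j <= i and
    k <= d: causality along both the token axis and the depth axis.
    """
    cells = [(i, d) for d in range(num_depths) for i in range(num_tokens)]
    return {
        (q, k): (k[0] <= q[0] and k[1] <= q[1])
        for q in cells for k in cells
    }

mask = duo_causal_mask(3, 2)
assert mask[(2, 1), (1, 0)]      # later token at deeper depth: visible
assert not mask[(1, 0), (1, 1)]  # may not look at a deeper iteration
assert not mask[(0, 1), (2, 0)]  # may not look at a future position
```

Because the predicate depends only on indices, the full mask can be materialized up front, which is what allows all depths to be trained in parallel rather than sequentially.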

Depth Adapter (LoRA)

At deeper iterations ($d > 1$), LoRA adapters activate on top of the shared backbone: $\theta_d = \theta + \Delta$. This shifts the model's objective from general next-token prediction to focused hard-token refinement. Residual connections across iterations simplify the refinement process.

Key: LoRA adds less than 3% extra parameters while enabling the model to specialize deeper iterations. Without LoRA, the shared weights must handle both objectives, limiting performance.
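The "<3% extra parameters" figure follows from standard LoRA accounting: a rank-$r$ update to a weight $W \in \mathbb{R}^{m \times n}$ stores factors $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, i.e. $r(m+n)$ parameters against $mn$. The shapes and rank below are illustrative, not the paper's actual configuration.

```python
def lora_extra_param_ratio(layer_shapes, rank):
    """Fraction of extra parameters a rank-r LoRA update adds:
    r*(m + n) new parameters per adapted (m x n) weight matrix."""
    base = sum(m * n for m, n in layer_shapes)
    extra = sum(rank * (m + n) for m, n in layer_shapes)
    return extra / base

# Toy config: four 2048x2048 attention projections adapted at rank 16.
shapes = [(2048, 2048)] * 4
ratio = lora_extra_param_ratio(shapes, rank=16)  # 0.015625, i.e. ~1.6%
```

For square layers the ratio simplifies to $2r/n$, so any rank well below $0.015\,n$ stays under the 3% budget.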

Neural Iteration Decider

A lightweight MLP ($\mathcal{I}_\phi$) reads concatenated hidden states from shallow, middle, and final LLM layers to predict a continuation probability $\hat{c}_i^{(d)} \in [0,1]$. If $\hat{c}_i^{(d)} < c_{\text{threshold}}$, the token verbalizes; otherwise it continues to the next iteration depth.

Key: The decider is trained in Stage 2 to imitate the oracle policy via weighted binary cross-entropy. It achieves ~83% accuracy at predicting the oracle's decisions. Tokens like "But" (34%) and "So" (18%) are most frequently selected for deeper iteration.
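The thresholding logic from the paragraph above can be sketched in a few lines. The decider probabilities here are invented for illustration, and `decide` is a hypothetical helper; the real decider is an MLP over concatenated hidden states.

```python
def decide(cont_prob, threshold=0.5):
    """Verbalize when the predicted continuation probability falls below
    the threshold; otherwise continue to the next iteration depth."""
    return "continue" if cont_prob >= threshold else "verbalize"

# Hypothetical decider outputs for five tokens; most stop after depth 1.
probs = [0.02, 0.91, 0.10, 0.04, 0.63]
decisions = [decide(p) for p in probs]

# With two iterations maximum, average depth is 1 + (fraction continued).
avg_depth = 1 + sum(d == "continue" for d in decisions) / len(probs)
```

This also shows where the reported 1.07× average depth comes from: if roughly 7% of tokens continue to a second iteration, the mean depth is 1 + 0.07.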
Two-stage training scheme

The backbone LLM and iteration decider are tightly coupled – joint training is unstable. We decouple them with a two-stage approach under a fixed oracle policy $\pi$:

Stage 1: Backbone Training

Fine-tune the LLM ($\theta$) and LoRA ($\Delta$) with $\pi$-guided iteration. Loss = standard next-token prediction at the oracle-determined depth. This preserves first-iteration accuracy for easy tokens while training deeper iterations to refine hard ones.

Stage 2: Decider Training

Freeze the backbone and train the decider ($\phi$) to imitate $\pi$'s continuation decisions via weighted binary cross-entropy. Class weights handle the label imbalance between continue (~7%) and stop (~93%) decisions.
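The weighted objective in Stage 2 can be sketched as plain weighted binary cross-entropy. The probabilities, labels, and inverse-frequency weighting scheme below are illustrative assumptions; the paper specifies only that class weights compensate the ~7%/~93% continue/stop imbalance.

```python
import math

def weighted_bce(probs, labels, w_pos, w_neg):
    """Weighted binary cross-entropy: up-weight the rare 'continue' (=1)
    labels against the common 'stop' (=0) labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Inverse-frequency weights for a 7% / 93% label split (an assumption).
w_pos, w_neg = 1 / 0.07, 1 / 0.93
loss = weighted_bce([0.8, 0.1, 0.2], [1, 0, 0], w_pos, w_neg)
```

Without the weighting, a decider could reach ~93% accuracy by always predicting "stop", which would collapse TaH back to the no-iteration baseline.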

Experiments

Benchmark Results

Consistent gains across nine reasoning benchmarks at three model scales.

TaH+ Performance Summary
+5.3%
vs Standard
35.2%
Average Accuracy
| Method | AIME25 | Olympiad | AMC23 | MATH500 | GSM8K | GPQA | MMLU | HE++ | MBPP++ | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard | 1.9 | 15.4 | 22.7 | 39.9 | 58.2 | 31.1 | 54.2 | 16.8 | 28.8 | 29.9 |
| SoftThink | 2.9 | 14.0 | 22.2 | 39.6 | 55.9 | 24.7 | 53.0 | 14.3 | 29.5 | 28.5 |
| Ouro | 2.1 | 14.2 | 19.7 | 37.4 | 56.6 | 35.4 | 54.0 | 18.9 | 23.5 | 29.1 |
| AlwaysThink | 1.3 | 12.6 | 21.9 | 37.8 | 52.6 | 30.8 | 51.4 | 9.1 | 13.8 | 25.7 |
| TaH | 2.1 | 19.1 | 24.1 | 46.2 | 63.6 | 29.0 | 56.4 | 21.6 | 33.9 | 32.9 |
| TaH+ | 4.6 | 20.6 | 24.7 | 51.8 | 67.6 | 31.3 | 59.0 | 22.0 | 35.1 | 35.2 |
Real-world efficiency on A800 GPU

1.7B model, AIME25, 8K max token length on a single NVIDIA A800 GPU:

| Metric | Standard | AlwaysThink | TaH |
|---|---|---|---|
| Avg. Depth | 1.00 | 2.00 | 1.06 |
| Memory (GB) | 4.3 | 6.8 | 4.6 |
| Latency (s) | 210.6 | 747.2 | 301.4 |
| Throughput (tok/s) | 38.9 | 11.0 | 27.2 |

TaH iterates twice on only 6% of tokens, achieving 1.48× lower memory overhead and 2.48× faster decoding than AlwaysThink.
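The headline ratios follow directly from the table above; a quick sanity check (using the table's own numbers):

```python
# Derived from the A800 table: TaH iterates twice on 6% of tokens.
avg_depth_tah = 1 + 0.06        # -> 1.06, matching the table

mem_ratio = 6.8 / 4.6           # AlwaysThink vs TaH peak memory
speedup = 747.2 / 301.4         # AlwaysThink vs TaH end-to-end latency

assert round(mem_ratio, 2) == 1.48
assert round(speedup, 2) == 2.48
```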

Contributions

Key Contributions

Selective Latent Iteration

First to identify latent overthinking in looped transformers and propose selective iteration as a new design principle – iterate only on hard tokens for better quality and efficiency.

Specialized Architecture

Duo-causal attention, depth-aware LoRA, and neural iteration decider – three components that natively support selective iteration with full training parallelism.

Efficient & Effective

+5.3–6.2% accuracy gains across nine benchmarks with <3% extra parameters and only 1.07× average iteration depth – 93% of tokens need just one pass.

Team

Meet the researchers.

1Tsinghua University    2Infinigence AI    3Shanghai Jiao Tong University
*Equal contribution

Citation

Cite our work.

If you find TaH useful for your research, please consider citing our paper.

BibTeX
@article{fu2025think,
  title={Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models},
  author={Fu, Tianyu and You, Yichen and Chen, Zekai and Dai, Guohao and Yang, Huazhong and Wang, Yu},
  journal={arXiv preprint arXiv:2511.08577},
  year={2025}
}