Roads to Rome

Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing. Let small models follow large models' reasoning paths by correcting only the divergent tokens.

2.8×
Speedup vs R1-32B
5.6B
Avg. Activated Params
1.6×
Accuracy vs R1-7B
Read Paper · View Code · 🤗 Models · NeurIPS 2025

The Challenge

LLMs are powerful but expensive.

Small models are fast but make mistakes. What if we could get the best of both worlds?

🚀

Small Models (1.5B)

Fast generation
Limited accuracy

R2R Routing

Best of both worlds
Smart token routing

🧠

Large Models (32B)

Accurate reasoning
Slow & expensive

The Discovery

Not all tokens are equal.

We discovered that only a small fraction of tokens actually cause the reasoning paths of large and small models to diverge.

89%
Identical Tokens
The vast majority of tokens are predicted identically by both models. No routing needed.
6%
Neutral Differences
Minor variations like "let's" vs "let us" that don't affect reasoning outcomes.
5%
Divergent Tokens
Critical tokens that change the reasoning path. These are the only ones we route to the LLM.
Learn more about token classification

Identical tokens occur when both the small and large language model predict the exact same next token given the same context. Since they agree, we can safely use the faster small model.

Neutral differences are cases where models predict different tokens, but these differences don't affect the reasoning outcome—like abbreviations or stylistic choices.

Divergent tokens genuinely alter the meaning, logic, or conclusion of the current sentence, causing the subsequent reasoning path to diverge. These are the critical tokens where we need the large model's guidance.
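In code, the three-way taxonomy above boils down to a single decision per position. This is a sketch only: `is_divergent` is a hypothetical callback standing in for the continuation-and-verification step, not part of the released code.

```python
# Sketch of the identical / neutral / divergent taxonomy.
# `is_divergent` is a hypothetical verifier callback.

def classify_token(slm_token, llm_token, is_divergent):
    """Return 'identical', 'neutral', or 'divergent' for one position."""
    if slm_token == llm_token:
        return "identical"      # both models agree: keep the SLM token
    if is_divergent(slm_token, llm_token):
        return "divergent"      # alters the reasoning path: use the LLM
    return "neutral"            # stylistic difference: keep the SLM token
```

Only the third case triggers a route to the large model, which is why the 5% divergent fraction translates into large speedups.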

Interactive Demo

Watch R2R in action.

See how R2R intelligently routes tokens between models in real-time.

R2R Inference Workflow
Step 1: Input Query → Step 2: SLM Predicts → Step 3: Router Checks → Step 4: Route Decision → Step 5: Output Token
AIME 2024 Problem I-1
Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of s kilometers per hour, the entire walk takes exactly 4 hours, including t minutes spent at the coffee shop. When she walks at a constant speed of s + 2 kilometers per hour, the entire walk takes exactly 2 hours and 24 minutes, including t minutes spent at the coffee shop. Suppose Aya walks at s + ½ kilometers per hour. Find the number of minutes the entire walk takes, including the t minutes spent at the coffee shop.
Click the play button above to see R2R process this problem
Small Model (R1-1.5B) generates tokens...
🔍
Neural Router Analyzing Token
The 56M-parameter neural router examines three features from the SLM: top-100 logits (prediction uncertainty), token embeddings (frequency bias), and hidden states (semantic context) to predict if this token would diverge the reasoning path.
Accept SLM Token
Identical or neutral
Route to LLM
Divergent token detected
Generated Response with R2R Routing
SLM Token (Identical)
SLM Token (Neutral)
LLM Token (Divergent)

Hover over blue or red tokens to see why the router made that decision

Watch Video Demo

Side-by-side comparison: R1-32B (left) vs R2R (right). Red tokens are routed to LLM.

How It Works

Intelligent token routing.

R2R uses a lightweight neural router trained on automatically generated token-level labels to decide which tokens need the large model's attention.

R2R Method Overview
Automatic Data Generation Pipeline

A key contribution of R2R is an automatic pipeline for generating token-level routing labels without expensive human annotation:

  1. LLM generates a complete response to the problem
  2. SLM prefills the LLM's response to identify tokens where predictions differ
  3. LLM continues from each differing token to produce alternative continuations
  4. Verifier model compares continuations to label differences as neutral or divergent

This pipeline produces large-scale training data automatically, enabling the router to learn which token differences actually matter for reasoning.
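The four steps above can be sketched as a single labeling pass. This is a hedged illustration: the `llm_continue` and `same_outcome` interfaces are hypothetical callables standing in for the LLM and the verifier model, not the authors' released pipeline.

```python
# Hedged sketch of the four-step labeling pipeline.
# `llm_continue` and `same_outcome` are hypothetical callables.

def label_tokens(llm_response, slm_preds, llm_continue, same_outcome):
    """Label each position as identical / neutral / divergent.

    llm_response : tokens of the LLM's full response (step 1)
    slm_preds    : SLM prediction at each position under prefill (step 2)
    llm_continue : LLM continuation after swapping in the SLM token (step 3)
    same_outcome : verifier comparing the two continuations (step 4)
    """
    labels = []
    for i, (ref, pred) in enumerate(zip(llm_response, slm_preds)):
        if pred == ref:
            labels.append("identical")
            continue
        alt = llm_continue(llm_response[:i] + [pred])
        labels.append("neutral" if same_outcome(llm_response[i:], alt)
                      else "divergent")
    return labels
```

Because every step is automated, the same pass can be run over an entire corpus to produce router training labels at scale.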

Data Generation Pipeline
Path-Following Routing Strategy

At each generation step, R2R compares the next-token predictions from both the SLM and LLM. If predictions are identical, the efficient SLM token is used. When they differ, a continuation-and-verification mechanism determines whether the difference is:

  • Neutral: Doesn't affect reasoning outcomes (use SLM)
  • Divergent: Alters the reasoning path (route to LLM)

The core insight is that the SLM can follow the LLM's reasoning path if we correct only the divergent tokens—the SLM then continues from the corrected token without further intervention until the next divergent token.
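The path-following loop described above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical `slm_step`, `llm_step`, and `router` callables in place of the real models and the trained neural router.

```python
# Minimal sketch of R2R's path-following decode loop.
# `slm_step`, `llm_step`, and `router` are hypothetical callables.

def r2r_decode(prompt_tokens, slm_step, llm_step, router, max_new=256):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        pred, features = slm_step(tokens)  # SLM proposes the next token
        if router(features):               # router flags it as divergent
            pred = llm_step(tokens)        # correct with the LLM's token
        tokens.append(pred)                # SLM resumes from corrected path
        if pred == "<eos>":
            break
    return tokens
```

Note that the LLM is only queried at flagged positions; between corrections the SLM decodes alone, which is where the savings come from.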

Neural Router Architecture

The router is a lightweight 56M-parameter feed-forward network that takes three inputs from the SLM:

  • Top-100 SLM logits: Captures prediction uncertainty (higher entropy indicates potential divergence)
  • Token embeddings: Encodes token frequency biases (rare tokens are more likely to diverge)
  • Last-layer hidden states: Provides semantic context for the current generation step

The router outputs a binary classification probability indicating whether the current token diverges from the LLM's reasoning path. This enables immediate routing decisions during inference without waiting for continuation verification.
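The router's interface can be sketched as a small MLP over the three concatenated feature groups. The dimensions and initialization below are illustrative only, not the paper's 56M-parameter configuration.

```python
import numpy as np

# Illustrative sketch of the router's input/output shape: an MLP over
# the concatenated SLM features. Dimensions are made up for illustration.

class DivergenceRouter:
    def __init__(self, logit_k=100, embed_dim=64, hidden_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = logit_k + 2 * embed_dim   # logits + embedding + hidden state
        self.w1 = rng.normal(0.0, 0.02, (in_dim, hidden_dim))
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, 1))

    def divergence_prob(self, top_logits, token_embed, hidden_state):
        x = np.concatenate([top_logits, token_embed, hidden_state])
        h = np.maximum(x @ self.w1, 0.0)                     # ReLU layer
        return float(1.0 / (1.0 + np.exp(-(h @ self.w2))))   # sigmoid

# A token is routed to the LLM when divergence_prob exceeds a tuned threshold.
```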

Predictive Indicators of Divergence

SLM Entropy

Divergent tokens show 3.8× higher entropy in SLM output logits, indicating prediction uncertainty.

Token Frequency

Low-frequency tokens are more likely to be divergent due to limited SLM training on rare tokens.
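The entropy signal is just Shannon entropy over the SLM's next-token distribution; a quick illustration (the 3.8× gap above is the paper's measurement, not reproduced here):

```python
import math

# Shannon entropy of a next-token distribution. Divergent positions tend
# to show flatter, higher-entropy SLM distributions.

def token_entropy(probs):
    """Entropy in nats of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A confident prediction like `[0.97, 0.01, 0.01, 0.01]` has far lower entropy than a flat `[0.25, 0.25, 0.25, 0.25]`, which is the pattern the router picks up on.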

Performance

State-of-the-art efficiency.

R2R advances the Pareto frontier of test-time scaling efficiency across AIME, GPQA, and LiveCodeBench.

Method             Avg. Accuracy   Params      Cost*
R1-7B (Baseline)   28%             7.0B        154K
R1-14B             43%             14B         234K
R2R (Ours)         46%             5.6B avg    103K
R1-32B (LLM)       50%             32B         537K

*Cost = output tokens × activated parameters (in billions). Lower is better.
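The footnote's metric can be checked directly. Note that the ~22K token count below is back-derived from the table (154K / 7.0B), not a number reported on this page.

```python
# Footnote metric: cost = output tokens x activated parameters (B).
# The 22 (thousand tokens) is inferred from the table (154K / 7.0B),
# not a figure stated on this page.

def r2r_cost(output_tokens_thousands, activated_params_billions):
    """Compute cost in the table's units (lower is better)."""
    return output_tokens_thousands * activated_params_billions
```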

Pareto Frontier Chart

R2R consistently advances the Pareto frontier across AIME, GPQA, and LiveCodeBench benchmarks.

Detailed Benchmark Results
Method           AIME   LiveCodeBench   GPQA   Avg. Params
R1-1.5B (SLM)    12%    9%              8%     1.5B
R1-7B            32%    24%             29%    7.0B
R1-14B           48%    38%             44%    14.0B
R2R (Ours)       55%    39%             44%    5.6B
R1-32B (LLM)     57%    45%             46%    32.0B
Routing Behavior Analysis
Routing Behavior

R2R naturally learns efficient routing patterns: fewer LLM calls during straightforward reply phases, and more at the beginning and end of each thought where reasoning direction is determined.

Team

Meet the researchers.

¹Tsinghua University · ²Infinigence AI · ³Shanghai Jiao Tong University
*Equal contribution

Citation

Cite our work.

If you find R2R useful for your research, please consider citing our paper.

BibTeX
@article{fu2025r2r,
  title={R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing},
  author={Fu, Tianyu and Ge, Yi and You, Yichen and Liu, Enshu and Yuan, Zhihang and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Wang, Yu},
  journal={arXiv preprint arXiv:2505.21600},
  year={2025}
}