
Roads to Rome (R2R):
Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

1Tsinghua University, 2Infinigence AI, 3Shanghai Jiao Tong University
*Equal contribution
R2R Overview

Roads to Rome (R2R) selectively utilizes LLMs only for critical, path-divergent tokens, leaving the majority of token generation to the SLM. With an average activated parameter size of 5.6B, R2R achieves 1.6x the average accuracy of R1-7B, outperforming even the R1-14B model.

Path-Following Routing Strategy Illustration

Example Problem

Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.


Abstract

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) greatly improve efficiency, their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we find that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or differ only in neutral ways, such as minor variations in abbreviations or expressions.

Leveraging this insight, we introduce Roads to Rome (R2R), a neural token router that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router.

We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R achieves 1.6x the average accuracy of R1-7B, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency.

Method Overview

Core idea of R2R

The core idea of Roads to Rome (R2R) is to selectively use a large language model (LLM) only for critical, path-divergent tokens, while relying on a more efficient small language model (SLM) for the majority of token generation. In essence, R2R lets the SLM follow the LLM's reasoning path by correcting only the divergent tokens.

Path-Following Routing Strategy

To achieve this, R2R employs a path-following routing strategy. At each step, it compares the SLM's and LLM's next-token predictions. If they are identical, the SLM token is used. If they differ, a continuation-and-verification mechanism determines whether the difference is 'neutral' (not affecting the reasoning outcome) or 'divergent' (altering the reasoning path). Divergent tokens are routed to the LLM for correction, keeping the generation aligned with the LLM's intended path. Formally, the mechanism checks whether an LLM continuation starting from the SLM's differing token maintains comparable quality to an LLM continuation starting from the LLM's own token.
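
A minimal sketch of this decision rule is shown below; `continue_with_llm` and `verify` are hypothetical stand-ins for the LLM-continuation and verifier components, not actual R2R APIs.

```python
from typing import Callable, List


def label_token(
    prefix: List[str],
    slm_token: str,
    llm_token: str,
    continue_with_llm: Callable[[List[str]], List[str]],
    verify: Callable[[List[str], List[str]], bool],
) -> str:
    """Return 'identical', 'neutral', or 'divergent' for one decoding step."""
    if slm_token == llm_token:
        return "identical"  # the SLM already follows the LLM's path

    # Let the LLM continue from both candidate next tokens.
    cont_from_slm = continue_with_llm(prefix + [slm_token])
    cont_from_llm = continue_with_llm(prefix + [llm_token])

    # If the continuation seeded by the SLM's token still reaches an outcome
    # of comparable quality, the mismatch is only a neutral rephrasing.
    return "neutral" if verify(cont_from_slm, cont_from_llm) else "divergent"
```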

R2R Labeling Method

R2R's data labeling pipeline: the LLM generates a full response; the SLM prefills it to find positions where its own next-token predictions differ; the LLM continues from each such point; and a verifier labels every difference as neutral or divergent.
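
Continuing the sketch above, the pipeline amounts to one teacher-forced SLM prefill over the LLM response followed by per-mismatch verification. `slm_prefill_predictions` is again a hypothetical placeholder, and `label_token` is the rule from the previous sketch.

```python
from typing import Callable, List


def label_response(
    llm_tokens: List[str],
    slm_prefill_predictions: Callable[[List[str]], List[str]],
    continue_with_llm: Callable[[List[str]], List[str]],
    verify: Callable[[List[str], List[str]], bool],
) -> List[str]:
    # Teacher-forced prefill: for each position i, the token the SLM itself
    # would generate given the LLM prefix llm_tokens[:i].
    slm_tokens = slm_prefill_predictions(llm_tokens)
    return [
        label_token(llm_tokens[:i], slm_tok, llm_tok, continue_with_llm, verify)
        for i, (slm_tok, llm_tok) in enumerate(zip(slm_tokens, llm_tokens))
    ]
```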

Router Training and Routing Scheme

The path-following routing strategy yields a large number of token-level model preference labels for training the router. The router, a small feed-forward network, learns to predict divergence from the SLM's logits, token embeddings, and last-layer hidden states, enabling immediate routing decisions during inference.
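
A minimal PyTorch sketch of such a router is shown below; the feature layout, layer sizes, and decision threshold are illustrative assumptions rather than R2R's released configuration.

```python
# Illustrative divergence router: a small feed-forward network over SLM-side
# features (top logit values, predicted-token embedding, last hidden state).
import torch
import torch.nn as nn


class DivergenceRouter(nn.Module):
    def __init__(self, hidden_size: int = 1536, top_k: int = 8, mlp_dim: int = 256):
        super().__init__()
        self.top_k = top_k
        in_dim = top_k + hidden_size + hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, slm_logits, token_emb, hidden_state):
        # slm_logits: [batch, vocab]; token_emb, hidden_state: [batch, hidden_size]
        top_logits = slm_logits.topk(self.top_k, dim=-1).values
        features = torch.cat([top_logits, token_emb, hidden_state], dim=-1)
        return torch.sigmoid(self.mlp(features)).squeeze(-1)  # P(divergent)


# Toy usage with random stand-in tensors; route to the LLM when P(divergent) is high.
router = DivergenceRouter()
p_divergent = router(torch.randn(4, 151936), torch.randn(4, 1536), torch.randn(4, 1536))
use_llm = p_divergent > 0.5
```

Such a router could be trained with a binary cross-entropy loss against the divergent/non-divergent labels produced by the labeling pipeline; the exact loss and threshold above are assumptions rather than the paper's recipe.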

R2R Routing Method

R2R uses a lightweight neural router to inspect the SLM's output at each decoding step, immediately corrects divergent tokens with the LLM, and then continues generation from the corrected output.
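
A minimal sketch of this decoding loop is given below; `slm_step`, `llm_step`, and `route` are hypothetical placeholders for the SLM decode step, the LLM decode step, and the trained router, and a real implementation would additionally keep both models' KV caches in sync.

```python
from typing import Any, Callable, List, Tuple


def r2r_generate(
    prompt: List[int],
    slm_step: Callable[[List[int]], Tuple[int, Any]],  # -> (draft token, router features)
    llm_step: Callable[[List[int]], int],               # -> corrected next token
    route: Callable[[Any], bool],                       # True -> divergent, use the LLM
    max_new_tokens: int = 256,
    eos_id: int = 2,
) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        draft, features = slm_step(tokens)   # SLM proposes the next token
        token = llm_step(tokens) if route(features) else draft
        tokens.append(token)                 # both models continue from the accepted token
        if token == eos_id:
            break
    return tokens
```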

Key Observations & Design

Observation 1:
Token Divergence Rarity

Distribution of identical, neutral, and divergent tokens

Only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge; most are identical or neutral.

Design Choice: Selectively use the LLM only for critical, path-divergent tokens; the SLM handles the majority.

Observation 2:
SLM Entropy as Indicator

SLM entropy distribution for divergent vs. non-divergent tokens

Divergent tokens exhibit substantially higher entropy in the SLM's output logits.

Design Choice: Router utilizes top SLM logit values as an input feature to predict divergence.
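
A small sketch connecting this observation to the design choice: compute the SLM's next-token entropy alongside the top logit values that the router consumes. The function name and top-k size here are illustrative assumptions, not R2R's exact configuration.

```python
import torch


def slm_uncertainty_features(slm_logits: torch.Tensor, top_k: int = 8):
    """slm_logits: [vocab]-shaped logits of the SLM at the current step."""
    log_probs = torch.log_softmax(slm_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum()  # tends to be high on divergent tokens
    top_logits = slm_logits.topk(top_k).values      # compact feature fed to the router
    return entropy, top_logits
```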

Observation 3:
Token Frequency Matters

Divergence rate vs. token frequency

Low-frequency tokens are more likely to be divergent due to limited SLM capacity for rare tokens.

Design Choice: Router incorporates token embeddings as input, capturing token-frequency biases.

Experiment Results

Performance and Efficiency Comparisons

Method           Accuracy   Avg. Param.   Cost (K*B)
R1-1.5B (SLM)    10%        1.5B          42
R1-7B            28%        7.0B          154
R1-14B           43%        14.0B         234
R1-32B (LLM)     50%        32.0B         537
R2R (Ours)       46%        5.6B          103

By combining the R1-1.5B SLM and the R1-32B LLM, R2R achieves impressive efficiency with only 5.6B average activated parameters.

  • Compared to the SLM baseline, R2R delivers 4.6$\times$ the accuracy of R1-1.5B.
  • Compared to mid-sized models, R2R achieves a 1.5$\times$ speedup and 1.1$\times$ the accuracy of R1-14B.
  • Against the full R1-32B LLM, it delivers a 2.8$\times$ end-to-end inference speedup while reaching 92% of the LLM's accuracy, using the LLM for only 11%–15% of generated tokens.
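
As a rough back-of-the-envelope check, the reported average activated parameter count follows directly from the LLM usage rate; the sketch below mixes the 1.5B SLM and the 32B LLM at the reported 11%–15% LLM token share and ignores the small router itself.

```python
# Back-of-the-envelope check of "average activated parameters per token".
for llm_share in (0.11, 0.13, 0.15):
    avg_params = llm_share * 32.0 + (1 - llm_share) * 1.5
    print(f"LLM share {llm_share:.0%}: ~{avg_params:.1f}B activated params per token")
# Prints roughly 4.9B, 5.5B, and 6.1B, bracketing the reported 5.6B average.
```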

Scaling Behavior

Accuracy vs. Average Activated Parameters per Token

R2R advances the Pareto frontier for accuracy versus average activated parameters, outperforming distillation and query-level routing methods.

These results demonstrate that R2R effectively balances performance and efficiency, offering substantial computational savings while maintaining high-quality outputs across challenging benchmarks.

Visualizing R2R Routing Behavior

LLM usage rate at different positions

(a) LLM usage across the entire thinking and reply process. (b) LLM usage within each thought segment. R2R routes fewer tokens to the LLM during replies and relies more on the LLM at the beginning and end of thoughts.

Analysis of R2R routing behavior on the AIME benchmark reveals several key insights:

  • R2R routes fewer tokens to the LLM during the reply phase than during the thinking process, suggesting that after deep reasoning, generating the final answer is relatively straightforward.
  • Within individual thought segments, R2R relies more heavily on the LLM at the beginning and end of each thought, where tokens determine reasoning direction and whether to continue, branch, or conclude.
  • These efficient routing patterns emerge naturally from the router training process without explicit rules, highlighting R2R's ability to learn efficient resource allocation.

Mix Inference Demo

See R2R (right) in action alongside R1-32B (left)

R2R in action: red tokens are routed to the LLM, while blue tokens are generated by the SLM, demonstrating efficient token-level routing.

BibTeX

We welcome you to explore our repository and cite our paper if you find the results interesting or useful for your research.

@article{fu2025r2r,
      title={R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing}, 
      author={Tianyu Fu and Yi Ge and Yichen You and Enshu Liu and Zhihang Yuan and Guohao Dai and Shengen Yan and Huazhong Yang and Yu Wang},
      journal={arXiv preprint arXiv:2505.21600},
      year={2025},
}