Core idea of R2R
The core idea of Roads to Rome (R2R) is to selectively use a large language model (LLM) only for critical, path-divergent tokens, while relying on a more efficient small language model (SLM) for the majority of token generation. In essence, R2R lets the SLM follow the LLM's reasoning path by correcting only the divergent tokens.
Path-Following Routing Strategy
To achieve this, R2R employs a path-following routing strategy. At each step, it compares the SLM's and the LLM's next-token predictions. If they are identical, the SLM token is used. If they differ, a continuation-and-verification mechanism determines whether the difference is 'neutral' (not affecting the reasoning outcome) or 'divergent' (altering the reasoning path). Divergent tokens are routed to the LLM for correction, ensuring the generation stays aligned with the LLM's intended path. This is formalized by checking whether an LLM continuation started from the SLM's differing token preserves the quality of an LLM continuation started from the LLM's own token.
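As a concrete illustration, the per-token check could be written roughly as the sketch below. The helpers llm_continue (let the LLM finish a sequence) and verifier (judge whether two completions reach the same outcome) are hypothetical stand-ins for the mechanisms described above, not the paper's actual implementation.

```python
def label_mismatch(prefix, slm_token, llm_token, llm_continue, verifier):
    """Decide whether an SLM/LLM next-token mismatch is neutral or divergent.

    prefix       : tokens generated so far along the LLM's path
    slm_token    : token the small model proposes at this position
    llm_token    : token the large model would produce here
    llm_continue : hypothetical helper that lets the LLM finish a sequence
    verifier     : hypothetical judge of whether two completions reach the
                   same reasoning outcome
    """
    if slm_token == llm_token:
        return "identical"          # SLM token is accepted outright

    # Let the LLM continue from both candidate tokens and compare the outcomes.
    completion_from_slm = llm_continue(prefix + [slm_token])
    completion_from_llm = llm_continue(prefix + [llm_token])

    if verifier(completion_from_slm, completion_from_llm):
        return "neutral"            # wording differs, reasoning outcome is preserved
    return "divergent"              # SLM token would derail the reasoning path
```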

R2R's data labeling pipeline: the LLM generates a response; the SLM prefills it to find mismatched tokens; the LLM continues from these points; and a verifier labels each difference as neutral or divergent.
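A rough sketch of how this labeling pipeline could be assembled is shown below, reusing the hypothetical label_mismatch helper from the previous sketch; llm.generate and slm.prefill_predictions are assumed interfaces, not actual APIs from the paper.

```python
def build_router_labels(prompt, llm, slm, llm_continue, verifier):
    """Collect (context, SLM token, label) examples for router training.

    llm.generate produces the reference response; slm.prefill_predictions
    returns, in a single prefill pass, the SLM's greedy next-token prediction
    at every position of that response.
    """
    reference = llm.generate(prompt)                      # LLM writes the full response
    slm_preds = slm.prefill_predictions(prompt, reference)

    examples = []
    for i, (llm_token, slm_token) in enumerate(zip(reference, slm_preds)):
        prefix = reference[:i]                            # context up to this position
        label = label_mismatch(prompt + prefix, slm_token, llm_token,
                               llm_continue, verifier)
        if label != "identical":                          # only mismatches need labels
            examples.append((prompt + prefix, slm_token, label))
    return examples
```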
Router Training and Routing Scheme
The path-following routing strategy generates a large number of model-preference labels for training the router. The router, a small feed-forward network, learns to predict divergence from the SLM's logits, the proposed token's embedding, and the SLM's last-layer hidden state, enabling immediate routing decisions during inference.
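A minimal PyTorch sketch of such a router is given below. The feature choice (top-k SLM logits, the proposed token's embedding, and the SLM's last-layer hidden state) follows the description above, while the specific dimensions and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DivergenceRouter(nn.Module):
    """Tiny feed-forward router predicting whether an SLM token is divergent."""

    def __init__(self, k_logits=20, embed_dim=2048, hidden_dim=512):
        super().__init__()
        # top-k SLM logits + token embedding + last-layer hidden state
        in_dim = k_logits + 2 * embed_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # logit of P(divergent)
        )

    def forward(self, topk_logits, token_embedding, last_hidden):
        features = torch.cat([topk_logits, token_embedding, last_hidden], dim=-1)
        return self.net(features).squeeze(-1)
```

Such a router can then be trained with, for example, binary cross-entropy against the neutral/divergent labels produced by the labeling pipeline.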

R2R uses a neural router to inspect the SLM's output at each step, immediately corrects divergent tokens with the LLM, and then continues generation from the corrected output.
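Putting the pieces together, the inference loop might look roughly like the following sketch, where slm.step, llm.step, and the trained router are assumed interfaces and the routing threshold is a tunable hyperparameter.

```python
import torch

def generate_with_routing(prompt, slm, llm, router, threshold=0.5, max_tokens=1024):
    """Token-level routed decoding: the SLM proposes, the router checks, the LLM corrects."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        # Assumed interface: SLM returns its proposed token plus the router's
        # input features (top-k logits, token embedding, last hidden state).
        slm_token, features = slm.step(tokens)
        if torch.sigmoid(router(*features)) > threshold:
            token = llm.step(tokens)    # predicted divergent: LLM overrides the token
        else:
            token = slm_token           # predicted neutral/identical: keep the SLM token
        tokens.append(token)
        if token == slm.eos_token_id:
            break
    return tokens
```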