
Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, Shanghai Jiao Tong University
*Equal contribution
C2C Overview

Text-to-Text (T2T) communication loses semantic information and requires slow token-by-token generation. Cache-to-Cache (C2C) enables direct semantic transfer through KV-Cache projection.

Abstract

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-Cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0× speedup in latency.

Why Beyond Text?

T2T vs C2C Example

Example: In T2T, ambiguous text fails to convey structural semantics. C2C directly transfers precise semantic understanding without intermediate text generation.

Information Bottleneck

High-dimensional representations compressed into linear text

Ambiguity

Natural language is inherently vague and can be misinterpreted

Latency

Token-by-token generation for every communication

Oracle: KV-Cache as Communication Medium

Two oracle experiments validate that KV-Cache enables effective inter-model communication

Oracle 1: Cache Enrichment

Can enriched KV-Cache improve performance without extending sequence length?

  • Direct (cache length |X|, accuracy 58.42%): prefill on the question X only and decode with cache C(X). Baseline without enrichment.
  • Few-shot (cache length |E|+|X|, accuracy 63.39%): prefill on exemplars E + question X and decode with the full cache C(EX). Longer cache that includes the exemplars.
  • Oracle (cache length |X|, accuracy 62.34%): prefill on EX but discard the exemplar segment, keeping only the question-aligned slice C*(X). Enriched cache at the same length.

✓ Enriched cache improves quality at same length
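
To make the Oracle setup above concrete, here is a minimal sketch of the cache-slicing idea using Hugging Face transformers. The model name, prompts, and slicing code are illustrative assumptions rather than the authors' code; position-id bookkeeping for the subsequent decode is omitted, and the cache is assumed to be iterable as per-layer (key, value) tensors of shape [batch, heads, seq_len, head_dim].

# Minimal sketch of the Oracle-1 cache-slicing idea (illustrative, not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

exemplars = "Q: What is 2 + 2? A: 4\n"   # E: few-shot exemplars
question = "Q: What is 3 + 5? A:"        # X: the actual question

e_ids = tok(exemplars, return_tensors="pt").input_ids
x_ids = tok(question, return_tensors="pt").input_ids
ex_ids = torch.cat([e_ids, x_ids], dim=-1)

# Prefill on E + X to obtain the enriched cache C(EX).
with torch.no_grad():
    out = model(ex_ids, use_cache=True)

# Keep only the question-aligned slice C*(X): drop the exemplar positions from
# every layer's key/value tensors. The slice has length |X| but was computed while
# attending to the exemplars, so it is semantically richer than a plain C(X).
e_len = e_ids.shape[-1]
sliced = tuple(
    (k[:, :, e_len:, :], v[:, :, e_len:, :]) for k, v in out.past_key_values
)
# Decoding from `sliced` corresponds to the Oracle row above; decoding from C(X) is Direct.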

Oracle 2: Cache Transformation

Can one model's KV-Cache be used by another?

Projection Oracle

✓ Cache is generally convertible between LLMs
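
The page does not detail the projection oracle itself, so the following is only a hedged sketch of the idea: learn a map from a sharer model's key/value space into a receiver model's and check whether the receiver can decode from the projected cache. The per-layer linear projector, tensor shapes, and MSE objective here are assumptions chosen for illustration.

# Hedged sketch of a cache-projection oracle (illustrative assumptions throughout).
import torch
import torch.nn as nn

sharer_dim, receiver_dim = 896, 1536   # per-token K/V width = heads * head_dim (example values)
proj_k = nn.Linear(sharer_dim, receiver_dim)
proj_v = nn.Linear(sharer_dim, receiver_dim)
opt = torch.optim.Adam(list(proj_k.parameters()) + list(proj_v.parameters()), lr=1e-3)

# Stand-ins for one layer's caches computed on the same text: [batch, seq_len, width].
# In a real run these would be extracted from two prefilled LLMs.
sharer_k, sharer_v = torch.randn(8, 128, sharer_dim), torch.randn(8, 128, sharer_dim)
receiver_k, receiver_v = torch.randn(8, 128, receiver_dim), torch.randn(8, 128, receiver_dim)

for step in range(100):
    loss = (nn.functional.mse_loss(proj_k(sharer_k), receiver_k)
            + nn.functional.mse_loss(proj_v(sharer_v), receiver_v))
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the receiver answers correctly when decoding from the projected cache,
# the two cache spaces are convertible in the sense claimed above.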

Cache-to-Cache Architecture

C2C Architecture

C2C Fuser: projects and fuses KV-Caches from the Sharer into the Receiver through (1) Projection, (2) Dynamic Weighting, and (3) Learnable Gating modules.

Projection Module

Concatenates and processes KV-Caches from both models

Dynamic Weighting

Input-aware modulation for adaptive information flow

Learnable Gating

Per-layer gates select which layers benefit from fusion

Training: freeze both models and train only the C2C module with a next-token prediction loss (see the sketch below)
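
As a reading aid, here is a hedged sketch of one plausible way the three modules above could compose for a single layer's KV tensors in PyTorch: a projection over concatenated Sharer/Receiver features, an input-aware dynamic weight, and a per-layer learnable gate. Layer sizes, activation choices, and the residual/gating arithmetic are assumptions for illustration, not the released implementation.

# Hedged sketch of a C2C-style fuser for one layer's KV tensors (illustrative only).
# Tensors are [batch, seq_len, width], where width = num_heads * head_dim after flattening.
import torch
import torch.nn as nn

class C2CFuserLayer(nn.Module):
    def __init__(self, sharer_dim: int, receiver_dim: int):
        super().__init__()
        # (1) Projection: map concatenated Sharer+Receiver features into Receiver space.
        self.proj = nn.Sequential(
            nn.Linear(sharer_dim + receiver_dim, receiver_dim),
            nn.SiLU(),
            nn.Linear(receiver_dim, receiver_dim),
        )
        # (2) Dynamic weighting: input-aware scalar in (0, 1) per token.
        self.dyn = nn.Sequential(nn.Linear(receiver_dim, 1), nn.Sigmoid())
        # (3) Learnable gating: one scalar gate deciding how much this layer uses fusion.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, recv_kv: torch.Tensor, sharer_kv: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([recv_kv, sharer_kv], dim=-1))
        fused = recv_kv + self.dyn(recv_kv) * fused   # residual, input-aware blend
        g = torch.sigmoid(self.gate)                  # layer-level gate
        return g * fused + (1.0 - g) * recv_kv        # gate closed -> Receiver keeps its own cache

# Usage with dummy shapes; both LLMs stay frozen and only modules like this receive gradients.
fuser = C2CFuserLayer(sharer_dim=896, receiver_dim=1536)
recv_k = torch.randn(2, 64, 1536)   # Receiver's keys for one layer
shr_k = torch.randn(2, 64, 896)     # Sharer's keys for the aligned positions
fused_k = fuser(recv_k, shr_k)      # same shape as recv_k; fed back as the Receiver's keys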

Results

Performance Comparison

Select model combinations to explore C2C performance across all benchmarks

Sharer Model

The model that provides contextual knowledge through its KV-Cache representations

Receiver Model

The model that receives and utilizes knowledge from the Sharer via KV-Cache projection


Performance Summary for Selected Model Pair

Interactive table (per selected model pair): per-benchmark accuracy under Receiver Only, Sharer Only, Routing, Text-to-Text, and Cache-to-Cache, plus the average C2C gain over the Receiver, the average C2C gain over T2T, and the average speedup vs T2T.

Generalization:

  • Different families: Works across Qwen, Llama, Gemma models
  • Different specializations: General, code-focused, and math-focused models
  • Long contexts: Consistent improvements across 0-4k, 4-8k, and 8k+ token ranges
  • Bidirectional: Works when swapping Sharer and Receiver roles

Key Contributions

Novel Paradigm

First direct semantic communication between LLMs beyond text

Effective

8.5-10.5% higher accuracy than individual models; 3.0-5.0% better than text communication; 2× faster

Flexible

Works across models, families, sizes, and specializations

BibTeX

@article{fu2025c2c,
  title={Cache-to-Cache: Direct Semantic Communication Between Large Language Models},
  author={Tianyu Fu and Zihan Min and Hanling Zhang and Jichao Yan and Guohao Dai and Wanli Ouyang and Yu Wang},
  journal={arXiv preprint arXiv:2510.03215},
  year={2025}
}