
Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, Shanghai Jiao Tong University
*Equal contribution
C2C Overview

Text-to-Text (T2T) communication loses semantic information and requires slow token-by-token generation. Cache-to-Cache (C2C) enables direct semantic transfer through KV-Cache projection.

Abstract

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-Cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0× speedup in latency.

Why Beyond Text?

T2T vs C2C Example

Example: In T2T, ambiguous text fails to convey structural semantics. C2C directly transfers precise semantic understanding without intermediate text generation.

Information Bottleneck

High-dimensional representations compressed into linear text

Ambiguity

Natural language is inherently vague and can be misinterpreted

Latency

Token-by-token generation for every communication

Oracle: KV-Cache as Communication Medium

Two oracle experiments validate that KV-Cache enables effective inter-model communication

Oracle 1: Cache Enrichment

Can enriched KV-Cache improve performance without extending sequence length?

  • Direct (cache length |X|, accuracy 58.42%): prefill on the question X only and decode with cache C(X). Baseline without enrichment.
  • Few-shot (cache length |E|+|X|, accuracy 63.39%): prefill on exemplars E + question X and decode with the full cache C(EX). Longer cache that includes the exemplars.
  • Oracle (cache length |X|, accuracy 62.34%): prefill on EX but discard the exemplar segment, keeping only the question-aligned slice C*(X). Enriched cache at the same length.

✓ Enriched cache improves quality at same length
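
To make the Oracle setup above concrete, here is a minimal sketch of the cache-slicing idea using Hugging Face transformers. The model name, prompts, and slicing code are illustrative assumptions rather than the authors' code; position-id bookkeeping for the subsequent decode is omitted, and the cache is assumed to be iterable as per-layer (key, value) tensors of shape [batch, heads, seq_len, head_dim].

# Minimal sketch of the Oracle-1 cache-slicing idea (illustrative, not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

exemplars = "Q: What is 2 + 2? A: 4\n"   # E: few-shot exemplars
question = "Q: What is 3 + 5? A:"        # X: the actual question

e_ids = tok(exemplars, return_tensors="pt").input_ids
x_ids = tok(question, return_tensors="pt").input_ids
ex_ids = torch.cat([e_ids, x_ids], dim=-1)

# Prefill on E + X to obtain the enriched cache C(EX).
with torch.no_grad():
    out = model(ex_ids, use_cache=True)

# Keep only the question-aligned slice C*(X): drop the exemplar positions from
# every layer's key/value tensors. The slice has length |X| but was computed while
# attending to the exemplars, so it is semantically richer than a plain C(X).
e_len = e_ids.shape[-1]
sliced = tuple(
    (k[:, :, e_len:, :], v[:, :, e_len:, :]) for k, v in out.past_key_values
)
# Decoding from `sliced` corresponds to the Oracle row above; decoding from C(X) is Direct.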

Oracle 2: Cache Transformation

Can one model's KV-Cache be used by another?

Projection Oracle

✓ Cache is generally convertible between LLMs
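
The page does not detail the projection oracle itself, so the following is only a hedged sketch of the idea: learn a map from a sharer model's key/value space into a receiver model's and check whether the receiver can decode from the projected cache. The per-layer linear projector, tensor shapes, and MSE objective here are assumptions chosen for illustration.

# Hedged sketch of a cache-projection oracle (illustrative assumptions throughout).
import torch
import torch.nn as nn

sharer_dim, receiver_dim = 896, 1536   # per-token K/V width = heads * head_dim (example values)
proj_k = nn.Linear(sharer_dim, receiver_dim)
proj_v = nn.Linear(sharer_dim, receiver_dim)
opt = torch.optim.Adam(list(proj_k.parameters()) + list(proj_v.parameters()), lr=1e-3)

# Stand-ins for one layer's caches computed on the same text: [batch, seq_len, width].
# In a real run these would be extracted from two prefilled LLMs.
sharer_k, sharer_v = torch.randn(8, 128, sharer_dim), torch.randn(8, 128, sharer_dim)
receiver_k, receiver_v = torch.randn(8, 128, receiver_dim), torch.randn(8, 128, receiver_dim)

for step in range(100):
    loss = (nn.functional.mse_loss(proj_k(sharer_k), receiver_k)
            + nn.functional.mse_loss(proj_v(sharer_v), receiver_v))
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the receiver answers correctly when decoding from the projected cache,
# the two cache spaces are convertible in the sense claimed above.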

Cache-to-Cache Architecture

C2C Architecture

C2C Fuser: projects and fuses KV-Caches from the Sharer into the Receiver through (1) Projection, (2) Dynamic Weighting, and (3) Learnable Gating modules.

Projection Module

Concatenates and processes KV-Caches from both models

Dynamic Weighting

Input-aware modulation for adaptive information flow

Learnable Gating

Per-layer gates select which layers benefit from fusion

Training: freeze both models and train only the C2C module with a next-token prediction loss (see the sketch below)
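
As a reading aid, here is a hedged sketch of one plausible way the three modules above could compose for a single layer's KV tensors in PyTorch: a projection over concatenated Sharer/Receiver features, an input-aware dynamic weight, and a per-layer learnable gate. Layer sizes, activation choices, and the residual/gating arithmetic are assumptions for illustration, not the released implementation.

# Hedged sketch of a C2C-style fuser for one layer's KV tensors (illustrative only).
# Tensors are [batch, seq_len, width], where width = num_heads * head_dim after flattening.
import torch
import torch.nn as nn

class C2CFuserLayer(nn.Module):
    def __init__(self, sharer_dim: int, receiver_dim: int):
        super().__init__()
        # (1) Projection: map concatenated Sharer+Receiver features into Receiver space.
        self.proj = nn.Sequential(
            nn.Linear(sharer_dim + receiver_dim, receiver_dim),
            nn.SiLU(),
            nn.Linear(receiver_dim, receiver_dim),
        )
        # (2) Dynamic weighting: input-aware scalar in (0, 1) per token.
        self.dyn = nn.Sequential(nn.Linear(receiver_dim, 1), nn.Sigmoid())
        # (3) Learnable gating: one scalar gate deciding how much this layer uses fusion.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, recv_kv: torch.Tensor, sharer_kv: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([recv_kv, sharer_kv], dim=-1))
        fused = recv_kv + self.dyn(recv_kv) * fused   # residual, input-aware blend
        g = torch.sigmoid(self.gate)                  # layer-level gate
        return g * fused + (1.0 - g) * recv_kv        # gate closed -> Receiver keeps its own cache

# Usage with dummy shapes; both LLMs stay frozen and only modules like this receive gradients.
fuser = C2CFuserLayer(sharer_dim=896, receiver_dim=1536)
recv_k = torch.randn(2, 64, 1536)   # Receiver's keys for one layer
shr_k = torch.randn(2, 64, 896)     # Sharer's keys for the aligned positions
fused_k = fuser(recv_k, shr_k)      # same shape as recv_k; fed back as the Receiver's keys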

Results

Performance Comparison

Select model combinations to explore C2C performance across all benchmarks

Sharer Model

The model that provides contextual knowledge through its KV-Cache representations

Receiver Model

The model that receives and utilizes knowledge from the Sharer via KV-Cache projection


Performance Summary for Selected Model Pair

Interactive table (per selected model pair): per-benchmark accuracy under Receiver Only, Sharer Only, Routing, Text-to-Text, and Cache-to-Cache, plus the average C2C gain over the Receiver, the average C2C gain over T2T, and the average speedup vs T2T.

Generalization:

  • Different families: Works across Qwen, Llama, Gemma models
  • Different specializations: General, code-focused, and math-focused models
  • Long contexts: Consistent improvements across 0-4k, 4-8k, and 8k+ token ranges
  • Bidirectional: Works when swapping Sharer and Receiver roles

Key Contributions

Novel Paradigm

First direct semantic communication between LLMs beyond text

Effective

8.5-10.5% higher accuracy than individual models; 3.0-5.0% better than text communication; 2× faster

Flexible

Works across models, families, sizes, and specializations

BibTeX

@article{fu2025c2c,
  title={Cache-to-Cache: Direct Semantic Communication Between Large Language Models},
  author={Tianyu Fu and Zihan Min and Hanling Zhang and Jichao Yan and Guohao Dai and Wanli Ouyang and Yu Wang},
  journal={arXiv preprint arXiv:2510.03215},
  year={2025}
}