Cache-to-Cache

Direct semantic communication between LLMs through KV-Cache projection and fusion — faster and more accurate than text-based communication.

  • +10.5% accuracy vs. individual models
  • +5.0% accuracy vs. text-to-text communication
  • 2.0× speedup
ICLR 2026

Overview

LLMs communicating beyond text

Multi-LLM systems combine diverse models to leverage their complementary strengths. But current text-based communication creates bottlenecks.

Text-to-Text (T2T) loses rich semantic information and requires slow token-by-token generation
Cache-to-Cache (C2C) projects and fuses KV-Caches directly — preserving semantics with 2× speedup
C2C Overview

Motivation

Why communicate beyond text?

Text-based communication between LLMs has fundamental limitations

Information Bottleneck

High-dimensional semantic representations are compressed into linear text sequences

Ambiguity

Natural language is inherently vague and can be misinterpreted by the receiver

Latency

Slow token-by-token generation is required for every communication exchange

T2T vs C2C Example

Example: Text-to-Text communication loses structural semantics through ambiguous descriptions. Cache-to-Cache transfers precise semantic understanding directly.

The Discovery

KV-Cache carries semantic meaning.

Oracle experiments reveal three key findings that motivate Cache-to-Cache communication.

Beneficial
Cache Enrichment Works
Enriching KV-Cache semantics achieves near few-shot performance (62.3% vs 63.4%) without extending sequence length.
Convertible
Cache Transfers Across Models
KV-Cache from one model can be projected into another model's representation space via simple transformation.
Complementary
Models Have Distinct Strengths
Different LLMs encode distinct semantic understandings — their correct-answer sets show limited overlap despite similar accuracy.
Learn more about our oracle experiments

Oracle 1: Cache Enrichment

Question: Can enriched KV-Cache improve response quality without increasing cache size?

Few-shot prompting improves accuracy, but is the gain from attending to more tokens, or from enriching how the question is embedded in the KV-Cache? We test this by prefilling on the exemplar-question concatenation E⊕X, then discarding E's cache entries and keeping only the question-aligned cache slice.

  • Direct: 58.42% accuracy (cache length |X|)
  • Few-shot: 63.39% accuracy (cache length |E| + |X|)
  • Oracle: 62.34% accuracy (cache length |X| only)

Finding: The gain comes from richer question embeddings, not from attending to extra tokens. KV-Cache can carry enriched semantics.
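The oracle's cache-slicing step can be sketched in a few lines. This is a toy model, not the paper's implementation: the "prefill" below is a stand-in for causal attention (each cache entry summarizes all preceding tokens), and the token values are made up. It only illustrates how the kept slice has length |X| yet still depends on E.

```python
# Toy sketch of the cache-enrichment oracle (hypothetical, not the
# paper's code): prefill on E + X, then keep only the X-aligned slice.

def prefill_cache(tokens):
    """Stand-in for causal attention: each position's cache entry is the
    cumulative mean of every token up to and including itself."""
    cache, running = [], 0.0
    for i, t in enumerate(tokens, start=1):
        running += t
        cache.append(running / i)
    return cache

E = [1.0, 2.0, 3.0]   # exemplar tokens (few-shot context, illustrative)
X = [10.0, 20.0]      # question tokens (illustrative)

direct = prefill_cache(X)                  # cache built from X alone
enriched = prefill_cache(E + X)[len(E):]   # prefill on E+X, keep X slice

# Both caches have length |X|, but the enriched entries were computed
# while "attending" to E, so they carry extra context.
assert len(direct) == len(enriched) == len(X)
assert direct != enriched
```

The point of the slice `[len(E):]` is that the sequence length seen at decode time never grows, which is exactly the condition the oracle controls for.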

Oracle 2: Cache Transformation

Question: Can one model's KV-Cache be utilized by another model?

We trained a 3-layer MLP to map KV-Cache from Qwen3-4B (source) to Qwen3-0.6B (target). t-SNE visualization shows the transformed cache moves into the target model's representation space.

t-SNE visualization of cache transformation

Finding: The transformed cache occupies a subset of the target's space — different models encode distinct semantic understandings. This suggests fusing specialized contextual understanding from different models could harness their complementary strengths.
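For concreteness, here is what a 3-layer MLP projector looks like in plain Python. All dimensions and weights are illustrative (real KV vectors have per-layer, per-head dimensions in the hundreds); only the shape of the computation, a source-dim input mapped through two hidden layers to a target-dim output, follows the experiment.

```python
# Hypothetical sketch of Oracle 2's cache projector: a 3-layer MLP that
# maps a source-model KV vector (dim 4 here) into the target model's
# space (dim 2 here). Sizes and weights are illustrative only.
import random

random.seed(0)

def linear(x, weights, bias):
    # One dense layer: each output is a dot product of x with a weight row.
    return [sum(xi * wij for xi, wij in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(n_in, n_out):
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

# 3 layers: source dim 4 -> hidden 8 -> hidden 8 -> target dim 2
layers = [make_layer(4, 8), make_layer(8, 8), make_layer(8, 2)]

def project(kv_vec):
    h = kv_vec
    for i, (w, b) in enumerate(layers):
        h = linear(h, w, b)
        if i < len(layers) - 1:   # no activation on the output layer
            h = relu(h)
    return h

source_kv = [0.5, -1.2, 0.3, 0.8]   # one KV vector from the source model
target_kv = project(source_kv)      # now lives in the target's dimension
assert len(target_kv) == 2
```

In the actual experiment such a projector would be trained (e.g. with a regression or next-token objective) rather than randomly initialized; the sketch only shows the mapping's structure.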

Architecture

Cache-to-Cache Fuser

C2C Architecture

C2C Fuser: Projects and fuses KV-Caches from Sharer into Receiver through (1) Projection, (2) Dynamic Weighting, and (3) Learnable Gating modules.

Projection Module

Concatenates and processes KV-Caches from both models through projection and feature fusion layers

Dynamic Weighting

Input-aware head modulation layer for adaptive information flow between models

Learnable Gating

Per-layer trainable gates that select which layers benefit from cache fusion

Training: Both models are frozen; only the C2C module is trained, with a next-token prediction loss.
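The three stages above can be sketched for a single layer's cache. Everything concrete here is an assumption: the scalar "concatenation," the single head weight, and the sigmoid gate are stand-ins for the paper's projection, head-modulation, and gating modules; only the three-stage structure follows the architecture described.

```python
# Sketch of one C2C Fuser layer under stated assumptions (illustrative
# scalars in place of real projection / head-modulation / gate modules).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_layer(recv_kv, sharer_kv, proj_w, head_weight, gate_logit):
    # (1) Projection: combine receiver and sharer cache entries, then
    # map back to the receiver's dimensionality (a scalar weight here).
    fused = [proj_w * (r + s) for r, s in zip(recv_kv, sharer_kv)]
    # (2) Dynamic weighting: modulate the fused signal per attention
    # head (one scalar stands in for the input-aware modulation layer).
    fused = [head_weight * f for f in fused]
    # (3) Learnable gating: a per-layer gate decides how much fused
    # cache replaces the receiver's original cache at this layer.
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * r for f, r in zip(fused, recv_kv)]

recv = [1.0, 2.0, 3.0]   # receiver's KV entries at this layer
shar = [0.5, 0.5, 0.5]   # sharer's entries, already projected to match
out = fuse_layer(recv, shar, proj_w=0.5, head_weight=1.0, gate_logit=10.0)
# With a near-open gate (sigmoid(10) ~ 1), the output is ~0.5 * (r + s):
assert all(abs(o - 0.5 * (r + s)) < 1e-3
           for o, r, s in zip(out, recv, shar))
```

A gate_logit driven far negative instead would shut the layer off, returning the receiver's cache untouched, which is how per-layer gates let training select which layers benefit from fusion.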

Experiments

Benchmark Results

Evaluate C2C performance across different model combinations

Interactive results table (columns: Receiver, Sharer, Benchmark, Receiver Only, Sharer Only, Routing, Text-to-Text, Cache-to-Cache), with views comparing C2C vs. receiver-only, C2C vs. text-to-text, and speedup vs. T2T.

Summary

Key Contributions

Novel Paradigm

First direct semantic communication between LLMs beyond text-based interfaces

Effective & Efficient

8.5-10.5% higher accuracy than individual models; 3.0-5.0% higher than text-based communication; ~2× faster

Flexible & General

Works across model families, sizes, and specializations

Team

Meet the researchers.

1Tsinghua University    2Infinigence AI    3The Chinese University of Hong Kong
4Shanghai AI Laboratory    5Shanghai Jiao Tong University
*Equal contribution    †Corresponding author

Citation

Cite our work.

If you find C2C useful for your research, please consider citing our paper.

BibTeX
@article{fu2025c2c,
  title={Cache-to-Cache: Direct Semantic Communication Between Large Language Models},
  author={Fu, Tianyu and Min, Zihan and Zhang, Hanling and Yan, Jichao and Dai, Guohao and Ouyang, Wanli and Wang, Yu},
  journal={arXiv preprint arXiv:2510.03215},
  year={2025}
}