Category

KV Cache

KV cache research — key-value cache management, compression, injection, and efficient serving for LLMs.

8 papers

Benchmark · Agent Memory

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Yakov Pyotr Shkolnikov

· 2026

Agent Memory Below the Prompt stores each agent’s KV state in a block pool, quantizes it via a Q4 pipeline, reloads it with BatchQuantizedKVCache, and reuses it across phases using cross-phase context injection. On Gemma 3 12B, it reduces cold TTFT from 172,096 ms to 1,264 ms at 32K context (a 136× speedup) over FP16 prefix-caching baselines such as vllm-mlx.
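A minimal sketch of the persistence idea, assuming nothing about the paper's actual code: group-wise 4-bit quantization of a KV tensor stored in a dict-based block pool keyed by agent and phase. The function names, pool layout, and group size are illustrative; the BatchQuantizedKVCache reload path is not reproduced.

```python
import numpy as np

def quantize_q4(kv, group=32):
    """Group-wise 4-bit quantization: per-group scale and zero-point."""
    flat = kv.reshape(-1, group).astype(np.float32)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8                 # 16 levels -> codes in [0, 15]
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo                             # codes could be packed 2-per-byte

def dequantize_q4(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

pool = {}                                           # block pool keyed by (agent, phase)
kv = np.random.randn(8, 128, 64).astype(np.float16) # heads x tokens x head_dim
pool[("agent-0", "planning")] = (quantize_q4(kv), kv.shape)

(q, scale, lo), shape = pool[("agent-0", "planning")]
restored = dequantize_q4(q, scale, lo, shape)
print(np.abs(restored - kv.astype(np.float32)).mean())  # small reconstruction error
```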

Benchmark

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Jianlong Lei, Shashikant Ilager

· 2026

ARKV dynamically combines per-layer OQ ratio estimation, token importance scoring, and tri-state cache assignment to manage KV cache precision under a global memory budget. On LongBench, ARKV reaches 0.972 relative performance versus 0.979 for the Origin baseline while achieving a 4× KV memory reduction and retaining ~86% of baseline throughput (tokens per second).
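A toy version of the tri-state idea, assuming importance scores are already computed (ARKV's scoring and per-layer OQ ratio estimation are not reproduced); the function name, states, and per-token byte costs are illustrative.

```python
import numpy as np

def tri_state_assign(scores, budget, fp16_cost=2.0, q4_cost=0.5):
    """Greedy tri-state assignment: spend a global memory budget on the
    most important tokens first (FP16), then cheaper Q4, then evict."""
    state = np.full(len(scores), "EVICT", dtype=object)
    spent = 0.0
    for i in np.argsort(-scores):            # most important token first
        if spent + fp16_cost <= budget:
            state[i], spent = "FP16", spent + fp16_cost
        elif spent + q4_cost <= budget:
            state[i], spent = "Q4", spent + q4_cost
    return state

rng = np.random.default_rng(0)
scores = rng.random(16)                      # stand-in token importance scores
print(tri_state_assign(scores, budget=13.0)) # 6 FP16, 2 Q4, rest evicted
```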

Benchmark

Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Yasong Fan

· 2026

Fan Duality Model (FDM) uses the Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to separate wave-like compression from particle-like associative recall. On WikiText-103, FDM reaches 64.9 perplexity with Freeze-Scan and 62.79 with holographic decoding, while achieving 0.966 MQAR accuracy versus 0.606 for a Transformer baseline.
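The summary does not pin down the Fan Operator's form, so as a stand-in here is the textbook outer-product associative memory that underlies most O(1)-decode-memory designs (linear attention, holographic representations); it illustrates the constant-memory claim and MQAR-style recall, not FDM itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100
S = np.zeros((d, d))                     # fixed-size state: O(1) in sequence length

keys = rng.standard_normal((n, d))
vals = rng.standard_normal((n, d))
for k, v in zip(keys, vals):             # decode loop: memory never grows with n
    S += np.outer(k, v)                  # write one key-value association

recalled = keys[-1] @ S                  # read back the last association
cos = recalled @ vals[-1] / (np.linalg.norm(recalled) * np.linalg.norm(vals[-1]))
print(S.nbytes, round(float(cos), 3))    # constant memory; approximate recall
```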

Benchmark

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Yuri Kuratov, Matvey Kairov et al.

· 2026

GradMem combines a WRITE phase, a READ phase, a context encoder Eθ, a self-supervised WRITE objective Lwrite, and a meta-learned initialization M0 to optimize prefix memory tokens via test-time gradient descent while keeping model weights frozen. On associative KV-retrieval with 96 key–value pairs, GradMem with 5 gradient WRITE steps reaches 88.4% exact match versus 12.9% for forward-only RMT with the same 8-vector memory.
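A toy rendering of the WRITE/READ split, with a frozen linear layer standing in for the LLM and a placeholder reconstruction loss standing in for Lwrite; the encoder Eθ and the meta-learned initialization M0 are not reproduced.

```python
import torch

torch.manual_seed(0)
d, mem_slots, write_steps = 32, 8, 5

frozen = torch.nn.Linear(d, d)                 # stand-in for the frozen LLM
for p in frozen.parameters():
    p.requires_grad_(False)                    # model weights stay frozen

context = torch.randn(16, d)                   # context to be written into memory
M = torch.zeros(mem_slots, d, requires_grad=True)  # a meta-learned M0 would init this
opt = torch.optim.SGD([M], lr=0.5)

for _ in range(write_steps):                   # WRITE: test-time gradient descent on M
    # placeholder self-supervised objective: make the frozen model's read-out
    # of the memory match a summary of the context (the real Lwrite differs)
    loss = ((frozen(M).mean(0) - context.mean(0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()                            # gradients flow only into M
    opt.step()

# READ: M now serves as prefix memory tokens for the frozen model
print(loss.item())
```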

Benchmark · Cognitive Architecture

Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Ying Xie

· 2026

SleepGate augments transformers with a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger that periodically rewrite the KV cache during sleep micro-cycles. On the PI-LLM benchmark, SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while full KV cache, sliding window, H2O, StreamingLLM, and a decay-only ablation all stay below 18% across all depths.
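A toy sketch of the tag-then-consolidate loop on a symbolic cache rather than a real KV cache; the Entry fields and function names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: str
    value: str
    t: int                # temporal tag
    conflicts: int = 0    # how many newer bindings interfere with this one

cache: list[Entry] = []

def write(key, value, t):
    for e in cache:
        if e.key == key:  # conflict-aware tagging: older binding now interferes
            e.conflicts += 1
    cache.append(Entry(key, value, t))

def sleep_cycle():
    """Sleep micro-cycle: the forgetting gate drops superseded bindings;
    consolidation rewrites the cache with only the survivors."""
    survivors = {}
    for e in cache:
        if e.key not in survivors or e.t > survivors[e.key].t:
            survivors[e.key] = e
    cache[:] = survivors.values()

for t, color in enumerate(["red", "green", "blue"]):
    write("ball_color", color, t)         # proactive interference: one key, 3 bindings
sleep_cycle()
print([(e.key, e.value) for e in cache])  # [('ball_color', 'blue')]
```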

Benchmark · RAG

MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Qian Wang, Zahra Yousefijamarani et al.

· 2025

MEPIC extends vLLM with a Chunk Cache Coordinator, Chunk Matcher, Hybrid KV Manager, Chunk LRU Manager, and Chunk Processor to manage canonical, page-aligned, position-independent KV chunks in HBM. On long-context workloads, MEPIC reduces HBM usage by up to 5.21× and lowers latency by up to 11.48% compared to CacheBlend on Mistral-7B-Instruct-v0.3.
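A minimal content-addressed LRU chunk cache conveys the position-independent reuse idea; the class and method names are invented for illustration, and the positional re-alignment that real position-independent caching needs (e.g., re-applying rotary offsets) is elided.

```python
from collections import OrderedDict
import hashlib

PAGE = 16                                      # tokens per page-aligned chunk

class ChunkCache:
    """Content-addressed, position-independent KV chunk cache with LRU eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get_or_build(self, tokens, build):
        key = hashlib.sha256(repr(tokens).encode()).hexdigest()  # canonical chunk id
        if key in self.store:
            self.store.move_to_end(key)        # LRU touch on hit
            return self.store[key]
        kv = build(tokens)                     # prefill only on miss
        self.store[key] = kv
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least-recently-used chunk
        return kv

cache = ChunkCache(capacity=8)
doc = tuple(range(48))
chunks = [doc[i:i + PAGE] for i in range(0, len(doc), PAGE)]
built = []
def build(tokens):
    built.append(tokens)
    return f"kv[{tokens[0]}..{tokens[-1]}]"
for c in chunks + chunks:                      # second pass reuses chunks anywhere
    cache.get_or_build(c, build)
print(len(built))                              # 3 prefills for 6 lookups
```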

Benchmark

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri et al.

· 2025

TurboQuant combines MSE Optimal TurboQuant, Inner-product Optimal TurboQuant, QJL, a random rotation matrix Π, and a Lloyd-Max quantizer to quantize vectors online with near-optimal distortion-rate guarantees. TurboQuant comes within a factor of π√3/2 ≈ 2.72 of the Shannon lower bound for MSE and matches full-precision quality for KV cache quantization at 3.5 bits per channel.
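The rotate-then-quantize recipe can be sketched in a few lines, here at 1 bit per coordinate with the Gaussian Lloyd-Max levels ±√(2/π); the paper's QJL component and multi-bit inner-product-optimal variant are not reproduced, and the scale handling is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation matrix Π

LEVEL = np.sqrt(2 / np.pi)        # 1-bit Lloyd-Max levels for a standard Gaussian

def quantize(x):
    z = Pi @ x                    # rotate: coordinates become near-i.i.d. Gaussian
    s = np.linalg.norm(z) / np.sqrt(d)   # per-coordinate scale (one float of overhead)
    return np.sign(z), s          # one bit per coordinate plus the scale

def dequantize(bits, s):
    return Pi.T @ (bits * LEVEL * s)     # reconstruct levels, then un-rotate

x = rng.standard_normal(d)
x_hat = dequantize(*quantize(x))
print(x @ x_hat / (x @ x))        # ≈ 2/π: inner products shrink by a known factor
```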