MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler et al.

2025

TL;DR

MEPIC uses page-aligned, position-independent KV caching with fused RoPE attention to cut HBM usage by up to 5.21× while matching EPIC and CacheBlend accuracy.

THE PROBLEM

HBM KV Caches Thrash Under Reused Chunks and 5× Duplication

Modern RAG and coding agents repeatedly process long prompts that share document chunks, yet existing position-independent caching (PIC) systems achieve only modest HBM savings even under heavy reuse.

Without cross-request HBM reuse, each request recomputes or reloads KV for popular chunks, multiplying the HBM footprint by the number of overlapping requests and causing cache thrashing, higher tail latency, and lower throughput.
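
To see the scale of the duplication, a back-of-the-envelope sketch helps. All numbers below are illustrative assumptions (Llama-2-7B-style dimensions, fp16, 512-token chunks, 8 overlapping requests), not figures from the paper:

```python
# Illustrative KV duplication estimate; model dims and request counts
# are assumptions for this sketch, not numbers from the MEPIC paper.
layers, kv_heads, head_dim = 32, 32, 128   # Llama-2-7B-style shape
bytes_per_elem = 2                         # fp16
chunk_tokens = 512
overlapping_requests = 8                   # requests sharing one chunk

# K and V each hold layers * kv_heads * head_dim values per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
chunk_mib = chunk_tokens * kv_bytes_per_token / 2**20

print(f"KV for one 512-token chunk: {chunk_mib:.0f} MiB")       # 256 MiB
print(f"Duplicated {overlapping_requests}x without sharing: "
      f"{overlapping_requests * chunk_mib / 1024:.1f} GiB")     # 2.0 GiB
```

Cross-request sharing collapses those eight copies back to one, which is exactly the reuse MEPIC enables in HBM.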

HOW IT WORKS

MEPIC — Chunk-Aware KV Management with Fused RoPE Attention

MEPIC’s core mechanism wires Chunk Processor, Chunk Matcher, Hybrid KV Manager, Chunk Cache Coordinator, and Chunk LRU Manager into vLLM’s paged KV store to manage reusable chunks.

You can think of MEPIC as treating hot document chunks like cache lines in a CPU, with HBM as L1 and LMCache as a backing store.

By enforcing canonical, page-aligned, position-independent KV via fused RoPE attention, MEPIC enables cross-request chunk sharing that a plain context window or prefix cache cannot provide.
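
The enabling idea is that cached KV is keyed by chunk content rather than by prompt position, so any request containing the same chunk can hit the same pages. Below is a minimal sketch of that lookup, where ChunkKVCache, BLOCK_TOKENS, and the hashing scheme are hypothetical stand-ins for MEPIC's actual components:

```python
import hashlib

BLOCK_TOKENS = 16  # vLLM-style page size in tokens (illustrative value)

def chunk_key(token_ids: list[int]) -> str:
    """Content hash of a chunk. Because the key ignores where the chunk
    sits in a prompt, requests that share the chunk at different offsets
    all map to the same cached KV pages."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

class ChunkKVCache:
    """Hypothetical chunk cache mapping content hashes to page-aligned
    KV block IDs shared across requests (a sketch, not MEPIC's code)."""

    def __init__(self) -> None:
        self.blocks: dict[str, list[int]] = {}

    def lookup(self, token_ids: list[int]) -> list[int] | None:
        return self.blocks.get(chunk_key(token_ids))

    def insert(self, token_ids: list[int], block_ids: list[int]) -> None:
        # Canonical layout: chunks are padded so their KV starts and
        # ends on a block boundary, making pages reusable verbatim.
        assert len(token_ids) % BLOCK_TOKENS == 0, "pad to block alignment"
        self.blocks[chunk_key(token_ids)] = block_ids
```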

DIAGRAM

Query-Time Flow with Selective KV Recomputation

This diagram shows how MEPIC processes a query through segmentation, residency resolution, selective KV recomputation, and fused RoPE attention in the computation path.

DIAGRAM

Evaluation Pipeline Across Datasets and QPS

This diagram shows how MEPIC is evaluated on four QA datasets under varying QPS and context lengths against EPIC and CacheBlend.

PROCESS

How MEPIC Handles a Request in the Scheduling and Computation Paths

  1. Segmentation and Canonicalization

     MEPIC uses the Chunk Processor to partition tokens into chunk and prompt segments, then pads them asymmetrically to enforce block-aligned canonical KV layouts.

  2. Chunk-Aware KV Residency Management

     The Chunk Matcher and Hybrid KV Manager consult the Prefix Cache Coordinator and Chunk Cache Coordinator to classify segments as HBM-resident or non-resident and to locate LMCache copies.

  3. Eviction and Allocation under Pressure

     The Chunk LRU Manager reclaims zero-reference chunk KV blocks, while the Hybrid KV Manager allocates shared paged KV blocks for new chunks and prompt segments (a toy sketch of this residency and eviction logic follows the list).

  4. Selective KV Recomputation and Fused RoPE Attention

     MEPIC recomputes only the first KV block per chunk, commits KV to paged storage, and runs fused RoPE attention over NoPE KV to inject positions on the fly.
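
To make the residency and eviction steps concrete, here is a toy reference-counted LRU over chunk KV blocks. It is a sketch under the mechanics described above, with hypothetical names; MEPIC's Chunk LRU Manager actually operates inside vLLM's paged allocator:

```python
from collections import OrderedDict

class ChunkLRU:
    """Toy reference-counted LRU for chunk KV blocks (illustrative,
    not MEPIC's implementation). Only zero-reference chunks are
    evictable, mirroring step 3 above."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.used = 0
        self.entries: "OrderedDict[str, dict]" = OrderedDict()

    def get(self, key: str):
        """Look up a chunk; pin it for the duration of a request."""
        entry = self.entries.get(key)
        if entry is not None:
            self.entries.move_to_end(key)  # mark most-recently used
            entry["refs"] += 1
        return entry

    def put(self, key: str, n_blocks: int) -> None:
        """Admit a new chunk, evicting idle LRU chunks under pressure."""
        for k in list(self.entries):
            if self.used + n_blocks <= self.capacity:
                break
            if self.entries[k]["refs"] == 0:  # evict only unpinned chunks
                self.used -= self.entries[k]["blocks"]
                del self.entries[k]
        assert self.used + n_blocks <= self.capacity, "HBM pool exhausted"
        self.entries[key] = {"blocks": n_blocks, "refs": 1}
        self.used += n_blocks

    def release(self, key: str) -> None:
        """Unpin a chunk when its request completes."""
        self.entries[key]["refs"] -= 1
```

A request pins each chunk it touches via get, newly computed chunks enter via put, and release drops the pin at completion, so only idle chunks are ever reclaimed.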

KEY CONTRIBUTIONS

Key Contributions

  • Chunk-Aware HBM KV Management

    MEPIC introduces a Chunk Cache Coordinator, Hybrid KV Manager, Chunk Matcher, and Chunk LRU Manager to manage canonical chunk pages in a shared HBM pool alongside vLLM’s prefix cache, cutting peak HBM usage by up to 2× over PIC baselines.

  • Position-Independent KV Caching via Fused RoPE Attention

    MEPIC stores NoPE KV and fuses RoPE into the attention kernel, enabling deterministic chunk reuse across positions and requests without model changes (see the sketch after this list).

  • System Integration with vLLM and LMCache

    MEPIC integrates into the vLLM plus LMCache stack, using deterministic page-aligned chunk materialization and LMCache’s CPU or disk tiers for remote chunk persistence.
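
The second contribution is the heart of the design. The sketch below shows the principle in framework-free NumPy: keys are cached without positions (NoPE), and RoPE is applied to queries and cached keys during attention using each chunk's offset in the current request. MEPIC fuses this rotation into the attention kernel rather than materializing rotated keys; the function names here are illustrative:

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Interleaved-pair RoPE (base 10000) on x of shape (seq, head_dim)."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    ang = positions[:, None] * inv_freq[None, :]        # (seq, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def fused_rope_attention(q_nope, k_nope, v, q_pos, k_pos):
    """Attention over position-free cached KV: RoPE is injected here,
    so one cached copy of k_nope serves a chunk at any prompt offset;
    only k_pos changes between requests."""
    q = rope_rotate(q_nope, q_pos)
    k = rope_rotate(k_nope, k_pos)      # positions applied on the fly
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v
```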

RESULTS

By the Numbers

  • Latency (SQuAD): 116.03 s (−3.38 s vs CacheBlend)
  • Peak HBM (SQuAD): 27.67% (−26.80 pp vs CacheBlend)
  • F1 (SQuAD): 0.74 (+0.01 vs CacheBlend)
  • HBM vs QPS: 5.74× lower HBM usage than CacheBlend across varying QPS

On SQuAD, a reading comprehension benchmark, MEPIC achieves 0.74 F1 with 27.67% peak HBM and 116.03 s latency, slightly improving accuracy while halving HBM versus CacheBlend and EPIC. Across QPS from 2 to 25, MEPIC lowers HBM usage by 5.74× relative to CacheBlend and 5.25× relative to EPIC, demonstrating scalable memory savings.

BENCHMARK

Comparison of Total Latency and HBM Usage Across Baselines on SQuAD

Peak HBM Usage (%) on SQuAD for MEPIC, CacheBlend, and EPIC.

KEY INSIGHT

The Counterintuitive Finding

MEPIC recomputes only the first KV block per chunk yet matches or slightly exceeds CacheBlend and EPIC accuracy, achieving 0.74 F1 on SQuAD versus 0.73 and 0.72.

This is surprising because prior PIC methods assumed larger recomputation budgets, yet MEPIC shows that minimal, deterministic recomputation plus NoPE KV is enough for high fidelity.

WHY IT MATTERS

What this unlocks for the field

MEPIC unlocks page-aligned, cross-request chunk KV sharing in HBM, enabling up to 5.21× lower memory usage for long prompts without changing models.

Builders can now serve multi-turn, long-context RAG and coding agents on fixed HBM budgets while sustaining higher QPS and lower latency than PIC-only systems.
