ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Authors: Jianlong Lei, Shashikant Ilager

2026

TL;DR

ARKV uses layer-wise OQ ratios plus tri-state KV caching to retain a 0.972 relative LongBench score while cutting KV memory 4× with only ~14.4% of tokens quantized.



THE PROBLEM

Long-context KV cache dominates GPU memory, reaching 40GB at just 1k tokens

KV cache memory in LLaMA 3.1 70B reaches 40GB with batch size 128 and sequence length 1024, quickly exhausting GPU capacity.

This makes long-context LLM inference a memory-bound problem, limiting throughput and batch size for agentic workflows and long-document reasoning.
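As a sanity check on that number, here is a quick back-of-the-envelope calculation, assuming LLaMA 3.1 70B's published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 cache entries; the exact serving setup in the paper may differ.

```python
# Rough KV cache size for LLaMA 3.1 70B, assuming its published config:
# 80 layers, 8 KV heads (grouped-query attention), head dim 128, fp16 cache.
num_layers, num_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2            # fp16
batch_size, seq_len = 128, 1024

# Keys and values are both cached, hence the factor of 2.
per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem   # 4 KiB
total = per_token_per_layer * num_layers * seq_len * batch_size

print(f"{total / 2**30:.1f} GiB")   # ~40 GiB, matching the figure above
```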

HOW IT WORKS

ARKV — Tri-state KV cache with layer-aware OQ ratios

ARKV combines Per-layer OQ ratio estimation, Token importance via heavy-hitter scoring, and a Tri-State Cache Tailor with Mixed-Precision Integration to assign each token to the Original, Quantization, or Eviction state.

You can think of ARKV like a smart memory controller that keeps hot data in fast RAM, pushes warm data into compressed storage, and drops cold data entirely.

This design lets ARKV stretch the effective context window under a strict KV budget, something a plain fixed-precision context window cannot achieve without large accuracy loss.
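To make the analogy concrete, the sketch below shows one way the tri-state assignment could work for a single layer. It is a minimal illustration, not the authors' implementation: the per-token int8 quantizer and the helper names (tri_state_tailor, quantize, dequantize) are stand-ins for whatever ARKV actually uses.

```python
import numpy as np

def tri_state_tailor(keys, values, scores, budget_o, budget_q):
    """Assign cached tokens to Original / Quantization / Eviction states.

    keys, values : (seq_len, kv_dim) arrays for one layer's cache.
    scores       : (seq_len,) importance score per cached token.
    budget_o/_q  : tokens kept at full precision / kept quantized.
    Hypothetical helper, not the paper's code.
    """
    order = np.argsort(-scores)                       # most important first
    idx_original = order[:budget_o]                   # Original state
    idx_quant = order[budget_o:budget_o + budget_q]   # Quantization state
    # Everything outside the two budgets is the Eviction state: dropped.

    # Stand-in low-bit scheme: symmetric per-token int8 quantization.
    def quantize(x):
        scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
        return np.round(x / scale).astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    k_q, k_scale = quantize(keys[idx_quant])
    v_q, v_scale = quantize(values[idx_quant])

    # Mixed-precision integration: rebuild one unified cache before attention.
    kept = np.concatenate([idx_original, idx_quant])
    k_cache = np.concatenate([keys[idx_original], dequantize(k_q, k_scale)])
    v_cache = np.concatenate([values[idx_original], dequantize(v_q, v_scale)])
    return kept, k_cache, v_cache
```

The design point this illustrates is that Original tokens never pass through the quantizer, so the most attended tokens keep exact keys and values while the warm middle of the ranking absorbs the quantization error.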

DIAGRAM

Decoding-time KV cache tailoring in ARKV

This diagram shows how ARKV tailors the KV cache during decoding using heavy-hitter scores and the tri-state assignment under a global budget.

DIAGRAM

ARKV evaluation pipeline across models and budgets

This diagram shows how ARKV is evaluated on different LLaMA3 and Qwen3 models, cache budgets, and benchmarks like LongBench and GSM8K.

PROCESS

How ARKV Handles a Long-Context Decoding Session

  1. Prefill Phase

    During the prefill phase, ARKV collects attention scores A and computes per-layer entropy, variance, and kurtosis as inputs to Per-layer OQ ratio estimation.

  2. Per-layer OQ Ratio Estimation

    ARKV converts these statistics into OQ scores q_ℓ and OQ ratios ρ_ℓ, then allocates per-layer budgets B_o^(ℓ) and B_q^(ℓ) for Original and Quantization tokens (a combined sketch follows this list).

  3. Token Importance via Heavy-Hitter Scoring

    During decoding, ARKV computes heavy-hitter scores S_k^(ℓ) from mean µ_k^(ℓ) and variance σ_k^(ℓ) over recent attention, aligned with grouped-query attention.

  4. Tri-State Cache Tailor and Mixed-Precision Integration

    ARKV applies the Tri-State Cache Tailor to select the token index sets I_o, I_q, and I_e, then reconstructs a unified mixed-precision KV cache before each attention step.
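As promised above, here is a combined sketch of steps 2 and 3 under simple assumptions. The paper's exact formula for turning entropy, variance, and kurtosis into q_ℓ and ρ_ℓ is not reproduced here, so the weighting below is a stand-in, as is the way mean and variance are combined into S_k^(ℓ).

```python
import numpy as np

def oq_budgets_for_layer(attn, budget, rho_min=0.1, rho_max=0.9):
    """Estimate a layer's Original/Quantization budgets from prefill attention.

    attn: (heads, q_len, k_len) attention weights A collected at prefill.
    How entropy, variance, and kurtosis combine into the OQ score q_l is a
    guess here; the paper's exact weighting may differ.
    """
    p = attn.mean(axis=(0, 1))
    p = p / p.sum()                                   # per-key attention mass
    entropy = -(p * np.log(p + 1e-12)).sum()
    variance = p.var()
    kurtosis = ((p - p.mean()) ** 4).mean() / (p.std() ** 4 + 1e-12)

    # Heuristic: flat attention (high entropy, low kurtosis) spreads importance
    # over many tokens, so keep a larger Original share at full precision.
    q_score = entropy / (1.0 + variance + np.log1p(kurtosis))
    rho = np.clip(q_score / (1.0 + q_score), rho_min, rho_max)   # OQ ratio

    budget_o = int(rho * budget)      # B_o^(l): full-precision tokens
    budget_q = budget - budget_o      # B_q^(l): quantized tokens
    return budget_o, budget_q

def heavy_hitter_scores(recent_attn):
    """Online importance S_k^(l) for each cached token.

    recent_attn: (steps, kv_heads, k_len) attention each cached token received
    over the last few decode steps, grouped per KV head to match grouped-query
    attention. Summing mean and variance is an assumption, not the paper's
    exact formula.
    """
    mu = recent_attn.mean(axis=(0, 1))      # mu_k^(l)
    sigma = recent_attn.var(axis=(0, 1))    # sigma_k^(l)
    return mu + sigma
```

During decoding, these scores would feed the tri-state selection sketched earlier, with B_o^(ℓ) and B_q^(ℓ) as its per-layer budgets.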

KEY CONTRIBUTIONS

Key Contributions

  • Tri-state KV cache management framework

    ARKV introduces a tri-state KV cache management framework that unifies eviction and quantization via Tri-State Cache Tailor and Mixed-Precision Integration under a global memory budget.

  • Layer-aware Original–Quantization ratio

    ARKV proposes a lightweight Per-layer OQ ratio estimation mechanism using entropy, variance, and kurtosis to guide per-layer Original and Quantization budgets.

  • Fast online heavy-hitter scoring

    ARKV designs Token importance via heavy-hitter scoring that ranks tokens online and achieves ∼14.4% quantization ratio while retaining ∼97% LongBench accuracy.

RESULTS

By the Numbers

LongBench relative score: 0.972 (-0.007 vs Base Origin at 0.979)

Base Quant relative score: 0.398 (-0.574 vs ARKV on LongBench)

GSM8K accuracy at budget 512: 0.697 (+0.687 over Base Quant at 0.010)

Quant Ratio: 14.39% (Evict Ratio 87.80% at budget 512)

On LongBench, which tests long-context understanding, ARKV reaches 0.972 relative performance versus 0.979 for Origin and 0.398 for Quant, showing that mixed precision preserves accuracy under tight KV budgets. On GSM8K, which stresses math reasoning, ARKV achieves 0.697 accuracy at budget 512 compared to 0.010 for Quant, demonstrating that ARKV avoids catastrophic quantization errors while still compressing the cache.

BENCHMARK

Overall Relative Performance on LongBench

Relative performance on LongBench (Base normalized to 1.00) for different KV cache strategies.
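A note on the normalization, under the assumption that relative performance is each method's LongBench average divided by the full-cache Base average (the paper may instead normalize per task and then average); the numbers below are placeholders, not the paper's per-task scores.

```python
def relative_score(method_avg: float, base_avg: float) -> float:
    """Relative LongBench performance, with the full-cache Base at 1.00."""
    return method_avg / base_avg

# Illustrative only: if Base averages 45.0 LongBench points and a budgeted
# run averages 43.7, its relative score is 43.7 / 45.0 ≈ 0.971.
print(round(relative_score(43.7, 45.0), 3))
```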

KEY INSIGHT

The Counterintuitive Finding

ARKV uses only about 14.4% quantized tokens while evicting up to 87.80% of tokens at budget 512, yet still keeps 0.972 LongBench performance.

This is surprising because many expect aggressive low-bit quantization to drive memory savings, but ARKV shows that careful eviction plus limited quantization can preserve accuracy far better than uniform 4-bit or 8-bit compression.

WHY IT MATTERS

What this unlocks for the field

ARKV enables long-context inference with 4× KV memory reduction while retaining ∼97% accuracy and ∼86% of baseline throughput, even on demanding tasks like LongBench and GSM8K.

Builders can now deploy LLaMA3 and Qwen3 with ultra-long contexts on single GPUs, using ARKV as a drop-in KV manager without retraining or architectural changes.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
