Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Author: Yakov Pyotr Shkolnikov

2026

TL;DR

Agent Memory Below the Prompt uses a persistent Q4 KV cache with a block pool and BatchQuantizedKVCache to cut Gemma 3 TTFT by up to 136× at 32K context.



THE PROBLEM

Multi-Agent Edge Systems Stall on 15.7 s Re-Prefill per Agent

On Apple M4 Pro, a 10-agent workflow must constantly evict caches, forcing a 15.7-second re-prefill per agent at 4K context.

This cripples multi-agent LLM workflows: time-to-first-token explodes, and edge devices with a 10.2 GB cache budget cannot keep more than 3 agents resident at 8K context, since an 8K FP16 cache runs roughly 3 GB per agent at the paper's 1.5 GB-per-4K footprint.

HOW IT WORKS

Persistent Q4 KV Cache with Block Pool and BatchQuantizedKVCache

Agent Memory Below the Prompt combines a block pool, a Q4 quantization pipeline, a BatchQuantizedKVCache, and cross-phase context injection to persist per-agent KV caches in safetensors format.

Think of RAM as a working desk and SSD as a filing cabinet: Agent Memory Below the Prompt moves KV cache pages between them, using Q4 compression to fit four times more agents in the same drawer.

This design lets Agent Memory Below the Prompt restore attention state directly into the model, eliminating O(n) prefill and enabling 22–136× TTFT reductions that plain context windows and FP16 prefix caching cannot achieve.
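
As a rough illustration of the persistence path, here is a minimal MLX sketch that flattens one agent's quantized KV tensors into a safetensors file and restores them with a single sequential read. The key names, the cache/ directory, and the per-layer dict layout are illustrative assumptions, not the paper's exact schema:

```python
import mlx.core as mx

def save_agent_cache(agent_id: str, kv_layers: list[dict[str, mx.array]]) -> None:
    # Flatten per-layer Q4 tensors (packed weights, scales, biases) into
    # one file so the cache survives eviction and server restarts.
    flat = {}
    for i, layer in enumerate(kv_layers):
        for name, tensor in layer.items():
            flat[f"layer_{i}.{name}"] = tensor
    mx.save_safetensors(f"cache/{agent_id}.safetensors", flat)

def load_agent_cache(agent_id: str) -> dict[str, mx.array]:
    # One sequential read restores the attention state; the model then
    # resumes decoding without re-running O(n) prefill.
    return mx.load(f"cache/{agent_id}.safetensors")
```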

DIAGRAM

Interleaved Multi-Agent Inference Flow with Cache Reload

This diagram shows how Agent Memory Below the Prompt interleaves agent decode and Q4 cache reload to hide disk latency during multi-agent inference.
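
As a hedged sketch of the interleaving idea, here is a toy version using plain asyncio as a stand-in for the paper's ConcurrentScheduler; all names and latencies are illustrative, not measured:

```python
import asyncio

async def reload_cache(agent_id: str) -> str:
    # Stand-in for an mx.load of the agent's Q4 safetensors file;
    # the 50 ms latency is illustrative.
    await asyncio.sleep(0.05)
    return f"kv-cache:{agent_id}"

async def decode_step(agent_id: str, n_tokens: int = 8) -> None:
    for _ in range(n_tokens):
        await asyncio.sleep(0.01)  # stand-in for one decode forward pass

async def main() -> None:
    # Kick off agent B's disk reload, then decode for agent A: the
    # reload completes behind A's compute, hiding disk latency.
    reload_b = asyncio.create_task(reload_cache("agent-b"))
    await decode_step("agent-a")
    cache_b = await reload_b  # already resolved (or nearly so)
    print("agent-b restored:", cache_b)

asyncio.run(main())
```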

DIAGRAM

Evaluation Matrix and TTFT Measurement Pipeline

This diagram shows how Agent Memory Below the Prompt measures TTFT across models, cache states, and context lengths.

PROCESS

How Agent Memory Below the Prompt Handles a Multi-Phase Agent Session

  1. Block Pool with Per-Agent Isolation

    Agent Memory Below the Prompt allocates fixed 256-token KV blocks per agent in the block pool, using ModelCacheSpec to capture the architecture-specific cache layout.

  2. Q4 Quantization Pipeline

    Agent Memory Below the Prompt quantizes FP16 KV tensors into 4-bit values packed in uint32, plus bfloat16 scales and biases, achieving a 0.281 Q4:FP16 size ratio (a 72 percent memory reduction); see the quantization sketch after this list.

  3. Batched Quantized Inference

    Agent Memory Below the Prompt merges multiple agents into a BatchQuantizedKVCache, runs chunked prefill and interleaved decode under the ConcurrentScheduler, then extracts per-agent caches.

  4. Cross-Phase Context Injection

    Agent Memory Below the Prompt reloads prior-phase KV state, enforces monotonic prompt extension for EXTEND matches, and accumulates working memory across phases without recomputation.
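
For step 02, a minimal quantization sketch using MLX's built-in mx.quantize and mx.dequantize. The group size of 64 is an assumption (the paper reports bfloat16 scales and biases but not the grouping), and the tensor shape is illustrative:

```python
import mlx.core as mx

# Illustrative KV tensor: (n_kv_heads * seq_len, head_dim), bfloat16.
k = mx.random.normal((8 * 256, 128)).astype(mx.bfloat16)

# 4-bit quantization: values packed into uint32 words, plus per-group
# scales and biases. group_size=64 is assumed, not from the paper.
k_q, scales, biases = mx.quantize(k, group_size=64, bits=4)

# Approximate reconstruction before (or fused into) attention.
k_hat = mx.dequantize(k_q, scales, biases, group_size=64, bits=4)

# Why the ratio is ~0.281: 4 bits/element of data plus (16 + 16) bits of
# bf16 scale/bias per 64-element group gives 4.5 bits/element, and
# 4.5 / 16 = 0.28125, matching the reported Q4:FP16 ratio.
```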

KEY CONTRIBUTIONS

Key Contributions

  • Persistent Block Pool with Per-Agent Isolation

    Agent Memory Below the Prompt introduces a block pool that stores each agent’s Q4 KV cache as safetensors, surviving server restarts and fitting 4K Gemma contexts in 0.42 GB instead of 1.5 GB (the arithmetic is sketched after this list).

  • BatchQuantizedKVCache for Concurrent Q4 Inference

    Agent Memory Below the Prompt adds BatchQuantizedKVCache and a ConcurrentScheduler to run batched quantized inference, reaching 22.6 system tokens per second at 1K warm context for Gemma with two agents.

  • Cross-Phase Context Injection as Working Memory

    Agent Memory Below the Prompt implements cross-phase context injection that reuses KV state across multi-phase workflows, roughly halving Phase 5 TTFT from 3292 ms to 1705 ms in the interrogation scenario.
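
The 0.42 GB figure in the first contribution follows from the packing arithmetic; here is a back-of-envelope check (pure arithmetic, with the 64-element group size assumed as above):

```python
# Q4 bits per element: 4-bit data + bf16 scale and bias per 64-element group.
q4_bits = 4 + (16 + 16) / 64   # = 4.5 bits/element (group size assumed)
ratio = q4_bits / 16           # vs FP16: 4.5 / 16 = 0.28125 (~the 0.281 reported)

fp16_gb = 1.5                  # Gemma 3 4K-context FP16 KV cache (from the paper)
print(f"Q4 cache: {fp16_gb * ratio:.2f} GB")  # -> 0.42 GB, as reported
```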

RESULTS

By the Numbers

  • TTFT, Gemma 32K cold: 172,096 ms (+170,832 ms over 32K hot at 1,264 ms)

  • TTFT, Gemma 32K hot: 1,264 ms (136× faster than 32K cold)

  • TTFT, Llama 16K warm: 431 ms (111× faster than 16K cold at 47,629 ms)

  • TTFT, Gemma 4K warm: 577 ms (27× faster than 4K cold at 15,736 ms)

These metrics come from the TTFT scaling experiments on Gemma 3 12B and Llama 3.1 8B, measuring cold, warm, and hot cache states across 1K–32K contexts. The results show that Agent Memory Below the Prompt turns multi-minute cold prefill into sub-second cache restoration while preserving throughput.


BENCHMARK

TTFT (ms) by Cache State for Gemma 3 at 4K Context

Time to first token for Gemma 3 12B at 4K tokens under different cache states.

KEY INSIGHT

The Counterintuitive Finding

At short contexts (1K–8K), warm disk reload in Agent Memory Below the Prompt is up to 40–55 percent faster than the hot in-memory cache.

This is surprising because developers expect SSD I/O to be slower than RAM, but a single sequential mx.load read beats scattered in-memory hash lookups for small Q4 safetensors files.
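
A minimal way to time the warm path yourself, assuming an MLX environment and a hypothetical cache file; this measures only the sequential load, and a fair hot-path comparison (timing the in-memory block-pool lookups) is outside this sketch:

```python
import time
import mlx.core as mx

# Time one warm reload: a single sequential read of a small Q4
# safetensors file. The path is hypothetical.
t0 = time.perf_counter()
cache = mx.load("cache/agent-0.safetensors")
mx.eval(*cache.values())  # force materialization off disk
print(f"warm reload: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```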

WHY IT MATTERS

What this unlocks for the field

Agent Memory Below the Prompt makes it practical to run 5–20-agent systems on fixed-RAM edge devices while keeping per-agent context effectively unbounded via Q4 persistence.

Builders can now ship on-device multi-agent assistants that survive restarts, avoid 15.7-second re-prefills, and maintain isolated per-agent working memory without datacenter-scale hardware.


