AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Author: Emmanuel Bamidele

2026

TL;DR

AMV-L uses value-driven lifecycle tiers with bounded retrieval eligibility to cut p99 latency from 5398.167 ms (TTL) to 1233.430 ms while boosting throughput 3.1×.


THE PROBLEM

Tail latency from unbounded retrieval working sets (13.8% of TTL requests exceed 2s)

TTL-based memory keeps many items eligible for retrieval, so candidate sets and vector scans grow, causing heavy-tailed latency and unstable throughput.

Under TTL, 13.813% of requests exceed 2 seconds and p99 latency reaches 5398.167 ms, making long-running LLM agents violate tail SLOs and waste capacity.

HOW IT WORKS

AMV-L: Adaptive Memory Value Lifecycle

AMV-L combines a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to control which items stay on the retrieval path.

You can think of AMV-L like an OS memory hierarchy: the Hot tier is RAM, Warm/Cold tiers are slower storage, and eligibility is a budgeted working set.

This value-driven lifecycle lets AMV-L bound retrieval cost independently of total retained memory, something a plain context window or TTL retention cannot provide.
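The core idea above can be sketched in a few lines: retention is unbounded, but the retrieval-eligible set is budgeted, so the scan cost stays fixed no matter how much memory accumulates. This is a minimal illustration, not the paper's implementation; the budget constants and the FIFO demotion rule are assumptions.

```python
import random

# Illustrative budgets (assumed, not from the paper).
HOT_BUDGET = 4       # max items eligible for full similarity search
WARM_SAMPLE_K = 2    # warm items sampled per request

memory = []          # total retained items (can grow without bound)
hot, warm = [], []

def retain(item):
    """Retention never deletes; new items start Hot, overflow demotes to Warm."""
    memory.append(item)
    hot.append(item)
    if len(hot) > HOT_BUDGET:
        warm.append(hot.pop(0))  # demote oldest Hot item

def eligible():
    """Retrieval candidate set: all of Hot plus a bounded Warm sample."""
    k = min(WARM_SAMPLE_K, len(warm))
    return hot + random.sample(warm, k)

for i in range(1000):
    retain(f"item-{i}")

# Retained memory grew to 1000 items, but the scan set stays bounded.
print(len(memory), len(eligible()))  # → 1000 6
```

A plain TTL policy has no such separation: everything not yet expired is eligible, so the candidate set grows with retention and the tail latency grows with it.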

DIAGRAM

Request-time Retrieval and Prompt Construction Flow

This diagram shows how AMV-L processes a single request, from query embedding through tier-aware retrieval, value updates, and prompt construction.

DIAGRAM

Evaluation Pipeline for TTL, LRU, and AMV-L

This diagram shows how the workload is replayed three times with different memory policies to compare TTL, LRU, and AMV-L under identical conditions.

PROCESS

How AMV-L Handles a Request — Adaptive Memory Value Lifecycle

  1. Memory items and value state

    AMV-L represents each memory item with content, metadata, an embedding, and a scalar value V(m) updated online by the Memory Value Model.

  2. Tiered lifecycle organization

    AMV-L assigns items to Hot, Warm, or Cold tiers using the Tiered Lifecycle, separating long-term retention from retrieval eligibility.

  3. Lifecycle transitions

    The Lifecycle Manager promotes, demotes, and evicts items based on value thresholds and hysteresis, keeping the Hot tier as the effective working set.

  4. Retrieval and prompt construction

    The Bounded Retrieval Path builds R = T_H ∪ Sample_k(T_W), runs similarity search, applies the prompt-injection cap n, and feeds the selected chunks into the LLM Answer Pipeline.
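The four steps above can be sketched as one request handler: build the bounded candidate set R = T_H ∪ Sample_k(T_W), rank it by similarity, cap the injected chunks at n, and reinforce the value of whatever was used. The dictionary layout, cosine scoring, and the additive reinforcement are illustrative assumptions.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def handle_request(query_emb, hot, warm, k=8, n=4, reinforce=0.1):
    # Step 4: bounded candidate set R = T_H ∪ Sample_k(T_W)
    candidates = hot + random.sample(warm, min(k, len(warm)))
    # Similarity search runs over R only, never over the full store.
    scored = sorted(candidates,
                    key=lambda m: cosine(query_emb, m["emb"]),
                    reverse=True)
    selected = scored[:n]          # prompt-injection cap n
    for m in selected:             # Step 1: online value reinforcement
        m["value"] += reinforce
    return selected

hot = [{"emb": [1.0, 0.0], "value": 0.5, "text": "hot-A"},
       {"emb": [0.9, 0.1], "value": 0.5, "text": "hot-B"}]
warm = [{"emb": [0.0, 1.0], "value": 0.2, "text": f"warm-{i}"} for i in range(20)]

chunks = handle_request([1.0, 0.0], hot, warm, k=5, n=3)
print([m["text"] for m in chunks][0])  # → hot-A (the most similar item)
```

Note that the per-request cost is governed by len(hot) + k, not by the total number of retained items, which is exactly the property the lifecycle manager is protecting.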

KEY CONTRIBUTIONS

Key Contributions

  • Persistent agent memory as a request-path systems resource

    AMV-L formalizes memory as a working-set resource and shows that bounding retrieval eligibility cuts p99 latency from 5398.167 ms (TTL) to 1233.430 ms.

  • Adaptive Memory Value Lifecycle

    AMV-L introduces a value-driven Tiered Lifecycle with Hot, Warm, and Cold tiers plus a Memory Value Model that uses decay and reinforcement to manage eligibility.

  • Tail latency and throughput tradeoff frontier

    AMV-L demonstrates a tradeoff versus LRU, with +26% median latency but −15% p99 and −98% requests over 2s, while using ≈6% fewer tokens per request.
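The decay-and-reinforcement value model with hysteresis can be made concrete with a short sketch. The exponential decay form, the specific thresholds, and the two-band hysteresis rule are all assumptions for illustration; the paper only states that values decay, are reinforced on use, and drive tier transitions with hysteresis.

```python
import math

DECAY_RATE = 0.1   # per-time-unit exponential decay (assumed)
PROMOTE_AT = 0.7   # value threshold to enter Hot (assumed)
DEMOTE_AT  = 0.4   # lower threshold to leave Hot (hysteresis gap, assumed)

def decayed(value, dt):
    """V(m) decays toward zero while an item goes unused."""
    return value * math.exp(-DECAY_RATE * dt)

def next_tier(tier, value):
    """Hysteresis: promotion and demotion use different thresholds,
    so items hovering near a single cutoff do not thrash between tiers."""
    if tier != "hot" and value >= PROMOTE_AT:
        return "hot"
    if tier == "hot" and value < DEMOTE_AT:
        return "warm"
    return tier

v, tier = 1.0, "hot"
for step in range(12):
    v = decayed(v, dt=1.0)
    tier = next_tier(tier, v)

# After enough idle steps the value falls below DEMOTE_AT and the item demotes.
print(round(v, 3), tier)  # → 0.301 warm
```

The gap between PROMOTE_AT and DEMOTE_AT is what keeps the Hot tier stable: an item must decay well below its promotion level before it loses eligibility, and must be re-reinforced well above its demotion level before it regains it.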

RESULTS

By the Numbers

  • Throughput: 36.977 req/s (+27.950 req/s over TTL)

  • Latency p99: 1233.430 ms (−4164.737 ms vs TTL)

  • Latency > 2 s: 0.007% (−13.806 percentage points vs TTL)

  • Tokens/request (mean): 675.388 (≈6% fewer than LRU at similar retrieval quality)

On a synthetic long-running workload with 50,000 writes and 20,000 retrieval and Ask requests, AMV-L is compared against TTL and LRU. The results show that AMV-L restores predictable latency tails and high throughput while keeping token and retrieval quality metrics comparable to LRU.


BENCHMARK

Reliability and end-to-end performance (TTL vs LRU vs AMV-L)

Throughput (req/s) comparison for TTL, LRU, and AMV-L under the shared long-running workload.

KEY INSIGHT

The Counterintuitive Finding

AMV-L scans more vectors at p95 than LRU (690 vs 261) yet still achieves lower p99 latency (1233.430 ms vs 1452.706 ms).

This is surprising because we usually assume fewer scanned vectors always mean better tails, but AMV-L’s value-driven eligibility shows that which items are eligible matters more than raw scan count.

WHY IT MATTERS

What this unlocks for the field

AMV-L gives LLM agents a controllable memory working set, so tail latency depends on Hot tier size, not total retained items.

Builders can now run long-lived agents with large persistent memories while still meeting strict p95 and p99 SLOs, without aggressive TTL deletion or manual memory pruning.


