Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang

arXiv 2026

TL;DR

Anatomy of Agentic Memory uses a structure-first taxonomy plus empirical analysis to link four memory structures to benchmark saturation, metric validity, backbone sensitivity, and system cost.

THE PROBLEM

Agentic memory benchmarks saturate under long contexts and mislead evaluation

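Anatomy of Agentic Memory shows that benchmarks like MemBench, at roughly 100k tokens, fit entirely inside a 128k context window. When the whole history fits in context, a model can answer without any external memory, so the benchmark saturates and stops being a valid memory test.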
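
The saturation condition can be sketched as a simple size check. This is a minimal illustration, not the paper's procedure: the function name `fits_in_context`, the `reserve_tokens` parameter, and the round figures of 100,000 and 128,000 tokens are assumptions standing in for the approximate numbers quoted above.

```python
def fits_in_context(benchmark_tokens: int, window_tokens: int,
                    reserve_tokens: int = 0) -> bool:
    """Return True when the full benchmark history fits in the model window.

    reserve_tokens is a hypothetical allowance for the prompt and the answer.
    """
    return benchmark_tokens + reserve_tokens <= window_tokens


# Approximate figures from the text: MemBench at ~100k tokens, a 128k window.
membench_tokens = 100_000   # assumed round value for "~100k"
window_tokens = 128_000     # assumed round value for "128k"

saturated = fits_in_context(membench_tokens, window_tokens)
headroom = window_tokens - membench_tokens

print(f"saturated={saturated}, headroom={headroom} tokens")
```

When the check returns True, a full-context baseline sees everything a memory system could retrieve, which is the situation the paper flags as context saturation.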

On LoCoMo, lexical F1 ranks Nemori first at 0.502, while semantic judges score MAGMA at 0.670. The disagreement between token overlap and judged correctness shows that lexical metrics can be misaligned with answer quality and can mislead system comparisons.

HOW IT WORKS

Anatomy of Agentic Memory — structure-first taxonomy plus empirical bottleneck analysis

Anatomy of Agentic Memory introduces Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory as four structural categories.

You can think of these as different memory layouts in a computer: flat logs, per-user records, episodic snapshots, and graph or OS-like hierarchies coordinating RAM and disk.

This structure-first view lets Anatomy of Agentic Memory connect memory design to benchmark saturation, metric misalignment, backbone sensitivity, and latency–throughput trade-offs that a plain context window obscures.

DIAGRAM

Memory interaction and evaluation flow across architectures

This diagram shows how Anatomy of Agentic Memory traces user queries through different memory structures and into evaluation of accuracy, saturation, and cost.

DIAGRAM

Evaluation pipeline and context saturation test

This diagram shows how Anatomy of Agentic Memory evaluates MAG systems with the Context Saturation Gap and LLM-as-a-judge across LoCoMo and other benchmarks.

PROCESS

How Anatomy of Agentic Memory Handles a Memory-Augmented Generation Evaluation Session

01
Benchmark Scalability Analysis
Anatomy of Agentic Memory measures volume, interaction depth, and entity diversity to detect context saturation on datasets like HotpotQA, LoCoMo, LongMemEval, and MemBench.
02
Context Saturation Gap
Anatomy of Agentic Memory computes the Context Saturation Gap Δ = Score_MAG − Score_FullContext to test whether external memory is structurally beneficial.
03
LLM as a Judge Evaluation
Anatomy of Agentic Memory uses gpt-4o-mini as a semantic judge with three prompt protocols, comparing F1 rankings to semantic scores on LoCoMo.
04
System Performance Evaluation
Anatomy of Agentic Memory profiles retrieval, generation, and maintenance latency plus construction cost on LoCoMo for AMem, MemoryOS, Nemori, MAGMA, and SimpleMem.
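
The Context Saturation Gap from step 02 is a single subtraction, and a small sketch makes its sign convention concrete. The names `GapResult`, `context_saturation_gap`, and `tolerance` are illustrative assumptions, and the two scores in the usage lines are made-up placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapResult:
    delta: float        # Score_MAG - Score_FullContext
    memory_helps: bool  # True when delta exceeds the tolerance


def context_saturation_gap(score_mag: float, score_full_context: float,
                           tolerance: float = 0.0) -> GapResult:
    """Compute the gap between a memory system and a full-context baseline."""
    delta = score_mag - score_full_context
    return GapResult(delta=delta, memory_helps=delta > tolerance)


# Placeholder scores for illustration only (not taken from the paper).
result = context_saturation_gap(score_mag=0.60, score_full_context=0.65)
print(f"delta={result.delta:+.2f}, memory_helps={result.memory_helps}")
```

Under this reading, a gap at or below zero suggests the benchmark fits in context and the external memory adds nothing measurable, while a positive gap suggests the memory structure is doing real work.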

KEY CONTRIBUTIONS

Key Contributions

01
Taxonomy of Agentic Memory
Anatomy of Agentic Memory defines four structures—Lightweight Semantic, Entity-Centric and Personalized, Episodic and Reflective, Structured and Hierarchical—clarifying how organization shapes accuracy and efficiency.
02
Benchmark Saturation and Context Gap
Anatomy of Agentic Memory introduces the Context Saturation Gap Δ and shows benchmarks like MemBench at ∼100k tokens can fit in 128k windows, risking invalid memory evaluation.
03
Backbone Sensitivity and System Cost
Anatomy of Agentic Memory quantifies backbone sensitivity, with Nemori’s format error rising from 17.91% on gpt-4o-mini to 30.38% on Qwen-2.5-3B, and exposes MemoryOS’s 32.372s latency.
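
A format error rate of the kind quoted above can be measured by counting outputs that fail a structural check. The sketch below assumes that a format error means "the output is not a JSON object containing the required keys". This explainer does not give the paper's actual definition, so `is_format_error`, `format_error_rate`, the `answer` key, and the sample outputs are all hypothetical.

```python
import json
from typing import Iterable, Sequence


def is_format_error(raw: str, required_keys: Sequence[str] = ("answer",)) -> bool:
    """Assumed definition: not valid JSON, not an object, or missing a key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return True
    if not isinstance(parsed, dict):
        return True
    return any(key not in parsed for key in required_keys)


def format_error_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that fail the structural check (0.0 if empty)."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_format_error(o) for o in outputs) / len(outputs)


# Hypothetical model outputs: two well-formed, two malformed.
samples = ['{"answer": "Paris"}', 'Paris', '{"answer": "Rome"}', '{"reply": "Oslo"}']
rate = format_error_rate(samples)
print(f"format error rate: {rate:.2%}")
```

Running the same check over outputs from two backbones would give a comparison in the style of the 17.91% versus 30.38% figures, which differ by 12.47 percentage points.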

RESULTS

By the Numbers

F1-Score

0.502

+0.089 over MAGMA (F1 0.467) on LoCoMo

Semantic Judge Score Prompt 2

0.781

+0.483 over SimpleMem (0.298) on LoCoMo

User-Facing Latency Total

32.372 s

MemoryOS vs 1.129 s for Nemori on per-turn latency

Construction Tokens

7044 k

Nemori vs 1308 k for SimpleMem in offline construction

On LoCoMo, Anatomy of Agentic Memory reports Nemori at 0.502 F1 and a 0.781 semantic judge score, versus SimpleMem at 0.268 F1 and a 0.298 semantic score. On cost, MemoryOS's 32.372 s latency and Nemori's 7,044k construction tokens contrast with SimpleMem's 1.057 s latency and 1,308k tokens.
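
The size of these gaps is easier to see as ratios and differences. The short calculation below uses only figures reported in this section; the variable names are illustrative.

```python
# Figures quoted in this section (LoCoMo).
memoryos_latency_s = 32.372
nemori_latency_s = 1.129
nemori_tokens_k = 7044
simplemem_tokens_k = 1308
nemori_judge, simplemem_judge = 0.781, 0.298
nemori_f1, simplemem_f1 = 0.502, 0.268

latency_ratio = memoryos_latency_s / nemori_latency_s   # roughly 28.7x
token_ratio = nemori_tokens_k / simplemem_tokens_k      # roughly 5.4x
judge_gap = nemori_judge - simplemem_judge              # 0.483
f1_gap = nemori_f1 - simplemem_f1                       # 0.234

print(f"MemoryOS is {latency_ratio:.1f}x slower per turn than Nemori")
print(f"Nemori uses {token_ratio:.1f}x the construction tokens of SimpleMem")
print(f"Judge gap {judge_gap:.3f}, F1 gap {f1_gap:.3f}")
```

Read together, the numbers show that the most accurate system here is not the cheapest to build, and the slowest system is slower by more than an order of magnitude.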

BENCHMARK

Robustness of system ranking across evaluation protocols on LoCoMo

Semantic Judge Score Prompt 2 (Nemori rubric) on LoCoMo.

KEY INSIGHT

The Counterintuitive Finding

Anatomy of Agentic Memory shows AMem’s F1 is only 0.116 (rank 5), yet semantic judge scores around 0.48 place it consistently fourth.

This breaks the assumption that higher lexical F1 always means better semantic correctness, revealing that abstractive, coherent systems can look weak under token overlap metrics.
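
A toy token-overlap F1 shows why an abstractive answer can score poorly while still being correct. This is a minimal sketch assuming lowercase whitespace tokenization; the paper's exact F1 implementation is not given here, and the function `token_f1` and the example sentences are invented for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 using lowercase whitespace tokens (assumed tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "she moved to berlin in 2021"
verbatim = "she moved to berlin in 2021"
paraphrase = "her relocation to the german capital happened in 2021"

print(f"verbatim:   {token_f1(verbatim, reference):.2f}")
print(f"paraphrase: {token_f1(paraphrase, reference):.2f}")
```

The paraphrase states the same fact but shares only three tokens with the reference, so its F1 falls to 0.40 while the verbatim answer scores 1.00. A semantic judge would be expected to treat both as correct, which is the kind of divergence described for AMem.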

WHY IT MATTERS

What this unlocks for the field

Anatomy of Agentic Memory gives builders a taxonomy and measurement toolkit to choose between lightweight, entity-centric, episodic, and structured memory under real latency and cost constraints.

Developers can now design MAG systems that explicitly trade off Context Saturation Gap, semantic judge scores, backbone stability, and construction tokens instead of optimizing a single misleading metric.
