Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang

arXiv 2026

TL;DR

Anatomy of Agentic Memory uses a structure-first taxonomy plus empirical stress tests to show lexical F1 can rank AMem last (0.116) while Nemori leads (0.502) on LoCoMo.

THE PROBLEM

Agentic memory benchmarks saturate and F1 misranks systems (AMem F1 0.116 vs Nemori 0.502)

Anatomy of Agentic Memory shows that many benchmarks fit entirely within modern 128k context windows, making external memory seemingly unnecessary and causing benchmark saturation.

On LoCoMo, Anatomy of Agentic Memory finds AMem gets F1 0.116 while Nemori reaches 0.502. Because AMem still places fourth under the semantic judge, the lexical metric misranks memory systems and obscures their semantic performance.

HOW IT WORKS

Anatomy of Agentic Memory: structural taxonomy plus empirical stress tests

Anatomy of Agentic Memory introduces a four-part taxonomy over Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory, then evaluates LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem.

You can think of Anatomy of Agentic Memory like an operating system profiler for agents, measuring how different memory "modules" behave under load instead of proposing yet another RAM upgrade.

This structure-first analysis in Anatomy of Agentic Memory enables reasoning about when memory architectures help beyond a plain context window, and how evaluation metrics, backbones, and latency jointly constrain real deployments.

DIAGRAM

Memory-Augmented Generation Taxonomy in Anatomy of Agentic Memory

This diagram shows how Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structural memory categories and their representative systems.

DIAGRAM

Evaluation Pipeline in Anatomy of Agentic Memory

This diagram shows how Anatomy of Agentic Memory evaluates memory systems across benchmarks, metrics, backbones, and system costs.

PROCESS

How Anatomy of Agentic Memory Handles Evaluation and Analysis Lifecycle

  1. Taxonomy of Agentic Memory

    Anatomy of Agentic Memory first defines the four memory structures using Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory to categorize systems.

  2. Experimental Setup

    Anatomy of Agentic Memory selects representative systems LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem and pairs them with gpt-4o-mini and Qwen-2.5-3B for controlled comparisons.

  3. Benchmark Scalability

    Anatomy of Agentic Memory analyzes benchmark volume, interaction depth, and entity diversity on HotpotQA, LoCoMo, LongMemEval, and MemBench to quantify saturation risk under long contexts.

  4. LLM-as-a-Judge Evaluation

    Anatomy of Agentic Memory compares lexical F1 with semantic judge rankings across three prompts from MAGMA, Nemori, and SimpleMem to expose misalignment and robustness.
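The comparison in step 4 amounts to checking how well two rankings of the same systems agree. A minimal sketch of that idea follows, using Spearman rank correlation as one possible agreement measure; the summary does not say which statistic the paper uses, so this choice is an assumption. The system names `sys_a` to `sys_d` and all scores are hypothetical placeholders, not results from the paper.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each system to its rank (1 = best), assuming no tied scores."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def spearman(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Spearman rank correlation between two score tables.

    Assumes both tables cover the same systems, at least two of them,
    and that neither table contains ties.
    """
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    squared_diffs = sum((ranks_a[k] - ranks_b[k]) ** 2 for k in ranks_a)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical scores for illustration only (not taken from the paper).
lexical_f1 = {"sys_a": 0.50, "sys_b": 0.40, "sys_c": 0.30, "sys_d": 0.10}
judge_score = {"sys_a": 0.60, "sys_b": 0.67, "sys_c": 0.40, "sys_d": 0.55}

f1_ranks = ranks(lexical_f1)
judge_ranks = ranks(judge_score)
for name in lexical_f1:
    print(name, "F1 rank:", f1_ranks[name], "judge rank:", judge_ranks[name])
print("Spearman correlation:", spearman(lexical_f1, judge_score))
```

A correlation well below 1 signals that the lexical and semantic protocols order the systems differently. Repeating the calculation for each judge prompt gives a rough view of how stable the judge ranking is.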

KEY CONTRIBUTIONS

  • Taxonomy of Agentic Memory

    Anatomy of Agentic Memory proposes a structure-first taxonomy over Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory to unify diverse MAG designs.

  • Benchmark Saturation Analysis

    Anatomy of Agentic Memory quantifies saturation risk, showing datasets like MemBench at ∼100k tokens fit in 128k windows, while LongMemEval-M at >1M tokens structurally requires external memory.

  • Metric and System Limitations

    Anatomy of Agentic Memory demonstrates F1 misranks AMem at 0.116 versus Nemori at 0.502 on LoCoMo and profiles latency, with MemoryOS reaching 32.372 seconds total user latency per turn.
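The saturation argument above reduces to comparing a benchmark's history size against the context window. The sketch below makes that comparison explicit. The `Benchmark` class and both helper functions are hypothetical names introduced here, and the token counts are the rough figures quoted in this summary (LongMemEval-M is entered at 1M as a lower bound, since the text only says it exceeds that).

```python
from dataclasses import dataclass

CONTEXT_WINDOW_TOKENS = 128_000  # window size cited in the text


@dataclass(frozen=True)
class Benchmark:
    name: str
    approx_tokens: int  # rough size of the full interaction history


def saturation_ratio(benchmark: Benchmark, window: int = CONTEXT_WINDOW_TOKENS) -> float:
    """Fraction of the context window the benchmark history would occupy."""
    return benchmark.approx_tokens / window


def requires_external_memory(benchmark: Benchmark, window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """True when the history cannot fit in the window at all."""
    return benchmark.approx_tokens > window


benchmarks = [
    Benchmark("MemBench", 100_000),          # ~100k tokens per the text
    Benchmark("LongMemEval-M", 1_000_000),   # ">1M" in the text; lower bound used
]

for b in benchmarks:
    print(b.name, round(saturation_ratio(b), 2), requires_external_memory(b))
```

A ratio below 1 means a long-context model could in principle read the whole history directly, so a memory system's gains on that benchmark are harder to attribute to the memory architecture itself.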

RESULTS

By the Numbers

F1-Score: 0.502 (+0.386 over AMem)

Semantic Judge Score, Prompt 1: 0.670 (vs Nemori 0.602)

User-Facing Latency, Total (s): 32.372 (MemoryOS latency on LoCoMo)

Construction Cost, Tokens (k): 7044 (Nemori offline index tokens)

On the LoCoMo long-term memory benchmark, Anatomy of Agentic Memory reports F1 and semantic judge scores alongside latency and construction costs. The results show that Nemori and MAGMA score well semantically, while MemoryOS pays a 32.372 second latency tax and Nemori consumes 7,044k construction tokens.

BENCHMARK

Robustness of system ranking across evaluation protocols on LoCoMo

F1-Score on LoCoMo long-term memory benchmark.

KEY INSIGHT

The Counterintuitive Finding

Anatomy of Agentic Memory shows AMem has the worst F1 on LoCoMo at 0.116, yet it consistently ranks fourth under the semantic judge.

This is surprising because many assume higher lexical F1 always signals better memory, but Anatomy of Agentic Memory shows that abstractive systems can look bad lexically while remaining semantically coherent.
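To see why an abstractive answer can score poorly on a lexical metric, consider a minimal sketch of token-level F1 in the common SQuAD style. The exact normalization used in the paper is not given in this summary, so the `tokenize` and `token_f1` helpers and the example strings below are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the red bicycle"                          # hypothetical gold answer
print(token_f1("the red bicycle", reference))          # verbatim answer
print(token_f1("She bought a red bicycle", reference)) # partial word overlap
print(token_f1("a crimson bike", reference))           # paraphrase, no shared tokens
```

The paraphrase conveys the same meaning as the reference but shares no tokens with it, so its lexical score collapses. A semantic judge would be expected to accept it, which is the kind of gap the finding above describes.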

WHY IT MATTERS

What this unlocks for the field

Anatomy of Agentic Memory gives practitioners a concrete framework to choose between Lightweight Semantic, Episodic and Reflective, or Structured and Hierarchical memory based on saturation, metrics, and cost.

Builders can now design MAG systems and benchmarks that explicitly test beyond-context reasoning, select backbones with acceptable format error rates, and budget for latency and construction tokens upfront.
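Budgeting for latency and construction tokens upfront can be as simple as checking each candidate system against explicit limits. The sketch below is one way to do that. `CostProfile` and `budget_violations` are hypothetical names, the two profiles carry only the figures quoted in this summary, and the budget thresholds are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostProfile:
    system: str
    latency_s: Optional[float] = None              # total user-facing latency per turn
    construction_k_tokens: Optional[float] = None  # offline construction cost, in thousands of tokens


def budget_violations(
    profile: CostProfile,
    max_latency_s: float,
    max_construction_k_tokens: float,
) -> list[str]:
    """List the budgets a system exceeds; unknown (None) costs are skipped, not assumed safe."""
    violations = []
    if profile.latency_s is not None and profile.latency_s > max_latency_s:
        violations.append("latency")
    if (
        profile.construction_k_tokens is not None
        and profile.construction_k_tokens > max_construction_k_tokens
    ):
        violations.append("construction")
    return violations


# Only the figures quoted in this summary are filled in; other costs are left unknown.
memoryos = CostProfile("MemoryOS", latency_s=32.372)
nemori = CostProfile("Nemori", construction_k_tokens=7044)

# Hypothetical budgets: 5 seconds per turn, 5,000k construction tokens.
for profile in (memoryos, nemori):
    print(profile.system, budget_violations(profile, 5.0, 5000))
```

An empty list here does not mean a system is within budget when some of its costs are unknown. In practice each missing value would need to be measured before the system is accepted.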

Related papers

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026

Multi-Agent Memory Architecture organizes **Agent IO Layer**, **Agent Cache Layer**, and **Agent Memory Layer** plus **Agent Cache Sharing** and **Agent Memory Access** protocols into a unified architectural framing for multi-agent systems. As a position paper, it reports no benchmark results or numeric comparisons against any baseline.

Survey

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

Yaxiong Wu, Sheng Liang et al.

arXiv 2025

From Human Memory to AI Memory organizes LLM memory using the **3D-8Q Memory Taxonomy**, mapping human memory categories to personal and system memory across object, form, and time. From Human Memory to AI Memory reports no new benchmarks but consolidates systems like MemoryBank, HippoRAG, and MemoRAG into a single conceptual framework.

Survey · Memory Architecture

Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures

Parsa Omidi, Xingshuai Huang et al.

arXiv 2025

Memory-Augmented Transformers organizes **functional objectives**, **memory types**, and **integration techniques** into a three-axis taxonomy, grounded in biological systems like sensory, working, and long-term memory. The survey synthesizes dozens of architectures to highlight emerging mechanisms such as hierarchical buffering and surprise-gated updates that move beyond static KV caches.