MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Authors: Qingyao Ai, Yichen Tang, Changyue Wang et al.

arXiv 2025

TL;DR

MemoryBench uses a Task Provider, User Simulator, and Performance Monitor to stress-test LLM continual learning, revealing that MemoryOS can be over 30× slower than RAG while not improving scores.

THE PROBLEM

LLM systems lack continual learning despite rich user feedback logs

MemoryBench highlights that existing benchmarks "ignore the dynamic nature of continual learning" and "do not support the simulation or evaluation of procedural memory built from test-time user feedback."

As a result, LLMsys are evaluated mostly on static long-context reading tasks, so they get no credit for learning from explicit or implicit feedback, which limits real-world service improvement.

HOW IT WORKS

MemoryBench framework for memory and feedback utilization

MemoryBench centers on three modules: a Task Provider, a User Simulator, and a Performance Monitor, plus explicit modeling of declarative memory (task context) and procedural memory (feedback logs S).

You can think of MemoryBench like a search engine lab: Task Provider is the index, User Simulator is the traffic generator, and Performance Monitor is the analytics dashboard.

This design lets MemoryBench test how LLMsys use feedback logs beyond a plain context window, exposing whether systems can truly build and exploit procedural memory.
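To make the three-module design concrete, here is a minimal sketch of how the pieces could fit together. The class and method names, the exact-match metric, and the placeholder feedback record are all illustrative assumptions, not MemoryBench's actual API.

```python
from dataclasses import dataclass

# Illustrative sketch of the three MemoryBench modules; all names
# and signatures here are assumptions, not the paper's real code.

@dataclass
class TaskCase:
    q: str    # query
    v: dict   # evaluation metadata (e.g. reference answer)
    c: str    # task context, serving as declarative memory

class TaskProvider:
    """Supplies unified (q, v, c) cases drawn from the source datasets."""
    def __init__(self, cases):
        self.cases = list(cases)
    def __iter__(self):
        return iter(self.cases)

class UserSimulator:
    """Generates simulated feedback for a system response."""
    def feedback(self, case, response):
        # A real simulator would prompt an LLM acting as the user;
        # here we return a placeholder verbose-feedback record.
        return {"q": case.q, "response": response, "verbose": "looks incomplete"}

class PerformanceMonitor:
    """Scores responses on held-out test cases."""
    def score(self, case, response):
        # Stand-in metric: exact match against a reference answer in v.
        return float(response.strip() == case.v.get("answer", "").strip())
```

The Task Provider iterates cases, the User Simulator turns each response into a feedback record for the logs S, and the Performance Monitor scores held-out cases, which is the separation of concerns the framework description above relies on.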

DIAGRAM

MemoryBench taxonomy of memory and feedback

This diagram shows how MemoryBench categorizes declarative versus procedural memory and explicit versus implicit feedback.

DIAGRAM

MemoryBench evaluation pipeline across datasets

This diagram shows how MemoryBench partitions datasets, simulates feedback on training data, and evaluates LLMsys on held-out test cases.

PROCESS

How MemoryBench Handles a Task Case

  1. Data Preparation

     MemoryBench uses the Task Provider to format each case as (q, v, c), unifying query q, evaluation metadata v, and context c across 11 datasets.

  2. Feedback Simulation

     MemoryBench runs the User Simulator, with an LLM acting as the user, to generate verbose feedback and action feedback, forming procedural memory in the feedback logs S.

  3. Memory Construction

     MemoryBench feeds task context c as declarative memory and feedback logs S as procedural memory into the LLMsys under test, including RAG, A-Mem, Mem0, and MemoryOS.

  4. Performance Monitoring

     MemoryBench evaluates LLMsys on test cases via the Performance Monitor, merges metrics with an LLM-as-judge, and normalizes scores across datasets.
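The four steps above can be sketched as one loop: answer training cases while accumulating simulated feedback into the logs S, then evaluate on held-out cases with those logs available. The function signature and callables below are hypothetical, chosen only to show the data flow.

```python
# Hypothetical end-to-end loop over the four steps; the interface is
# illustrative, not MemoryBench's real one.

def run_benchmark(train_cases, test_cases, system, simulate_feedback, score):
    """train_cases/test_cases: lists of (q, v, c) tuples.

    system(q, c, logs) -> response string;
    simulate_feedback(q, response, v) -> feedback record;
    score(response, v) -> float in [0, 1].
    """
    feedback_logs = []  # procedural memory S, grown at test time
    # Steps 1-2: answer training cases and accumulate simulated feedback.
    for q, v, c in train_cases:
        response = system(q, c, feedback_logs)
        feedback_logs.append(simulate_feedback(q, response, v))
    # Steps 3-4: evaluate on held-out cases with the accumulated logs
    # injected as procedural memory, then average the scores.
    results = [score(system(q, c, feedback_logs), v) for q, v, c in test_cases]
    return sum(results) / len(results)
```

The key property this loop captures is that the evaluated system sees the feedback logs only as accumulated memory, never the held-out labels directly.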

KEY CONTRIBUTIONS

Key Contributions

  • MemoryBench benchmark design

    MemoryBench introduces a three-module framework with a Task Provider, User Simulator, and Performance Monitor to evaluate memory and continual learning in LLMsys across 20k cases.

  • Declarative and procedural memory coverage

    MemoryBench explicitly models declarative and procedural memory, providing semantic, episodic, verbose, and action feedback signals across multiple domains and languages.

  • Comprehensive baseline evaluation

    MemoryBench systematically evaluates Vanilla, BM25-S, BM25-M, Embed-S, Embed-M, A-Mem, Mem0, and MemoryOS, revealing that advanced memory systems often fail to beat naive RAG.
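To ground what a "naive RAG" baseline like BM25-S does, here is a compact Okapi BM25 retriever that could rank feedback-log entries against the current query. The tokenization (lowercase whitespace split), k1/b defaults, and class interface are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter

# Minimal Okapi BM25 retriever in the spirit of a BM25-based memory
# baseline: rank stored feedback entries by relevance to a query.
# Tokenization and parameters are illustrative assumptions.

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]   # naive tokenization
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # Document frequency of each term across the corpus.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query, idx):
        doc, tf = self.docs[idx], Counter(self.docs[idx])
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            denom = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / denom
        return s

    def top_k(self, query, k=3):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i),
                        reverse=True)
        return ranked[:k]
```

A memory baseline of this shape simply retrieves the top-k past feedback entries and prepends them to the prompt, which is the bar the advanced memory systems fail to consistently clear.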

RESULTS

By the Numbers

Cases in MemoryBench

20,000 cases

covering 11 datasets in LiSo, SiLo, LiLo, and SiSo formats

Datasets included

11 datasets

spanning open domain, legal, and academic tasks

Max input length

383,054 tokens

DialSim (theoffice) average input length in tokens

Max output length

1,628.04 tokens

HelloBench A.K. QA average output length in tokens

MemoryBench aggregates LoCoMo, DialSim, LexEval, JuDGE, IdeaBench, LimitGen-Syn, WritingPrompts, HelloBench, WritingBench, NF-Cats, and SciTechNews. These benchmarks cover long-context reasoning, dialogue, legal drafting, creative writing, and news summarization, showing how MemoryBench stresses LLMsys memory and feedback utilization.

BENCHMARK

Benchmark: MemoryBench overall off-policy results

Min-max normalized performance scores on MemoryBench partitions with explicit verbose feedback in the off-policy setting.
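Min-max normalization makes scores comparable across datasets with different metric scales. A sketch of the per-dataset computation, assuming a nested {dataset: {system: raw score}} layout (the aggregation details are not taken from the paper):

```python
# Per-dataset min-max normalization so systems can be compared across
# datasets with heterogeneous metrics. The data layout is an assumption.

def minmax_normalize(raw_scores):
    """raw_scores: {dataset: {system: score}} -> same shape, values in [0, 1]."""
    normalized = {}
    for dataset, per_system in raw_scores.items():
        lo, hi = min(per_system.values()), max(per_system.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all systems tie
        normalized[dataset] = {sys_name: (v - lo) / span
                               for sys_name, v in per_system.items()}
    return normalized
```

After normalization, the best system on each dataset scores 1.0 and the worst 0.0, so averaging across datasets weights each one equally rather than by its native metric range.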

KEY INSIGHT

The Counterintuitive Finding

MemoryBench shows that none of A-Mem, Mem0, or MemoryOS consistently beat simple RAG baselines like BM25-S and Embed-M across domains and task formats.

This is surprising because prior work reported strong gains on single datasets like LoCoMo, but MemoryBench reveals limited generalizability once procedural feedback logs and heterogeneous tasks are introduced.

WHY IT MATTERS

What this unlocks for the field

MemoryBench gives researchers a unified way to test declarative and procedural memory, explicit and implicit feedback, and continual learning in realistic LLM service scenarios.

Builders can now design memory architectures and optimization algorithms knowing they will be stress-tested on feedback logs S across 20k diverse cases, rather than only on static long-context QA.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.