MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Authors: Qingyao Ai, Yichen Tang, Changyue Wang et al.

arXiv 2025

TL;DR

MemoryBench uses a Task Provider, User Simulator, and Performance Monitor to stress-test LLM continual learning, revealing that MemoryOS can be over 30× slower than RAG while not improving scores.

THE PROBLEM

LLM systems lack continual learning despite rich user feedback logs

MemoryBench highlights that existing benchmarks "ignore the dynamic nature of continual learning" and "do not support the simulation or evaluation of procedural memory built from test-time user feedback."

As a result, LLMsys are evaluated mostly on static long-context reading tasks, so they get no credit for learning from explicit or implicit feedback, which limits real-world service improvement.

HOW IT WORKS

MemoryBench framework for memory and feedback utilization

MemoryBench centers on three modules: a Task Provider, a User Simulator, and a Performance Monitor, plus explicit modeling of declarative memory (task context) and procedural memory (feedback logs S).

You can think of MemoryBench like a search engine lab: Task Provider is the index, User Simulator is the traffic generator, and Performance Monitor is the analytics dashboard.

This design lets MemoryBench test how LLMsys use feedback logs beyond a plain context window, exposing whether systems can truly build and exploit procedural memory.
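To make the three-module design concrete, here is a minimal sketch of how the pieces could fit together. The class and method names, the exact-match metric, and the placeholder feedback record are all illustrative assumptions, not MemoryBench's actual API.

```python
from dataclasses import dataclass

# Illustrative sketch of the three MemoryBench modules; all names
# and signatures here are assumptions, not the paper's real code.

@dataclass
class TaskCase:
    q: str    # query
    v: dict   # evaluation metadata (e.g. reference answer)
    c: str    # task context, serving as declarative memory

class TaskProvider:
    """Supplies unified (q, v, c) cases drawn from the source datasets."""
    def __init__(self, cases):
        self.cases = list(cases)
    def __iter__(self):
        return iter(self.cases)

class UserSimulator:
    """Generates simulated feedback for a system response."""
    def feedback(self, case, response):
        # A real simulator would prompt an LLM acting as the user;
        # here we return a placeholder verbose-feedback record.
        return {"q": case.q, "response": response, "verbose": "looks incomplete"}

class PerformanceMonitor:
    """Scores responses on held-out test cases."""
    def score(self, case, response):
        # Stand-in metric: exact match against a reference answer in v.
        return float(response.strip() == case.v.get("answer", "").strip())
```

The Task Provider iterates cases, the User Simulator turns each response into a feedback record for the logs S, and the Performance Monitor scores held-out cases, which is the separation of concerns the framework description above relies on.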

DIAGRAM

MemoryBench taxonomy of memory and feedback

This diagram shows how MemoryBench categorizes declarative versus procedural memory and explicit versus implicit feedback.

DIAGRAM

MemoryBench evaluation pipeline across datasets

This diagram shows how MemoryBench partitions datasets, simulates feedback on training data, and evaluates LLMsys on held-out test cases.

PROCESS

How MemoryBench Handles a Task Case

  1. Data Preparation

     MemoryBench uses the Task Provider to format each case as (q, v, c), unifying query q, evaluation metadata v, and context c across 11 datasets.

  2. Feedback Simulation

     MemoryBench runs the User Simulator, with an LLM acting as the user, to generate verbose feedback and action feedback, forming procedural memory in the feedback logs S.

  3. Memory Construction

     MemoryBench feeds task context c as declarative memory and feedback logs S as procedural memory into the LLMsys under test, including RAG, A-Mem, Mem0, and MemoryOS.

  4. Performance Monitoring

     MemoryBench evaluates LLMsys on test cases via the Performance Monitor, merges metrics with an LLM-as-judge, and normalizes scores across datasets.
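The four steps above can be sketched as one loop: answer training cases while accumulating simulated feedback into the logs S, then evaluate on held-out cases with those logs available. The function signature and callables below are hypothetical, chosen only to show the data flow.

```python
# Hypothetical end-to-end loop over the four steps; the interface is
# illustrative, not MemoryBench's real one.

def run_benchmark(train_cases, test_cases, system, simulate_feedback, score):
    """train_cases/test_cases: lists of (q, v, c) tuples.

    system(q, c, logs) -> response string;
    simulate_feedback(q, response, v) -> feedback record;
    score(response, v) -> float in [0, 1].
    """
    feedback_logs = []  # procedural memory S, grown at test time
    # Steps 1-2: answer training cases and accumulate simulated feedback.
    for q, v, c in train_cases:
        response = system(q, c, feedback_logs)
        feedback_logs.append(simulate_feedback(q, response, v))
    # Steps 3-4: evaluate on held-out cases with the accumulated logs
    # injected as procedural memory, then average the scores.
    results = [score(system(q, c, feedback_logs), v) for q, v, c in test_cases]
    return sum(results) / len(results)
```

The key property this loop captures is that the evaluated system sees the feedback logs only as accumulated memory, never the held-out labels directly.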

KEY CONTRIBUTIONS

Key Contributions

  • MemoryBench benchmark design

    MemoryBench introduces a three-module framework with a Task Provider, User Simulator, and Performance Monitor to evaluate memory and continual learning in LLMsys across 20k cases.

  • Declarative and procedural memory coverage

    MemoryBench explicitly models declarative and procedural memory, providing semantic, episodic, verbose, and action feedback signals across multiple domains and languages.

  • Comprehensive baseline evaluation

    MemoryBench systematically evaluates Vanilla, BM25-S, BM25-M, Embed-S, Embed-M, A-Mem, Mem0, and MemoryOS, revealing that advanced memory systems often fail to beat naive RAG.
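To ground what a "naive RAG" baseline like BM25-S does, here is a compact Okapi BM25 retriever that could rank feedback-log entries against the current query. The tokenization (lowercase whitespace split), k1/b defaults, and class interface are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter

# Minimal Okapi BM25 retriever in the spirit of a BM25-based memory
# baseline: rank stored feedback entries by relevance to a query.
# Tokenization and parameters are illustrative assumptions.

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]   # naive tokenization
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # Document frequency of each term across the corpus.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query, idx):
        doc, tf = self.docs[idx], Counter(self.docs[idx])
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            denom = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / denom
        return s

    def top_k(self, query, k=3):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i),
                        reverse=True)
        return ranked[:k]
```

A memory baseline of this shape simply retrieves the top-k past feedback entries and prepends them to the prompt, which is the bar the advanced memory systems fail to consistently clear.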

RESULTS

By the Numbers

Cases in MemoryBench

20,000 cases

covering 11 datasets in LiSo, SiLo, LiLo, and SiSo formats

Datasets included

11 datasets

spanning open domain, legal, and academic tasks

Max input length

383,054 tokens

DialSim (theoffice) average input length in tokens

Max output length

1,628.04 tokens

HelloBench A.K. QA average output length in tokens

MemoryBench aggregates LoCoMo, DialSim, LexEval, JuDGE, IdeaBench, LimitGen-Syn, WritingPrompts, HelloBench, WritingBench, NF-Cats, and SciTechNews. These benchmarks cover long-context reasoning, dialogue, legal drafting, creative writing, and news summarization, showing how MemoryBench stresses LLMsys memory and feedback utilization.

BENCHMARK

Benchmark: MemoryBench overall off-policy results

Min-max normalized performance scores on MemoryBench partitions with explicit verbose feedback in the off-policy setting.
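Min-max normalization makes scores comparable across datasets with different metric scales. A sketch of the per-dataset computation, assuming a nested {dataset: {system: raw score}} layout (the aggregation details are not taken from the paper):

```python
# Per-dataset min-max normalization so systems can be compared across
# datasets with heterogeneous metrics. The data layout is an assumption.

def minmax_normalize(raw_scores):
    """raw_scores: {dataset: {system: score}} -> same shape, values in [0, 1]."""
    normalized = {}
    for dataset, per_system in raw_scores.items():
        lo, hi = min(per_system.values()), max(per_system.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all systems tie
        normalized[dataset] = {sys_name: (v - lo) / span
                               for sys_name, v in per_system.items()}
    return normalized
```

After normalization, the best system on each dataset scores 1.0 and the worst 0.0, so averaging across datasets weights each one equally rather than by its native metric range.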

KEY INSIGHT

The Counterintuitive Finding

MemoryBench shows that none of A-Mem, Mem0, or MemoryOS consistently beat simple RAG baselines like BM25-S and Embed-M across domains and task formats.

This is surprising because prior work reported strong gains on single datasets like LoCoMo, but MemoryBench reveals limited generalizability once procedural feedback logs and heterogeneous tasks are introduced.

WHY IT MATTERS

What this unlocks for the field

MemoryBench gives researchers a unified way to test declarative and procedural memory, explicit and implicit feedback, and continual learning in realistic LLM service scenarios.

Builders can now design memory architectures and optimization algorithms knowing they will be stress-tested on feedback logs S across 20k diverse cases, rather than only on static long-context QA.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.