MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Authors: Haoran Tan, Zeyu Zhang, Chen Ma et al.

ACL 2025

TL;DR

MemBench uses multi-scenario, multi-level memory datasets plus four metrics to reveal that many agent memories collapse when context grows to 100k tokens.



THE PROBLEM

Memory benchmarks ignore reflective memory and long noisy sessions

Previous evaluations commonly cover only a narrow range of memory levels and interactive scenarios, and they lack comprehensive metrics.

As a result, LLM-based agents are judged mostly on factual memory in participation scenarios, while reflective memory, observation-only use, efficiency, and capacity under long, noisy histories go unmeasured.

HOW IT WORKS

MemBench — multi-scenario, multi-level, multi-metric memory evaluation

MemBench centers on a Multi-scenario Dataset, Multi-level Memory Content, and Multi-metric Evaluation built over user relation graphs and simulated dialogues.

You can think of MemBench like a stress lab where an agent’s memory is probed as conversations grow, mixing short-term “RAM” turns with long-term “disk” histories plus structured questions.

This design lets MemBench expose failures in accuracy, recall, capacity, and temporal efficiency that a plain context window or single-metric long-context test cannot reveal.

DIAGRAM

MemBench interaction flow across participation and observation scenarios

This diagram shows how MemBench feeds participation and observation data into LLM-based agents and then queries them for memory evaluation.

DIAGRAM

MemBench data generation and evaluation pipeline

This diagram shows how MemBench builds user relation graphs, generates dialogues, injects noise, and runs multi-metric evaluation.
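The generation side of this pipeline can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the attribute pool, the noise sentences, and the function names `sample_relation_graph` and `build_session` are hypothetical and are not taken from the MemBench code. The sketch samples one fact from a tiny relation graph, writes it as an evidence message, and hides it among noise messages while recording where the evidence sits.

```python
import random

# Hypothetical attribute pool; the real MemBench profiles are far richer.
PEOPLE = {
    "sister": {"name": "Mia", "birthday": "March 3"},
    "coworker": {"name": "Omar", "hobby": "climbing"},
}
# Hypothetical filler sentences standing in for injected noise.
NOISE = ["Nice weather today.", "I need more coffee.", "Traffic was bad."]


def sample_relation_graph(rng: random.Random) -> dict:
    """Pick one related person and one of their attributes to ask about."""
    relation = rng.choice(sorted(PEOPLE))
    attribute = rng.choice(sorted(k for k in PEOPLE[relation] if k != "name"))
    return {
        "relation": relation,
        "attribute": attribute,
        "value": PEOPLE[relation][attribute],
    }


def build_session(fact: dict, n_noise: int, rng: random.Random) -> dict:
    """Write the fact as one evidence message and hide it among noise."""
    evidence = f"My {fact['relation']}'s {fact['attribute']} is {fact['value']}."
    messages = [rng.choice(NOISE) for _ in range(n_noise)]
    evidence_index = rng.randrange(n_noise + 1)  # any slot, including the end
    messages.insert(evidence_index, evidence)
    return {
        "messages": messages,
        "question": f"What is my {fact['relation']}'s {fact['attribute']}?",
        "answer": fact["value"],
        "evidence": [evidence_index],  # gold positions, usable for recall
    }


if __name__ == "__main__":
    rng = random.Random(0)
    session = build_session(sample_relation_graph(rng), n_noise=5, rng=rng)
    print(session["question"], "->", session["answer"])
```

Raising `n_noise` is the toy analogue of moving from the 10k-token setting toward the 100k-token setting: the evidence stays the same while the surrounding history grows.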

PROCESS

How MemBench Handles a Memory Evaluation Session

  1. User Relation Graph Sampling

    MemBench samples a user relation graph from profiles and entities, capturing the people, events, places, and items connected to each user.

  2. Memory Dataset Construction

    MemBench turns the sampled attributes into participation dialogues and observation message lists, divided into sessions by time.

  3. Multi-scenario Memory

    MemBench keeps the two scenarios separate: participation-scenario dialogues on one side and observation-scenario message flows on the other.

  4. Multi-metric Evaluation

    MemBench computes memory accuracy, memory recall, memory capacity, and memory efficiency over Sub-dataset 1 and Sub-dataset 2.
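The evaluation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's scoring code: accuracy is taken to be exact match against a gold answer, recall is taken to be the share of gold evidence items found in the top-k retrieved items, and efficiency is approximated by wall-clock time per call. The function names are hypothetical.

```python
import time


def memory_accuracy(predictions: list, answers: list) -> float:
    """Share of questions whose predicted answer equals the gold answer."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def recall_at_k(retrieved: list, evidence: list, k: int = 10) -> float:
    """Share of gold evidence ids that appear among the top-k retrieved ids."""
    if not evidence:
        raise ValueError("evidence must not be empty")
    top_k = set(retrieved[:k])
    return sum(e in top_k for e in evidence) / len(evidence)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) as an efficiency proxy."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print(memory_accuracy(["A", "B", "C"], ["A", "B", "D"]))
    print(recall_at_k([5, 2, 9, 1], evidence=[2, 7], k=3))
```

Under this reading, memory capacity is not a separate formula but the same accuracy measured again as the history grows, for example on Sub-dataset 1 versus Sub-dataset 2.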

KEY CONTRIBUTIONS

Key Contributions

  • Multi-scenario Dataset

    MemBench constructs a dataset with 51k participation factual sessions and 8.5k observation factual sessions, plus reflective counterparts, covering both the participation scenario (PS) and the observation scenario (OS).

  • Multi-level Memory Content

    MemBench combines factual memory and reflective memory, enabling tasks such as cross-session reasoning, temporal reasoning, and reflective summarization.

  • Multi-metric Benchmark

    MemBench defines four metrics (memory accuracy, memory recall, memory capacity, and memory efficiency) and evaluates seven memory mechanisms under 10k-token and 100k-token conditions.
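To make the idea of comparing memory mechanisms concrete, here is a toy sketch of two of them behind a shared store/recall interface. The class names echo FullMemory and RetrievalMemory from this summary, but the implementations are assumptions: the real mechanisms are not described here, and the word-overlap scoring below is a stand-in for whatever retriever they actually use.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class FullMemory:
    """Toy mechanism: keep every message and return all of them on recall."""

    def __init__(self) -> None:
        self.messages: list = []

    def store(self, message: str) -> None:
        self.messages.append(message)

    def recall(self, query: str) -> list:
        return list(self.messages)


class RetrievalMemory(FullMemory):
    """Toy mechanism: return only the k messages that best overlap the query."""

    def __init__(self, k: int = 10) -> None:
        super().__init__()
        self.k = k

    def recall(self, query: str) -> list:
        q = tokens(query)
        # Rank by overlap (descending), breaking ties by original position.
        ranked = sorted(
            range(len(self.messages)),
            key=lambda i: (-len(q & tokens(self.messages[i])), i),
        )
        keep = sorted(ranked[: self.k])  # restore chronological order
        return [self.messages[i] for i in keep]


if __name__ == "__main__":
    memory = RetrievalMemory(k=1)
    for m in ["Nice weather today.", "My sister's birthday is March 3.", "Traffic was bad."]:
        memory.store(m)
    print(memory.recall("What is my sister's birthday?"))
```

The design point is that both mechanisms see the same stored history, so any difference in downstream accuracy comes from what `recall` hands to the model.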

RESULTS

By the Numbers

Participation factual accuracy, 10k: 0.692 (+0.053 over FullMemory on participation factual 10k)

Observation factual accuracy, 100k: 0.933 (+0.302 over FullMemory on observation factual 100k)

Retrieval factual Recall@10, 10k: 0.776 (retrieval effectiveness for key evidence dialogues)

Retrieval factual Recall@10, 100k: 0.749 (capacity under 100k-token noisy sessions)

MemBench reports these numbers on its factual memory benchmark, where Sub-dataset 1 has about 10k tokens per session and Sub-dataset 2 about 100k tokens. These results show that RetrievalMemory maintains high accuracy and recall even as MemBench scales context length and injects noise.
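The two accuracy figures are reported together with their margin over FullMemory, so the FullMemory scores they imply can be recovered by subtraction. The short calculation below does only that arithmetic on the numbers quoted in this summary. The derived baselines are not quoted from the paper's tables, and the variable names are illustrative.

```python
# (reported score, reported gain over FullMemory), as quoted in this summary.
reported = {
    "participation factual, 10k": (0.692, 0.053),
    "observation factual, 100k": (0.933, 0.302),
}


def implied_baseline(score: float, gain: float) -> float:
    """FullMemory score implied by a reported score and its stated gain."""
    return round(score - gain, 3)


baselines = {name: implied_baseline(*pair) for name, pair in reported.items()}

if __name__ == "__main__":
    for name, value in baselines.items():
        print(f"FullMemory, {name}: {value}")
```

On these figures the implied FullMemory scores are 0.639 and 0.631, which shows how much larger the gap is in the 100k-token observation setting than in the 10k-token participation setting.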

BENCHMARK

Participation factual accuracy on Sub-dataset 1 (10k tokens)

Accuracy on the MemBench participation factual memory task in the 10k-token setting, from Table 3 of the paper.

KEY INSIGHT

The Counterintuitive Finding

MemBench shows RetrievalMemory’s participation factual accuracy increases from 0.692 at 10k tokens to 0.833 at 100k tokens, despite much more noise.

This is surprising because longer, noisier histories usually hurt performance, yet MemBench reveals retrieval-based agents can benefit from more context when retrieval is precise.
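Putting the quoted figures side by side makes the pattern easier to see. The snippet below only subtracts numbers that appear in this summary. It assumes the Recall@10 values and the accuracy values describe comparable settings, which the summary does not state explicitly.

```python
# Figures quoted in this summary for the 10k and 100k settings.
accuracy = {"10k": 0.692, "100k": 0.833}    # participation factual accuracy
recall_at_10 = {"10k": 0.776, "100k": 0.749}  # retrieval factual Recall@10


def change(metric: dict) -> float:
    """Absolute change from the 10k setting to the 100k setting."""
    return round(metric["100k"] - metric["10k"], 3)


accuracy_change = change(accuracy)
recall_change = change(recall_at_10)

if __name__ == "__main__":
    print(f"accuracy change:  {accuracy_change:+.3f}")
    print(f"Recall@10 change: {recall_change:+.3f}")
```

Accuracy rises by 0.141 while Recall@10 slips by only 0.027, which is consistent with the claim that retrieval stays precise enough at 100k tokens for the longer history not to hurt.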

WHY IT MATTERS

What this unlocks for the field

MemBench unlocks a way to stress-test agent memories across factual and reflective levels, participation and observation, while tracking accuracy, recall, capacity, and efficiency.

With MemBench, builders can systematically compare memory mechanisms like MemGPT, GenerativeAgent, and RetrievalMemory under realistic 100k-token noisy conditions, rather than relying on narrow long-context scores.


