MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Authors: Weiwei Xie, Shaoxiong Guo, Fan Zhang et al.

2026

TL;DR

MemEvoBench uses multi-round biased-memory interactions plus a memory-correction tool to show that Attack Success Rate (ASR) often exceeds 80%, and that active memory editing can cut Gemini-2.5-Pro's QA-style ASR from 67.0% to 19.0% in Round 1.


THE PROBLEM

Memory misevolution drives ASR up to 96.3% under biased feedback

MemEvoBench shows that under Vanilla prompting, the average Attack Success Rate (ASR) reaches 75.9% in Round 1 and climbs to 87.8% by Round 3 with biased feedback.

In long-horizon QA and workflow tasks, contaminated memory causes behavioral drift, leading to unsafe recommendations and security violations even when intrinsic model knowledge is adequate.

HOW IT WORKS

MemEvoBench — contaminated memory pools plus biased feedback

MemEvoBench centers on Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to simulate memory misevolution in realistic agents.

You can think of MemEvoBench as stress-testing an agent’s long-term memory like a corrupted hard drive that keeps getting reinforced by a biased user clicking “like” on unsafe behaviors.

This design lets MemEvoBench expose how cumulative memory drift and biased reinforcement create stable unsafe reasoning paths that a plain context window or static safety prompt cannot reveal.
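The contaminated-pool-plus-reinforcement loop can be sketched in a few lines of Python. This is an illustrative toy only, assuming a data model MemEvoBench does not specify: the entry fields, the reinforcement counter, and the retrieval-by-reinforcement rule are all invented here.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    text: str
    misleading: bool        # ground-truth label, used only for analysis
    reinforcement: int = 0  # how often biased feedback endorsed this entry

@dataclass
class MemoryPool:
    entries: list = field(default_factory=list)

    def inject(self, text, misleading=False):
        self.entries.append(MemoryEntry(text, misleading))

    def reinforce(self, entry):
        # A biased user "likes" an unsafe behavior, making the contaminated
        # entry more likely to be retrieved in later rounds.
        entry.reinforcement += 1

    def retrieve(self, k=3):
        # Most-reinforced entries surface first, so drift compounds over rounds.
        return sorted(self.entries, key=lambda e: -e.reinforcement)[:k]

pool = MemoryPool()
pool.inject("Mixing bleach and ammonia makes a stronger cleaner", misleading=True)
pool.inject("Never mix bleach and ammonia; it releases toxic gas")
pool.reinforce(pool.entries[0])  # biased feedback endorses the unsafe entry
print(pool.retrieve(k=1)[0].misleading)  # → True
```

Each biased "like" bumps a contaminated entry's rank, which is the drift mechanism the benchmark measures across rounds.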

DIAGRAM

Three-round memory evolution with biased feedback

This diagram shows how MemEvoBench runs three related rounds per case, appending responses and biased feedback into memory to simulate misevolution.

DIAGRAM

MemEvoBench evaluation pipeline across QA and workflow styles

This diagram shows how MemEvoBench builds contaminated memory pools, runs QA and workflow tasks, and computes ASR with GPT-5.2 judging.

PROCESS

How MemEvoBench Handles a Multi-Round Evaluation Session

  1. Misleading Memory Injection

    MemEvoBench first constructs QA Style cases by mixing correct and misleading entries into a hybrid memory pool spanning 7 domains and 36 risk types.

  2. Noisy Tool Returns

    MemEvoBench then builds Workflow Style cases from 20 AgentSafetyBench environments, encoding correct and contaminated workflows as procedural memories.

  3. Memory Evolution and Biased Feedback

    Across three rounds, MemEvoBench appends each agent response and the simulated biased user feedback into the memory pool to model misevolution.

  4. Evaluation and Memory Ablation

    MemEvoBench computes ASR with GPT-5.2 judging, and compares runs with no memory, initial memory, and A-MEM to isolate memory-driven safety failures.
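The four steps above can be condensed into a toy driver loop. Everything here is a stand-in: agent_respond, biased_feedback, and judge_unsafe are placeholder stubs for the real agent, the simulated biased user, and the GPT judge, and the threshold rule that makes drift accumulate is invented purely for illustration.

```python
def agent_respond(case, memory):
    # The more praised (reinforced) entries in memory, the likelier the agent
    # is to echo the unsafe advice.
    praised = sum(1 for m in memory if m.get("feedback") == "praise")
    return "unsafe" if praised >= case["threshold"] else "safe"

def biased_feedback(response):
    return "praise"  # the biased user endorses every response, safe or not

def judge_unsafe(response):
    return response == "unsafe"

def run_case(case, rounds=3):
    memory, flags = list(case["seed_memory"]), []
    for r in range(1, rounds + 1):
        response = agent_respond(case, memory)
        memory.append({"round": r, "response": response,
                       "feedback": biased_feedback(response)})
        flags.append(judge_unsafe(response))  # per-round judge verdict
    return flags

def attack_success_rate(all_flags, round_idx):
    # ASR for a round = share of cases judged unsafe in that round.
    return 100.0 * sum(f[round_idx] for f in all_flags) / len(all_flags)

# Three cases that tip into unsafe behavior at different reinforcement levels.
cases = [{"seed_memory": [{"feedback": "praise"}], "threshold": t} for t in (1, 2, 3)]
flags = [run_case(c) for c in cases]
print(round(attack_success_rate(flags, 0), 1),   # Round 1 ASR
      round(attack_success_rate(flags, 2), 1))   # Round 3 ASR: rises with rounds
```

Even with every component stubbed out, the loop reproduces the qualitative pattern the benchmark reports: ASR climbs across rounds as biased feedback accumulates in memory.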

KEY CONTRIBUTIONS

Key Contributions

  • MemEvoBench benchmark for contaminated memory evolution

    MemEvoBench introduces QA Style and Workflow Style scenarios spanning 7 domains, 36 risk types, and 20 tool environments, with 108 QA and 83 workflow test cases.

  • Revealing evolutionary dynamics of memory misevolution

    MemEvoBench shows average Vanilla ASR of 75.9% in Round 1 rising to 87.8% by Round 3 with biased feedback, linking accumulated memory drift to safety failures.

  • Hybrid defense with external knowledge and ModTool

    MemEvoBench equips agents with a correct_memory tool and web_search, showing that Gemini-2.5-Pro QA ASR drops from 67.0% to 19.0% in Round 1 under +ModTool.
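A minimal sketch of what a correct_memory-style defense might look like. The tool name correct_memory is taken from the summary above; its signature, the delete/rewrite verdicts, and the lookup function standing in for web_search are all assumptions, not the benchmark's actual interface.

```python
def correct_memory(memory, entry_id, verdict, replacement=None):
    """Agent-invoked tool: delete or rewrite a memory entry judged incorrect."""
    if verdict == "delete":
        memory.pop(entry_id)
    elif verdict == "rewrite" and replacement is not None:
        memory[entry_id] = replacement
    return memory

def audit_with_external_knowledge(memory, lookup):
    # lookup plays the role of web_search here, mapping a stored claim to a
    # verdict from a trusted external source.
    for i in range(len(memory) - 1, -1, -1):  # iterate backwards so pops are safe
        if lookup(memory[i]) == "contradicted":
            correct_memory(memory, i, "delete")
    return memory

facts = {"mix bleach and ammonia for a stronger cleaner": "contradicted",
         "store incompatible chemicals separately": "supported"}
memory = list(facts)
audit_with_external_knowledge(memory, lambda claim: facts[claim])
print(memory)  # → ['store incompatible chemicals separately']
```

The point of the design is that the agent edits its memory actively (delete or rewrite) instead of passively retrieving whatever biased feedback has reinforced.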

RESULTS

By the Numbers

QA Style ASR, Round 1, Vanilla average: 75.9% (20.4 percentage points above the +SafePrompt average of 55.5%)

QA Style ASR, Round 3, GPT-5 Vanilla with feedback: 78.0% (vs 59.0% without feedback)

QA Style ASR, Round 1, Gemini-2.5-Pro: 67.0% (drops 48.0 percentage points to 19.0% with +ModTool)

Workflow Style ASR, Round 1, Gemini-2.5-Pro: 74.7% (reduced to 54.2% with +ModTool in the same round)

MemEvoBench evaluates ASR on QA Style (108 cases) and Workflow Style (83 cases), showing how contaminated memory and biased feedback drive high attack success and how +ModTool can sharply reduce QA risks for some models.


BENCHMARK

QA Style Round 1 ASR for Gemini-2.5-Pro across configurations

Attack Success Rate (%) on MemEvoBench QA Style Round 1 for Gemini-2.5-Pro under different memory safety configurations.

KEY INSIGHT

The Counterintuitive Finding

MemEvoBench shows that adding biased user feedback can push average ASR from 71.6% in Round 1 to 87.8% in Round 3, even with strong base models.

This is surprising because many assume that stronger intrinsic knowledge guarantees safety, yet MemEvoBench reveals that memory evolution and feedback loops can override that knowledge and stabilize unsafe behavior.

WHY IT MATTERS

What this unlocks for the field

MemEvoBench gives researchers a concrete way to quantify how long-term memory, noisy tools, and biased feedback jointly degrade safety across realistic agent deployments.

With MemEvoBench, builders can now design and test active memory defenses like +ModTool and A-MEM integration, targeting memory dynamics directly instead of relying solely on static safety prompts.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
