MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Authors: Weiwei Xie, Shaoxiong Guo, Fan Zhang et al.

2026

TL;DR

MemEvoBench uses multi-round biased-memory interactions plus a memory-correction tool to show that Attack Success Rate (ASR) often exceeds 80%, and that active memory editing can cut Gemini-2.5-Pro's QA-style ASR from 67.0% to 19.0% in Round 1.


THE PROBLEM

Memory misevolution drives ASR up to 96.3% under biased feedback

MemEvoBench shows that under Vanilla prompting, the average Attack Success Rate (ASR) reaches 75.9% in Round 1 and climbs to 87.8% by Round 3 with biased feedback.

In long-horizon QA and workflow tasks, contaminated memory causes behavioral drift, leading to unsafe recommendations and security violations even when intrinsic model knowledge is adequate.

HOW IT WORKS

MemEvoBench — contaminated memory pools plus biased feedback

MemEvoBench centers on Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to simulate memory misevolution in realistic agents.

You can think of MemEvoBench as stress-testing an agent’s long-term memory like a corrupted hard drive that keeps getting reinforced by a biased user clicking “like” on unsafe behaviors.

This design lets MemEvoBench expose how cumulative memory drift and biased reinforcement create stable unsafe reasoning paths that a plain context window or static safety prompt cannot reveal.
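The contaminated-pool-plus-reinforcement loop can be sketched in a few lines of Python. This is an illustrative toy only, assuming a data model MemEvoBench does not specify: the entry fields, the reinforcement counter, and the retrieval-by-reinforcement rule are all invented here.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    text: str
    misleading: bool        # ground-truth label, used only for analysis
    reinforcement: int = 0  # how often biased feedback endorsed this entry

@dataclass
class MemoryPool:
    entries: list = field(default_factory=list)

    def inject(self, text, misleading=False):
        self.entries.append(MemoryEntry(text, misleading))

    def reinforce(self, entry):
        # A biased user "likes" an unsafe behavior, making the contaminated
        # entry more likely to be retrieved in later rounds.
        entry.reinforcement += 1

    def retrieve(self, k=3):
        # Most-reinforced entries surface first, so drift compounds over rounds.
        return sorted(self.entries, key=lambda e: -e.reinforcement)[:k]

pool = MemoryPool()
pool.inject("Mixing bleach and ammonia makes a stronger cleaner", misleading=True)
pool.inject("Never mix bleach and ammonia; it releases toxic gas")
pool.reinforce(pool.entries[0])  # biased feedback endorses the unsafe entry
print(pool.retrieve(k=1)[0].misleading)  # → True
```

Each biased "like" bumps a contaminated entry's rank, which is the drift mechanism the benchmark measures across rounds.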

DIAGRAM

Three-round memory evolution with biased feedback

This diagram shows how MemEvoBench runs three related rounds per case, appending responses and biased feedback into memory to simulate misevolution.

DIAGRAM

MemEvoBench evaluation pipeline across QA and workflow styles

This diagram shows how MemEvoBench builds contaminated memory pools, runs QA and workflow tasks, and computes ASR with GPT-5.2 judging.

PROCESS

How MemEvoBench Handles a Multi-Round Evaluation Session

  1. Misleading Memory Injection

    MemEvoBench first constructs QA Style cases by mixing correct and misleading entries into a hybrid memory pool spanning 7 domains and 36 risk types.

  2. Noisy Tool Returns

    MemEvoBench then builds Workflow Style cases from 20 AgentSafetyBench environments, encoding correct and contaminated workflows as procedural memories.

  3. Memory Evolution and Biased Feedback

    Across three rounds, MemEvoBench appends each agent response and the simulated biased user feedback into the memory pool to model misevolution.

  4. Evaluation and Memory Ablation

    MemEvoBench computes ASR with GPT-5.2 judging, and compares runs with no memory, initial memory, and A-MEM to isolate memory-driven safety failures.
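The four steps above can be condensed into a toy driver loop. Everything here is a stand-in: agent_respond, biased_feedback, and judge_unsafe are placeholder stubs for the real agent, the simulated biased user, and the GPT judge, and the threshold rule that makes drift accumulate is invented purely for illustration.

```python
def agent_respond(case, memory):
    # The more praised (reinforced) entries in memory, the likelier the agent
    # is to echo the unsafe advice.
    praised = sum(1 for m in memory if m.get("feedback") == "praise")
    return "unsafe" if praised >= case["threshold"] else "safe"

def biased_feedback(response):
    return "praise"  # the biased user endorses every response, safe or not

def judge_unsafe(response):
    return response == "unsafe"

def run_case(case, rounds=3):
    memory, flags = list(case["seed_memory"]), []
    for r in range(1, rounds + 1):
        response = agent_respond(case, memory)
        memory.append({"round": r, "response": response,
                       "feedback": biased_feedback(response)})
        flags.append(judge_unsafe(response))  # per-round judge verdict
    return flags

def attack_success_rate(all_flags, round_idx):
    # ASR for a round = share of cases judged unsafe in that round.
    return 100.0 * sum(f[round_idx] for f in all_flags) / len(all_flags)

# Three cases that tip into unsafe behavior at different reinforcement levels.
cases = [{"seed_memory": [{"feedback": "praise"}], "threshold": t} for t in (1, 2, 3)]
flags = [run_case(c) for c in cases]
print(round(attack_success_rate(flags, 0), 1),   # Round 1 ASR
      round(attack_success_rate(flags, 2), 1))   # Round 3 ASR: rises with rounds
```

Even with every component stubbed out, the loop reproduces the qualitative pattern the benchmark reports: ASR climbs across rounds as biased feedback accumulates in memory.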

KEY CONTRIBUTIONS

Key Contributions

  • MemEvoBench benchmark for contaminated memory evolution

    MemEvoBench introduces QA Style and Workflow Style scenarios spanning 7 domains, 36 risk types, and 20 tool environments, with 108 QA and 83 workflow test cases.

  • Revealing evolutionary dynamics of memory misevolution

    MemEvoBench shows average Vanilla ASR of 75.9% in Round 1 rising to 87.8% by Round 3 with biased feedback, linking accumulated memory drift to safety failures.

  • Hybrid defense with external knowledge and ModTool

    MemEvoBench equips agents with a correct_memory tool and web_search, showing that Gemini-2.5-Pro QA ASR drops from 67.0% to 19.0% in Round 1 under +ModTool.
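A minimal sketch of what a correct_memory-style defense might look like. The tool name correct_memory is taken from the summary above; its signature, the delete/rewrite verdicts, and the lookup function standing in for web_search are all assumptions, not the benchmark's actual interface.

```python
def correct_memory(memory, entry_id, verdict, replacement=None):
    """Agent-invoked tool: delete or rewrite a memory entry judged incorrect."""
    if verdict == "delete":
        memory.pop(entry_id)
    elif verdict == "rewrite" and replacement is not None:
        memory[entry_id] = replacement
    return memory

def audit_with_external_knowledge(memory, lookup):
    # lookup plays the role of web_search here, mapping a stored claim to a
    # verdict from a trusted external source.
    for i in range(len(memory) - 1, -1, -1):  # iterate backwards so pops are safe
        if lookup(memory[i]) == "contradicted":
            correct_memory(memory, i, "delete")
    return memory

facts = {"mix bleach and ammonia for a stronger cleaner": "contradicted",
         "store incompatible chemicals separately": "supported"}
memory = list(facts)
audit_with_external_knowledge(memory, lambda claim: facts[claim])
print(memory)  # → ['store incompatible chemicals separately']
```

The point of the design is that the agent edits its memory actively (delete or rewrite) instead of passively retrieving whatever biased feedback has reinforced.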

RESULTS

By the Numbers

QA Style ASR, Round 1, Vanilla average: 75.9% (20.4 percentage points above the +SafePrompt average of 55.5%)

QA Style ASR, Round 3, GPT-5 Vanilla with feedback: 78.0% (vs 59.0% without feedback)

QA Style ASR, Round 1, Gemini-2.5-Pro: 67.0% (drops 48.0 percentage points to 19.0% with +ModTool)

Workflow Style ASR, Round 1, Gemini-2.5-Pro: 74.7% (reduced to 54.2% with +ModTool in the same round)

MemEvoBench evaluates ASR on QA Style (108 cases) and Workflow Style (83 cases), showing how contaminated memory and biased feedback drive high attack success and how +ModTool can sharply reduce QA risks for some models.


BENCHMARK

QA Style Round 1 ASR for Gemini-2.5-Pro across configurations

Attack Success Rate (%) on MemEvoBench QA Style Round 1 for Gemini-2.5-Pro under different memory safety configurations.

KEY INSIGHT

The Counterintuitive Finding

MemEvoBench shows that adding biased user feedback can push average ASR from 71.6% in Round 1 to 87.8% in Round 3, even with strong base models.

This is surprising because many assume that stronger intrinsic knowledge guarantees safety, yet MemEvoBench reveals that memory evolution and feedback loops can override that knowledge and stabilize unsafe behavior.

WHY IT MATTERS

What this unlocks for the field

MemEvoBench gives researchers a concrete way to quantify how long-term memory, noisy tools, and biased feedback jointly degrade safety across realistic agent deployments.

With MemEvoBench, builders can now design and test active memory defenses like +ModTool and A-MEM integration, targeting memory dynamics directly instead of relying solely on static safety prompts.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
