MemEvolve: Meta-Evolution of Agent Memory Systems

Authors: Guibin Zhang, Haotian Ren, Chong Zhan et al.

2025

TL;DR

MemEvolve applies diagnose-and-design meta-evolution to the Encode, Store, Retrieve, and Manage modules of agent memory, raising Flash Searcher pass@1 on xBench DS from 69.0 to 74.0.

THE PROBLEM

Static memory architectures cap self-evolution, even though gains of up to 17.06% are possible

Existing self-improving agents rely on fixed memory pipelines: memory lets the agent evolve, but the memory architecture itself never adapts to new tasks.

MemEvolve targets this gap. It shows that frameworks such as SmolAgent and Flash Searcher can gain up to 17.06% when their memory architectures meta-evolve instead of staying static and misaligned with the task.

HOW IT WORKS

MemEvolve: dual evolution over Encode, Store, Retrieve, and Manage

MemEvolve treats memory as four modular components (Encode, Store, Retrieve, and Manage) and meta-evolves their programmatic implementations, which together form a memory genotype.

You can think of MemEvolve like a student who not only writes better notes over time but also keeps redesigning their notebook layout and filing system.

This diagnose-and-design evolution lets MemEvolve discover architectures with task-aware memory behavior, beyond what a fixed context window or a single handcrafted memory system can offer.
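To make the four-module view concrete, here is a minimal sketch of a memory object that exposes Encode, Store, Retrieve, and Manage as separate methods. Everything in it is an illustrative assumption rather than the paper's or EvolveLab's code: the class names `MemoryItem` and `SimpleMemory`, the `max_items` budget, the word-overlap retrieval, and the least-used eviction rule are all hypothetical stand-ins. The point is only that each method is an independent piece of code that a meta-evolution step could swap out.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One stored unit of experience (hypothetical schema)."""
    text: str
    uses: int = 0


@dataclass
class SimpleMemory:
    """Toy memory with the four modules as methods; max_items is an assumed knob."""
    max_items: int = 3
    items: list = field(default_factory=list)

    def encode(self, trajectory):
        # Encode: compress a raw trajectory (list of step strings) into one item.
        return MemoryItem(text=" -> ".join(trajectory))

    def store(self, item):
        # Store: append to the memory state, then let Manage enforce the budget.
        self.items.append(item)
        self.manage()

    def retrieve(self, query, k=1):
        # Retrieve: rank items by word overlap with the query
        # (a simple stand-in for embedding similarity).
        query_words = set(query.lower().split())

        def overlap(item):
            return len(query_words & set(item.text.lower().split()))

        hits = sorted(self.items, key=overlap, reverse=True)[:k]
        for item in hits:
            item.uses += 1
        return hits

    def manage(self):
        # Manage: evict the least-used item (oldest on ties) while over budget.
        while len(self.items) > self.max_items:
            self.items.remove(min(self.items, key=lambda it: it.uses))
```

In this framing, a "memory genotype" would correspond to a particular choice of these four method bodies, and evolving the architecture means rewriting one or more of them.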

DIAGRAM

Dual evolution loop between trajectories and memory architectures

This diagram shows how MemEvolve alternates an inner experience-evolution loop with an outer architectural-evolution loop over candidate memory systems.

DIAGRAM

Evaluation pipeline across GAIA, WebWalkerQA, xBench DS, and TaskCraft

This diagram shows how MemEvolve is evolved on TaskCraft and then evaluated on, and transferred to, GAIA, WebWalkerQA, and xBench DS with different agent frameworks.

PROCESS

How MemEvolve Handles a Dual Evolution Iteration

01
Inner Loop Experience Evolution
MemEvolve runs agents with a fixed memory architecture Ω_j^(k), using Encode, Store, and Retrieve to update the memory state along trajectories and collect feedback metrics f_j.
02
Aggregation Operator S
MemEvolve applies the aggregation operator S to summarize trajectory-level feedback into vectors F_j^(k) that capture performance, cost, and delay for each candidate.
03
Architectural Selection
MemEvolve uses Pareto ranking over F_j^(k) to select a top-K parent set P^(k) that balances task success, token cost, and latency.
04
Diagnose and Design Evolution
MemEvolve diagnoses each parent with D(Ω_p^(k)) and designs S descendants by modifying the Encode, Store, Retrieve, and Manage implementations within the modular design space.
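The four steps above can be sketched as one outer iteration in Python. This is a hedged illustration, not the authors' implementation: the `Feedback` fields, the mean-based `aggregate`, the "count how many candidates dominate you" ranking, the tie-break on success, and the callback names `run_inner_loop` and `diagnose_and_design` are all assumptions introduced here. The source only states that feedback is aggregated, that parents are chosen by Pareto ranking over success, cost, and latency, and that each parent yields several redesigned descendants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Feedback for one candidate: higher success is better, lower cost/latency is better."""
    success: float
    cost: float
    latency: float


def aggregate(per_trajectory):
    """Assumed stand-in for the aggregation operator: average each metric."""
    n = len(per_trajectory)
    return Feedback(
        sum(m.success for m in per_trajectory) / n,
        sum(m.cost for m in per_trajectory) / n,
        sum(m.latency for m in per_trajectory) / n,
    )


def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a.success >= b.success and a.cost <= b.cost and a.latency <= b.latency
    better = a.success > b.success or a.cost < b.cost or a.latency < b.latency
    return no_worse and better


def pareto_top_k(feedback, k):
    """Rank by number of dominating candidates (0 = Pareto front), tie-break on success."""
    def rank(name):
        return sum(dominates(other, feedback[name]) for other in feedback.values())

    ordered = sorted(feedback, key=lambda n: (rank(n), -feedback[n].success))
    return ordered[:k]


def evolve_once(candidates, run_inner_loop, diagnose_and_design, top_k, n_children):
    """One outer iteration: inner loop, aggregation, Pareto selection, redesign."""
    # Steps 01-02: run each candidate architecture and aggregate its trajectory feedback.
    feedback = {name: aggregate(run_inner_loop(arch)) for name, arch in candidates.items()}
    # Step 03: keep the top-K parents under Pareto ranking.
    parents = pareto_top_k(feedback, top_k)
    # Step 04: each parent produces n_children modified descendants.
    children = {}
    for p in parents:
        for i in range(n_children):
            children[f"{p}.{i}"] = diagnose_and_design(candidates[p], feedback[p], i)
    return parents, children
```

In a real system `run_inner_loop` would execute agent trajectories with the candidate memory and `diagnose_and_design` would rewrite module code; here they are left as caller-supplied callbacks so the selection logic can be read and tested in isolation.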

KEY CONTRIBUTIONS

01
Unified Codebase EvolveLab
MemEvolve builds on EvolveLab, which re-implements twelve self-improving memory systems under a shared Encode, Store, Retrieve, Manage interface and supports GAIA, xBench, DeepResearchBench, and TaskCraft.
02
Meta Evolution Framework
MemEvolve introduces a dual-evolution process with diagnose-and-design evolution that meta-learns memory architectures rather than fixing Encode, Store, Retrieve, and Manage by hand.
03
Experimental Evaluation
MemEvolve improves frameworks such as SmolAgent and Flash Searcher by up to 17.06%, and its memory architectures transfer across tasks, frameworks, and LLM backbones such as GPT-5 mini, Kimi K2, and DeepSeek V3.2.

RESULTS

By the Numbers

WebWalkerQA

61.18%

+2.36 over SmolAgent (GPT-5 mini)

xBench DS

74.0%

+5.0 over Flash Searcher (GPT-5 mini)

TaskCraft

72.00%

+2.33 over Flash Searcher (GPT-5 mini)

GAIA pass@3

80.61%

Flash Searcher + MemEvolve (GPT-5 mini) vs. 73.94 baseline

These results on WebWalkerQA, xBench DeepSearch, TaskCraft, and GAIA show that MemEvolve consistently improves pass rates while keeping per-task API cost at roughly $0.136 to $0.141 and latency comparable to other self-improving memory systems.
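As a quick sanity check on the cards above, the short script below recomputes the baselines implied by each reported score and gain, plus the GAIA pass@3 difference. The scores and gains are copied from this page; the implied baselines and the GAIA gain are derived here by subtraction and are not figures quoted from the paper.

```python
# Reported MemEvolve score and stated gain per benchmark, copied from the cards above.
reported = {
    "WebWalkerQA": (61.18, 2.36),
    "xBench DS": (74.0, 5.0),
    "TaskCraft": (72.00, 2.33),
}


def implied_baseline(score, gain):
    """Baseline implied by a score and its stated absolute gain (derived, not quoted)."""
    return round(score - gain, 2)


baselines = {name: implied_baseline(s, g) for name, (s, g) in reported.items()}

# GAIA pass@3 is given as score vs. baseline, so the gain is derived instead.
gaia_gain = round(80.61 - 73.94, 2)

for name, base in baselines.items():
    print(f"{name}: implied baseline {base}")
print(f"GAIA pass@3: derived gain {gaia_gain}")
```

The implied WebWalkerQA and xBench DS baselines (58.82 and 69.0) agree with the before-and-after figures quoted elsewhere on this page, which suggests the gains are absolute percentage points rather than relative improvements.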

BENCHMARK

Performance of various agent frameworks on WebWalkerQA, xBench DS, TaskCraft, and GAIA

Chart: average pass@1 accuracy across WebWalkerQA, xBench DS, TaskCraft, and GAIA for Flash Searcher and MemEvolve with GPT-5 mini.

KEY INSIGHT

The Counterintuitive Finding

A memory architecture that MemEvolve evolved on TaskCraft still improves WebWalkerQA from 58.82 to 61.18 and xBench DS from 69.0 to 74.0, without any task-specific meta-evolution.

This is surprising because MemEvolve was motivated by the claim that no universally optimal memory architecture exists, yet a single evolved memory genotype transfers across multiple deep-research benchmarks and LLM backbones.

WHY IT MATTERS

What this unlocks for the field

MemEvolve shows that agent systems can automatically discover task-aware memory architectures by meta-evolving the Encode, Store, Retrieve, and Manage modules from interaction feedback.

Builders can plug a MemEvolve memory genotype into diverse frameworks such as SmolAgent, Flash Searcher, CK Pro, and OWL to gain improvements of 2.0 to 17.06% without hand-tuning memory pipelines for each benchmark.
