MemEvolve: Meta-Evolution of Agent Memory Systems

Authors: Guibin Zhang, Haotian Ren, Chong Zhan et al.

2025

TL;DR

MemEvolve applies a diagnose-and-design meta-evolution over the Encode, Store, Retrieve, and Manage modules of an agent's memory, raising Flash Searcher pass@1 on xBench DS from 69.0 to 74.0.

THE PROBLEM

Static memory architectures cap self-evolution, even though gains of up to 17.06 percent are possible

Existing self-improving agents rely on fixed memory pipelines: memory lets the agent evolve, but the memory architecture itself never adapts to new tasks.

MemEvolve targets this gap. It shows that frameworks such as SmolAgent and Flash Searcher can gain up to 17.06 percent when their memory architectures meta-evolve instead of staying static and misaligned with the task.

HOW IT WORKS

MemEvolve: dual evolution over Encode, Store, Retrieve, and Manage

MemEvolve treats memory as four modular components (Encode, Store, Retrieve, and Manage) and meta-evolves their programmatic implementations, which together form a memory genotype.

You can think of MemEvolve like a student who not only writes better notes over time but also keeps redesigning their notebook layout and filing system.

This diagnose-and-design evolution lets MemEvolve discover architectures with task-aware memory behavior, beyond what a fixed context window or a single handcrafted memory system can offer.
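To make the idea of a memory genotype concrete, here is a minimal Python sketch. It is an illustration only, not the EvolveLab interface: the class names `MemoryGenotype` and `Memory`, the four toy module functions, and the choice to call Manage after every write are all assumptions made for this example.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass(frozen=True)
class MemoryGenotype:
    """Hypothetical container: four swappable module implementations."""
    encode: Callable[[str], List[str]]
    store: Callable[[List[str], List[str]], List[str]]
    retrieve: Callable[[List[str], str, int], List[str]]
    manage: Callable[[List[str]], List[str]]


@dataclass
class Memory:
    """Memory state driven entirely by the genotype it is given."""
    genotype: MemoryGenotype
    state: List[str] = field(default_factory=list)

    def write(self, trajectory: str) -> None:
        # Encode the trajectory, store the items, then let Manage tidy up.
        items = self.genotype.encode(trajectory)
        self.state = self.genotype.store(self.state, items)
        self.state = self.genotype.manage(self.state)

    def read(self, query: str, k: int = 1) -> List[str]:
        return self.genotype.retrieve(self.state, query, k)


# Toy module implementations (assumptions, chosen to stay dependency-free).
def encode_lines(trajectory: str) -> List[str]:
    return [line.strip() for line in trajectory.splitlines() if line.strip()]


def store_append(state: List[str], items: List[str]) -> List[str]:
    return state + items


def retrieve_overlap(state: List[str], query: str, k: int) -> List[str]:
    # Rank stored items by how many words they share with the query.
    words = set(query.lower().split())
    ranked = sorted(state, key=lambda item: -len(words & set(item.lower().split())))
    return ranked[:k]


def manage_dedup(state: List[str]) -> List[str]:
    # Drop exact duplicates while keeping first-seen order.
    return list(dict.fromkeys(state))


baseline = MemoryGenotype(encode_lines, store_append, retrieve_overlap, manage_dedup)

# A "mutation": same Encode/Store/Retrieve, but Manage now keeps only the newest item.
variant = replace(baseline, manage=lambda state: manage_dedup(state)[-1:])
```

Because each module is just a field, swapping one implementation yields a new candidate architecture without touching the agent code that calls `write` and `read`. This is the property an architectural search would rely on.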

DIAGRAM

Dual evolution loop between trajectories and memory architectures

This diagram shows how MemEvolve alternates an inner experience evolution loop with an outer architectural evolution over candidate memory systems.

DIAGRAM

Evaluation pipeline across GAIA, WebWalkerQA, xBench DS, and TaskCraft

This diagram shows how MemEvolve is evolved on TaskCraft, then evaluated on and transferred to GAIA, WebWalkerQA, and xBench DS with different agent frameworks.

PROCESS

How MemEvolve Handles a Dual Evolution Iteration

  1. Inner Loop Experience Evolution

    MemEvolve runs agents with a fixed memory architecture Ω_j^(k), using Encode, Store, and Retrieve to update the memory state along trajectories and collect the feedback metrics f_j.

  2. Aggregation Operator S

    MemEvolve applies the aggregation operator S to summarize trajectory-level feedback into vectors F_j^(k) that capture performance, cost, and delay for each candidate.

  3. Architectural Selection

    MemEvolve uses Pareto ranking over F_j^(k) to select a Top-K parent set P^(k) that balances task success, token cost, and latency.

  4. Diagnose and Design Evolution

    MemEvolve diagnoses each parent with D(Ω_p^(k)) and designs S descendants by modifying the Encode, Store, Retrieve, and Manage implementations within the modular design space.
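The four steps above can be sketched as a single iteration in Python. This is a toy under stated assumptions, not the paper's implementation: a candidate architecture is reduced to one hypothetical knob (`retrieve_k`), the inner loop is a deterministic stand-in for real agent runs, aggregation is a plain mean, ties in Pareto rank are broken by success rate, and `diagnose` and `design` are simple rules where the real system would rewrite module code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Feedback = Tuple[float, float, float]  # (success, token_cost, latency)


@dataclass(frozen=True)
class Candidate:
    name: str
    retrieve_k: int  # hypothetical design knob standing in for module code


def run_inner_loop(cand: Candidate, tasks: List[float]) -> List[Feedback]:
    """Step 1 (toy): one feedback tuple per task; tasks are difficulty scores."""
    return [
        (1.0 if cand.retrieve_k * 0.2 >= difficulty else 0.0,
         100.0 * cand.retrieve_k,
         1.0 + 0.5 * cand.retrieve_k)
        for difficulty in tasks
    ]


def aggregate(per_trajectory: List[Feedback]) -> Feedback:
    """Step 2: average trajectory-level feedback into one summary vector."""
    n = len(per_trajectory)
    return tuple(sum(f[i] for f in per_trajectory) / n for i in range(3))


def dominates(a: Feedback, b: Feedback) -> bool:
    """Higher success is better; lower cost and latency are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_rank(summaries: List[Feedback]) -> List[int]:
    """Rank = number of candidates dominating this one (0 = Pareto front)."""
    return [sum(dominates(other, s) for other in summaries) for s in summaries]


def select_parents(cands, summaries, top_k):
    """Step 3: sort by Pareto rank, break ties by success (an assumption)."""
    ranks = pareto_rank(summaries)
    order = sorted(range(len(cands)), key=lambda i: (ranks[i], -summaries[i][0]))
    return [cands[i] for i in order[:top_k]]


def diagnose(summary: Feedback) -> str:
    """Step 4a (toy rule): name the main weakness of a parent."""
    return "low_success" if summary[0] < 0.8 else "high_cost"


def design(parent: Candidate, diagnosis: str, s: int) -> List[Candidate]:
    """Step 4b (toy rule): produce s descendants that address the diagnosis."""
    step = 1 if diagnosis == "low_success" else -1
    return [Candidate(f"{parent.name}.{i}", max(1, parent.retrieve_k + step * i))
            for i in range(1, s + 1)]


def evolve_once(cands, tasks, top_k, s):
    summaries = {c: aggregate(run_inner_loop(c, tasks)) for c in cands}
    parents = select_parents(cands, [summaries[c] for c in cands], top_k)
    children = [child for p in parents for child in design(p, diagnose(summaries[p]), s)]
    return parents, children
```

Calling `evolve_once([Candidate("a", 1), Candidate("b", 3)], [0.1, 0.5, 0.9], top_k=1, s=2)` keeps the higher-success candidate as the parent and returns two descendants with a larger `retrieve_k`. The next iteration would evaluate those descendants in the inner loop.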

KEY CONTRIBUTIONS

Key Contributions

  • Unified Codebase EvolveLab

    MemEvolve builds on EvolveLab, which re-implements twelve self-improving memory systems under a shared Encode, Store, Retrieve, Manage interface and supports GAIA xBench DeepResearchBench and TaskCraft.

  • Meta-Evolution Framework

    MemEvolve introduces a dual evolution process with diagnose-and-design evolution that meta-learns memory architectures rather than fixing Encode, Store, Retrieve, and Manage by hand.

  • Experimental Evaluation

    MemEvolve improves frameworks such as SmolAgent and Flash Searcher by up to 17.06 percent and transfers memory architectures across tasks, frameworks, and LLM backbones such as GPT 5 mini, Kimi K2, and DeepSeek V3.2.

RESULTS

By the Numbers

WebWalkerQA: 61.18% (+2.36 over Smolagents GPT 5 mini)

xBench DS: 74.0% (+5.0 over Flash Searcher GPT 5 mini)

TaskCraft: 72.00% (+2.33 over Flash Searcher GPT 5 mini)

GAIA pass@3: 80.61% (Flash Searcher MemEvolve GPT 5 mini vs 73.94 baseline)

These results on WebWalkerQA, xBench DeepSearch, TaskCraft, and GAIA show that MemEvolve consistently improves pass rates while keeping per-task API cost at roughly $0.136 to $0.141 and latency comparable to other self-improving memories.
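The deltas quoted here are differences in percentage points. The short Python sketch below recomputes them from the baseline and MemEvolve scores given in this summary and also shows the corresponding relative gains. The `gains` helper is a hypothetical name. TaskCraft is left out because its baseline is not stated explicitly, and this summary does not say whether the headline 17.06 percent is an absolute or a relative figure.

```python
# (baseline, with MemEvolve) pairs as quoted in this summary.
reported = {
    "WebWalkerQA pass@1": (58.82, 61.18),
    "xBench DS pass@1": (69.0, 74.0),
    "GAIA pass@3": (73.94, 80.61),
}


def gains(baseline: float, evolved: float):
    """Return (absolute gain in points, relative gain in percent), rounded to 2 places."""
    absolute = round(evolved - baseline, 2)
    relative = round(100 * (evolved - baseline) / baseline, 2)
    return absolute, relative


for name, (baseline, evolved) in reported.items():
    absolute, relative = gains(baseline, evolved)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

The distinction matters when comparing benchmarks: the same absolute gain is a larger relative gain on a benchmark with a lower baseline.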

BENCHMARK

Performance of various agent frameworks on WebWalkerQA, xBench DS, TaskCraft, and GAIA

Average pass@1 accuracy across WebWalkerQA, xBench DS, TaskCraft, and GAIA for Flash Searcher and MemEvolve with GPT 5 mini.

KEY INSIGHT

The Counterintuitive Finding

A memory architecture that MemEvolve evolved on TaskCraft still improves WebWalkerQA from 58.82 to 61.18 and xBench DS from 69.0 to 74.0, without any task-specific meta-evolution.

This is surprising because MemEvolve was motivated by the claim that no universally optimal memory architecture exists, yet a single evolved memory genotype transfers across multiple deep research benchmarks and LLM backbones.

WHY IT MATTERS

What this unlocks for the field

MemEvolve shows that agent systems can automatically discover task-aware memory architectures by meta-evolving the Encode, Store, Retrieve, and Manage modules using interaction feedback.

Builders can now plug a MemEvolve memory genotype into diverse frameworks such as SmolAgent, Flash Searcher, CK Pro, and OWL to gain improvements of 2.0 to 17.06 percent without hand-tuning memory pipelines for each benchmark.
