Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Authors: Zouying Cao, Jiaji Deng, Li Yu et al.

2025

TL;DR

ReMe uses multi-faceted experience distillation plus utility-based refinement to turn procedural memory into a self-evolving pool, letting Qwen3-8B + ReMe (dynamic) reach 55.03% Pass@4 vs 46.20% without memory (+8.83 points).

THE PROBLEM

Static procedural memories become noisy passive archives

Procedural memory in many LLM agents follows a “passive accumulation” paradigm, where experiences are stored as static, append-only archives without refinement.

These static systems often mix valid insights with toxic noise, causing agents to misgeneralize on new tasks and degrading performance over time in complex agentic environments.

HOW IT WORKS

ReMe — multi-faceted distillation, adaptive reuse, and utility-based refinement

ReMe centers on experience acquisition, experience reuse, and experience refinement, combining multi-faceted distillation, scenario-aware indexing, and utility-based deletion into one closed-loop procedural memory lifecycle.

You can think of ReMe like a brain that not only remembers skills but also rewrites and prunes them, more like a constantly edited playbook than a static log file.

This design lets ReMe maintain a compact, high-utility experience pool that adapts to shifting tasks, enabling behaviors a plain context window or raw trajectory store cannot support.
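
To make the closed loop concrete, here is a minimal Python sketch of how the three stages might be wired together; the method names (`pool.reuse`, `pool.refine`, `llm_execute.run`) are illustrative assumptions rather than the paper's actual interfaces.

```python
# Minimal sketch of ReMe's closed-loop lifecycle (names are hypothetical).
# Each task flows through reuse -> experience-driven execution -> refinement,
# so the experience pool is continuously rewritten and pruned, not just grown.

def solve_with_reme(task, pool, llm_execute):
    # Experience reuse: pull relevant experiences and adapt them to this task.
    guidance = pool.reuse(task)

    # Experience-driven inference: the execution LLM reasons with the guidance.
    trajectory = llm_execute.run(task, guidance=guidance)

    # Experience refinement: add distilled lessons, reflect on failures,
    # and prune low-utility entries based on retrieval/success statistics.
    pool.refine(task, trajectory)

    return trajectory
```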

DIAGRAM

Experience reuse pipeline for a new task

This diagram shows how ReMe retrieves, reranks, rewrites, and applies experiences during experience reuse for a single task query.
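
As a rough illustration of that pipeline, the sketch below strings the retrieve, rerank, and rewrite stages together; `search_by_scenario`, `score_relevance`, and `rewrite_as_guidance` are hypothetical helpers standing in for the paper's components.

```python
# Illustrative experience-reuse pipeline: retrieve -> (optional) rerank -> rewrite.
# Helper names and the scoring scheme are assumptions, not ReMe's exact API.

def reuse_experiences(query, pool, llm, top_k=5, rerank=True):
    # Retrieve candidates by matching the query against stored usage scenarios.
    candidates = pool.search_by_scenario(query, limit=top_k)

    # Optionally rerank candidates, e.g. by LLM-judged relevance to the task.
    if rerank:
        candidates = sorted(
            candidates,
            key=lambda exp: llm.score_relevance(query, exp),
            reverse=True,
        )

    # Adaptively rewrite the retrieved experiences into task-specific guidance
    # that gets injected into the executor's context.
    return llm.rewrite_as_guidance(query, candidates)
```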

DIAGRAM

Evaluation and ablation design for ReMe

This diagram shows how ReMe is evaluated on BFCL-V3 and AppWorld, including fixed vs dynamic settings and key ablations.
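
For intuition, a small sketch of how the two settings might differ, assuming "fixed" evaluates against a frozen experience pool while "dynamic" keeps refinement running between tasks (an interpretation of the naming, not a confirmed detail of the paper):

```python
# Hedged sketch of the fixed vs dynamic evaluation settings, assuming
# "dynamic" means refinement keeps updating the pool during evaluation.

def evaluate(tasks, pool, llm_execute, dynamic=False):
    successes = []
    for task in tasks:
        guidance = pool.reuse(task)
        trajectory = llm_execute.run(task, guidance=guidance)
        successes.append(trajectory.success)

        if dynamic:
            # Only the dynamic setting keeps evolving procedural memory online.
            pool.refine(task, trajectory)

    return sum(successes) / len(successes)
```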

PROCESS

How ReMe Handles a Task — the procedural memory lifecycle

  1. Experience Acquisition

    ReMe samples multiple execution trajectories with the execution model (LLM_execute), then applies multi-faceted distillation during experience acquisition to build structured experiences with scenarios, content, keywords, confidence, and tools (see the sketch after this list).

  2. Experience Reuse

    Given a new task query, ReMe performs experience reuse by retrieving the top-K experiences via usage-scenario indexing, optionally reranking them, and adaptively rewriting them into task-specific guidance.

  3. Experience Refinement

    After task execution, ReMe runs experience refinement with selective addition from successful trajectories, failure-aware reflection for retries, and utility-based deletion driven by retrieval and success statistics (also shown in the sketch after this list).

  4. Experience-Driven Inference

    Throughout the lifecycle, ReMe enables experience-driven inference, where LLM_execute reasons with distilled experiences and progressively evolves its procedural memory across BFCL-V3 and AppWorld tasks.
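
The sketch referenced in steps 1 and 3: a possible shape for a structured experience record plus a utility-based pruning pass. Field names and thresholds are assumptions for illustration, not the schema ReMe actually uses.

```python
# Possible shape of a structured experience plus utility-based deletion
# (steps 1 and 3 above). Field names and thresholds are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Experience:
    scenario: str                # usage scenario used for indexing and retrieval
    content: str                 # distilled success pattern or failure lesson
    keywords: list[str]
    confidence: float
    tools: list[str] = field(default_factory=list)
    retrievals: int = 0          # how often this entry has been retrieved
    successes: int = 0           # how often its retrieval coincided with success

def prune_low_utility(pool, min_retrievals=5, min_success_rate=0.3):
    """Utility-based deletion: drop experiences that are retrieved often but
    rarely coincide with task success (thresholds assumed, not the paper's)."""
    kept = []
    for exp in pool:
        if exp.retrievals >= min_retrievals:
            if exp.successes / exp.retrievals < min_success_rate:
                continue  # prune noisy or misleading experience
        kept.append(exp)
    return kept
```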

KEY CONTRIBUTIONS

Key Contributions

  • ReMe framework for agent evolution

    ReMe integrates experience acquisition, experience reuse, and experience refinement into a closed loop, enabling agents to autonomously distill and maintain high-quality procedural experiences across tasks.

  • reme.library procedural memory dataset

    ReMe introduces reme.library, a fine-grained procedural memory dataset of structured success patterns and failure lessons distilled from diverse agentic tasks for studying procedural memory.

  • Memory scaling effect across Qwen3 models

    ReMe shows that Qwen3-8B + ReMe (dynamic) reaches 55.03% Pass@4 vs 54.65% for Qwen3-14B without memory, and that Qwen3-14B + ReMe (dynamic) surpasses Qwen3-32B without memory on both Avg@4 and Pass@4.

RESULTS

By the Numbers

Avg@4

34.94% Avg@4

+7.29 over No Memory (Qwen3-8B average Avg@4 27.65%)

Pass@4

55.03% Pass@4

+8.83 over No Memory (Qwen3-8B average Pass@4 46.20%)

BFCL-V3 Avg@4

45.17% Avg@4

+4.84 over No Memory on BFCL-V3 with Qwen3-8B (40.33%)

AppWorld Pass@4

42.06% Pass@4

+9.21 over No Memory on AppWorld with Qwen3-8B (32.85%)

ReMe is evaluated on BFCL-V3 and AppWorld, both tool-augmented multi-turn benchmarks testing complex agentic tasks. The gains in Avg@4 and Pass@4 show that ReMe’s self-evolving procedural memory substantially improves success probability over No Memory baselines.
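
For reference, Avg@4 and Pass@4 here follow the usual sampling-metric definitions: Avg@k averages the success rate over k independent runs per task, while Pass@k counts a task as solved if any of the k runs succeeds, as in this small sketch (assuming the standard definitions apply to the paper's setup):

```python
# Avg@k and Pass@k under the standard definitions (assumed to match the paper).

def avg_at_k(runs_per_task):
    # Mean success rate over the k runs of each task, averaged across tasks.
    return sum(sum(runs) / len(runs) for runs in runs_per_task) / len(runs_per_task)

def pass_at_k(runs_per_task):
    # Fraction of tasks where at least one of the k runs succeeded.
    return sum(any(runs) for runs in runs_per_task) / len(runs_per_task)

# Example with two tasks and four runs each:
runs = [[True, False, False, True], [False, False, False, True]]
print(avg_at_k(runs))   # 0.375
print(pass_at_k(runs))  # 1.0
```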

BENCHMARK

Performance comparison on BFCL-V3 with Qwen3-8B

Avg@4 on BFCL-V3 for Qwen3-8B with different memory systems.

BENCHMARK

Extraction granularity ablation on BFCL-V3

Avg@4 for Qwen3-8B with trajectory-level vs keypoint-level experience acquisition in ReMe (fixed).

KEY INSIGHT

The Counterintuitive Finding

ReMe enables Qwen3-8B + ReMe (dynamic) to reach 55.03% Pass@4, slightly above Qwen3-14B without memory at 54.65% Pass@4.

This is surprising because the larger Qwen3-14B would be expected to win outright, but ReMe shows that high-quality procedural memory can offset a model-size advantage.

WHY IT MATTERS

What this unlocks for the field

ReMe unlocks self-evolving procedural memory where agents continuously distill, adapt, and prune experiences instead of hoarding static trajectories.

Builders can now deploy smaller backbones with ReMe to match or surpass larger No Memory systems, making long-horizon, tool-augmented agents more practical under tight compute budgets.

Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
