Memp: Exploring Agent Procedural Memory

Authors: Runnan Fang, Yuan Liang, Xiaobin Wang et al.

2025

TL;DR

Memp uses Build–Retrieve–Update procedural memories distilled from trajectories to cut ALFWorld test steps from 23.76 to 15.01 while raising success from 42.14% to 77.86%.

THE PROBLEM

Agents waste steps without reusable procedural memory (ALFWorld: steps ↓~37%, success ↑35.7 pts)

Memp targets agents whose procedural memory is brittle, manually engineered, or entangled in static parameters, leading to slow, inaccurate multi-step execution.

When TravelPlanner and ALFWorld tasks repeat structural patterns, agents still restart from scratch, wasting tokens and failing to reuse skills, tool sequences, and recovery tactics.

HOW IT WORKS

Memp — Build, Retrieve, and Update procedural memory

Memp centers on Build, Retrieve, and Update modules that transform trajectories into three memory formats (high-level scripts, raw trajectories, and a combined Proceduralization of both), stored in a procedural memory library.

You can think of Memp like a layered cache: raw trajectories are disk logs, abstract scripts are indexed functions, and retrieval is a semantic card catalog over prior runs.

This design lets Memp inject distilled procedural knowledge into the agent policy π_mp(a_t | s_t), enabling continual skill reuse far beyond what a plain context window or static prompt templates can support.
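The layered-cache analogy can be made concrete. The sketch below is a hypothetical illustration, not the paper's implementation: the class name, the toy bag-of-words embedding standing in for the encoder ϕ, and the entry layout are all assumptions. Each memory entry keeps a raw trajectory (the "disk log"), an abstract script (the "indexed function"), and an embedding key; retrieval ranks entries by cosine similarity (the "card catalog" lookup).

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words stand-in for the paper's learned encoder phi.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ProceduralMemory:
    """Hypothetical memory library: raw trajectories as 'disk logs',
    abstract scripts as 'indexed functions', embeddings as the catalog."""

    def __init__(self):
        self.entries = []

    def build(self, task, trajectory, script):
        # Build: store the trajectory and its abstract script under an embedding key.
        self.entries.append({"key": embed(task),
                             "trajectory": trajectory,
                             "script": script})

    def retrieve(self, new_task, k=1):
        # Retrieve: rank stored procedures by similarity to the new task.
        key = embed(new_task)
        return sorted(self.entries,
                      key=lambda e: cosine(key, e["key"]),
                      reverse=True)[:k]
```

With this sketch, a query like "place a clean mug on a desk" would surface a stored mug-cleaning procedure rather than an unrelated cooking one.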

DIAGRAM

Online interaction and memory update loop in Memp

This diagram shows how Memp uses trajectories and rewards to build, retrieve, and update procedural memory across sequential tasks.

DIAGRAM

Evaluation design for Build, Retrieve, and Update policies

This diagram shows how Memp evaluates different Build, Retrieve, and Update strategies on TravelPlanner and ALFWorld.

PROCESS

How Memp Handles a Multi-Task Agent Session

  1. Build

    In Build, Memp applies the builder B to each task trajectory τ and reward r, creating memory entries m_pt and aggregating them into the procedural memory library Mem.

  2. Retrieve

    In Retrieve, Memp encodes the new task t_new with the encoder ϕ and selects m_retrieved via cosine similarity, using the Key Query or Key AveFact strategy.

  3. Update

    In Update, Memp applies the update operator U, which combines Add, Del, and Update operations, choosing among Vanilla Memory Update, Validation, or Adjustment based on execution feedback E(t).

  4. Proceduralization

    In Proceduralization, Memp combines full trajectories with high-level scripts, then feeds this hybrid procedural memory into the agent policy π_mp(a_t | s_t).
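The four-step loop above can be sketched as a minimal session driver. This is a hypothetical sketch under stated assumptions, not the paper's code: `agent(task, hint)` is an assumed callable returning a (trajectory, reward) pair, retrieval is simplified to an exact task match, and the success/failure branches stand in for the Vanilla and Adjustment update strategies.

```python
def run_session(tasks, agent, memory=None):
    """Hypothetical Build-Retrieve-Update loop over sequential tasks.
    `agent(task, hint)` is an assumed callable -> (trajectory, reward)."""
    memory = {} if memory is None else memory
    log = []
    for task in tasks:
        # Retrieve: exact-match stand-in for the paper's cosine-similarity lookup.
        hint = memory.get(task)
        trajectory, reward = agent(task, hint)
        if reward > 0:
            # Build / Vanilla-style update: keep the successful procedure.
            memory[task] = trajectory
        elif task in memory:
            # Adjustment-style update: discard a procedure that led to failure.
            del memory[task]
        log.append((task, reward, hint is not None))
    return memory, log
```

On a repeated task, the second attempt runs with a retrieved hint, which is exactly the reuse effect the step reductions above measure.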

KEY CONTRIBUTIONS

Key Contributions

  • Task-agnostic procedural memory framework

    Memp formalizes procedural memory as Mem = Σ m_pt and integrates Build, Retrieve, and Update modules, turning trajectories into reusable skills across TravelPlanner and ALFWorld.

  • Systematic Build and Retrieve strategies

    Memp compares Script, Trajectory, and Proceduralization storage plus Random Sample, Key Query, and Key AveFact retrieval, showing Proceduralization with AveFact yields the strongest gains.

  • Online memory update mechanisms

    Memp introduces Vanilla Memory Update, Validation, and Adjustment, demonstrating that Reflexion-style Adjustment yields up to +0.7 reward and a 14-step reduction over the other update strategies.

RESULTS

By the Numbers

ALFWorld Test

77.86%

+35.72 over GPT-4o No Memory (42.14%)

ALFWorld Steps

15.01 steps

-8.75 steps vs GPT-4o No Memory (23.76)

TravelPlanner #CS

79.94

+8.01 over GPT-4o No Memory (71.93)

TravelPlanner Steps

14.62 steps

-3.22 steps vs GPT-4o No Memory (17.84)

On TravelPlanner and ALFWorld, which test long horizon tool use and embodied housework, Memp’s Proceduralization with GPT-4o boosts success and cuts steps relative to the No Memory baseline. These numbers show Memp converts prior trajectories into concrete efficiency and accuracy gains for agents.
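As a quick arithmetic check, the deltas quoted above follow directly from the raw numbers:

```python
# Sanity-checking the headline deltas against the raw figures reported above.
alfworld_gain = 77.86 - 42.14         # success-rate gain in points (reported +35.72)
alfworld_steps_saved = 23.76 - 15.01  # ALFWorld steps saved (reported -8.75)
tp_cs_gain = 79.94 - 71.93            # TravelPlanner #CS gain (reported +8.01)
tp_steps_saved = 17.84 - 14.62        # TravelPlanner steps saved (reported -3.22)
```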

BENCHMARK

ALFWorld Test performance with GPT-4o under different Build policies

Success rate (%) on ALFWorld Test for GPT-4o with No Memory, Script, Trajectory, and Proceduralization.

KEY INSIGHT

The Counterintuitive Finding

Procedural memory built by GPT-4o and transferred to Qwen2.5-14B raises TravelPlanner completion by 5% while cutting average steps by 1.6.

This is surprising because smaller models usually lag far behind, yet Memp shows a static memory bank from a stronger agent can materially boost weaker agents without retraining.

WHY IT MATTERS

What this unlocks for the field

Memp unlocks reusable, updatable procedural memory that grows across tasks, giving agents a concrete way to accumulate skills over time.

Builders can now treat trajectories as a shared procedural memory asset, distill it once with a strong agent, and plug it into weaker or specialized agents to gain efficiency and accuracy immediately.
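That "distill once with a strong agent, plug into a weaker one" workflow can be sketched as follows. The function names, library layout, and prompt format here are illustrative assumptions, not an interface from the paper:

```python
# Hypothetical cross-model memory transfer: a strong agent's successful
# procedures are distilled once, then injected into a weaker agent's prompt.
def build_library(strong_agent, tasks):
    """`strong_agent(task)` is an assumed callable -> (script, reward);
    only successful scripts are kept in the shared library."""
    return {t: s for t, (s, r) in ((t, strong_agent(t)) for t in tasks) if r > 0}

def weak_agent_prompt(task, library):
    # Prepend the transferred procedure (if any) to the weaker agent's prompt.
    hint = library.get(task, "no prior procedure")
    return f"Procedure hint: {hint}\nTask: {task}"
```

The transfer result above (GPT-4o memories boosting Qwen2.5-14B) corresponds to building the library with the strong model and consuming it through the weaker model's prompt.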

Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
