Memp: Exploring Agent Procedural Memory

Authors: Runnan Fang, Yuan Liang, Xiaobin Wang et al.

2025

TL;DR

Memp uses Build–Retrieve–Update procedural memories distilled from trajectories to cut ALFWorld test steps from 23.76 to 15.01 while raising success from 42.14% to 77.86%.

THE PROBLEM

Agents waste steps without reusable procedural memory (ALFWorld: steps ↓~37%, success ↑35.7 pts)

Memp targets agents whose procedural memory is brittle, manually engineered, or entangled in static parameters, leading to slow, inaccurate multi-step execution.

When TravelPlanner and ALFWorld tasks repeat structural patterns, agents still restart from scratch, wasting tokens and failing to reuse skills, tool sequences, and recovery tactics.

HOW IT WORKS

Memp — Build, Retrieve, and Update procedural memory

Memp centers on Build, Retrieve, and Update modules that transform trajectories into three memory formats (high-level scripts, raw trajectories, and a combined Proceduralization of both), stored in a procedural memory library.

You can think of Memp like a layered cache: raw trajectories are disk logs, abstract scripts are indexed functions, and retrieval is a semantic card catalog over prior runs.

This design lets Memp inject distilled procedural knowledge into the agent policy π_mp(a_t | s_t), enabling continual skill reuse far beyond what a plain context window or static prompt templates can support.
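The layered-cache analogy can be made concrete. The sketch below is a hypothetical illustration, not the paper's implementation: the class name, the toy bag-of-words embedding standing in for the encoder ϕ, and the entry layout are all assumptions. Each memory entry keeps a raw trajectory (the "disk log"), an abstract script (the "indexed function"), and an embedding key; retrieval ranks entries by cosine similarity (the "card catalog" lookup).

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words stand-in for the paper's learned encoder phi.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ProceduralMemory:
    """Hypothetical memory library: raw trajectories as 'disk logs',
    abstract scripts as 'indexed functions', embeddings as the catalog."""

    def __init__(self):
        self.entries = []

    def build(self, task, trajectory, script):
        # Build: store the trajectory and its abstract script under an embedding key.
        self.entries.append({"key": embed(task),
                             "trajectory": trajectory,
                             "script": script})

    def retrieve(self, new_task, k=1):
        # Retrieve: rank stored procedures by similarity to the new task.
        key = embed(new_task)
        return sorted(self.entries,
                      key=lambda e: cosine(key, e["key"]),
                      reverse=True)[:k]
```

With this sketch, a query like "place a clean mug on a desk" would surface a stored mug-cleaning procedure rather than an unrelated cooking one.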

DIAGRAM

Online interaction and memory update loop in Memp

This diagram shows how Memp uses trajectories and rewards to build, retrieve, and update procedural memory across sequential tasks.

DIAGRAM

Evaluation design for Build, Retrieve, and Update policies

This diagram shows how Memp evaluates different Build, Retrieve, and Update strategies on TravelPlanner and ALFWorld.

PROCESS

How Memp Handles a Multi-Task Agent Session

  1. Build

    In Build, Memp applies the builder B to each task trajectory τ and reward r, creating memory entries m_pt and aggregating them into the procedural memory library Mem.

  2. Retrieve

    In Retrieve, Memp encodes the new task t_new with the encoder ϕ and selects m_retrieved via cosine similarity, using the Key Query or Key AveFact strategy.

  3. Update

    In Update, Memp applies the update operator U, which combines Add, Del, and Update operations, choosing among Vanilla Memory Update, Validation, or Adjustment based on execution feedback E(t).

  4. Proceduralization

    In Proceduralization, Memp combines full trajectories with high-level scripts, then feeds this hybrid procedural memory into the agent policy π_mp(a_t | s_t).
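The four-step loop above can be sketched as a minimal session driver. This is a hypothetical sketch under stated assumptions, not the paper's code: `agent(task, hint)` is an assumed callable returning a (trajectory, reward) pair, retrieval is simplified to an exact task match, and the success/failure branches stand in for the Vanilla and Adjustment update strategies.

```python
def run_session(tasks, agent, memory=None):
    """Hypothetical Build-Retrieve-Update loop over sequential tasks.
    `agent(task, hint)` is an assumed callable -> (trajectory, reward)."""
    memory = {} if memory is None else memory
    log = []
    for task in tasks:
        # Retrieve: exact-match stand-in for the paper's cosine-similarity lookup.
        hint = memory.get(task)
        trajectory, reward = agent(task, hint)
        if reward > 0:
            # Build / Vanilla-style update: keep the successful procedure.
            memory[task] = trajectory
        elif task in memory:
            # Adjustment-style update: discard a procedure that led to failure.
            del memory[task]
        log.append((task, reward, hint is not None))
    return memory, log
```

On a repeated task, the second attempt runs with a retrieved hint, which is exactly the reuse effect the step reductions above measure.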

KEY CONTRIBUTIONS

Key Contributions

  • Task-agnostic procedural memory framework

    Memp formalizes procedural memory as Mem = Σ m_pt and integrates Build, Retrieve, and Update modules, turning trajectories into reusable skills across TravelPlanner and ALFWorld.

  • Systematic Build and Retrieve strategies

    Memp compares Script, Trajectory, and Proceduralization storage plus Random Sample, Key Query, and Key AveFact retrieval, showing Proceduralization with AveFact yields the strongest gains.

  • Online memory update mechanisms

    Memp introduces Vanilla Memory Update, Validation, and Adjustment, demonstrating that Reflexion-style Adjustment yields up to +0.7 reward and a 14-step reduction over the other update strategies.

RESULTS

By the Numbers

ALFWorld Test

77.86%

+35.72 over GPT-4o No Memory (42.14%)

ALFWorld Steps

15.01 steps

-8.75 steps vs GPT-4o No Memory (23.76)

TravelPlanner #CS

79.94

+8.01 over GPT-4o No Memory (71.93)

TravelPlanner Steps

14.62 steps

-3.22 steps vs GPT-4o No Memory (17.84)

On TravelPlanner and ALFWorld, which test long horizon tool use and embodied housework, Memp’s Proceduralization with GPT-4o boosts success and cuts steps relative to the No Memory baseline. These numbers show Memp converts prior trajectories into concrete efficiency and accuracy gains for agents.
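As a quick arithmetic check, the deltas quoted above follow directly from the raw numbers:

```python
# Sanity-checking the headline deltas against the raw figures reported above.
alfworld_gain = 77.86 - 42.14         # success-rate gain in points (reported +35.72)
alfworld_steps_saved = 23.76 - 15.01  # ALFWorld steps saved (reported -8.75)
tp_cs_gain = 79.94 - 71.93            # TravelPlanner #CS gain (reported +8.01)
tp_steps_saved = 17.84 - 14.62        # TravelPlanner steps saved (reported -3.22)
```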

BENCHMARK

ALFWorld Test performance with GPT-4o under different Build policies

Success rate (%) on ALFWorld Test for GPT-4o with No Memory, Script, Trajectory, and Proceduralization.

KEY INSIGHT

The Counterintuitive Finding

Procedural memory built by GPT-4o and transferred to Qwen2.5-14B raises TravelPlanner completion by 5% while cutting average steps by 1.6.

This is surprising because smaller models usually lag far behind, yet Memp shows a static memory bank from a stronger agent can materially boost weaker agents without retraining.

WHY IT MATTERS

What this unlocks for the field

Memp unlocks reusable, updatable procedural memory that grows across tasks, giving agents a concrete way to accumulate skills over time.

Builders can now treat trajectories as a shared procedural memory asset, distill it once with a strong agent, and plug it into weaker or specialized agents to gain efficiency and accuracy immediately.
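That "distill once with a strong agent, plug into a weaker one" workflow can be sketched as follows. The function names, library layout, and prompt format here are illustrative assumptions, not an interface from the paper:

```python
# Hypothetical cross-model memory transfer: a strong agent's successful
# procedures are distilled once, then injected into a weaker agent's prompt.
def build_library(strong_agent, tasks):
    """`strong_agent(task)` is an assumed callable -> (script, reward);
    only successful scripts are kept in the shared library."""
    return {t: s for t, (s, r) in ((t, strong_agent(t)) for t in tasks) if r > 0}

def weak_agent_prompt(task, library):
    # Prepend the transferred procedure (if any) to the weaker agent's prompt.
    hint = library.get(task, "no prior procedure")
    return f"Procedure hint: {hint}\nTask: {task}"
```

The transfer result above (GPT-4o memories boosting Qwen2.5-14B) corresponds to building the library with the strong model and consuming it through the weaker model's prompt.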

Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
