Procedural Memory

Procedural memory in LLM agents — learning skills, rules, and how-to knowledge from experience.

4 papers

Benchmark

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha · 2026

APEX-EM combines a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor to store and reuse full procedural-episodic traces without changing model weights. On KGQAGen-10k, APEX-EM reaches 89.6% accuracy (95.3% CSR) versus 41.3% without memory and surpasses the GPT-4o w/ SP oracle at 84.9%.

arXiv:2603.29093

RAG · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova, Björn Ross, Alexandra Birch · 2025

This study compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem as memory components spanning semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 rank of 1.83 with Llama 3.2-3B Instruct and 1.80 with GPT-4o mini while using around 1,000 tokens per query, versus over 23,000 for Full Context.

arXiv:2510.23730

Benchmark

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang et al. · 2025

Memp builds agent skills with Build, Retrieve, and Update modules, keeping past trajectories in a procedural memory library either verbatim, as scripts, or as proceduralizations that combine both. On ALFWorld, Memp’s proceduralization with GPT-4o reaches 77.86% test success versus 42.14% with no memory, while reducing steps from 23.76 to 15.01.

arXiv:2508.06433
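The Build, Retrieve, and Update modules named in the Memp entry can be illustrated with a toy procedural memory library. This is only a sketch under stated assumptions, not the paper's implementation: the class and method names, the word-overlap (Jaccard) retrieval, and the success-rate threshold used for updates are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Procedure:
    task: str               # task description the procedure came from
    steps: List[str]        # ordered actions that solved the task
    successes: int = 0      # reuse attempts that succeeded
    failures: int = 0       # reuse attempts that failed


@dataclass
class ProceduralMemory:
    # Hypothetical library; a real system would likely use embeddings.
    library: List[Procedure] = field(default_factory=list)

    def build(self, task: str, trajectory: List[str]) -> Procedure:
        """Build: store a past trajectory as a reusable procedure."""
        proc = Procedure(task=task, steps=list(trajectory))
        self.library.append(proc)
        return proc

    def retrieve(self, task: str) -> Optional[Procedure]:
        """Retrieve: pick the procedure whose task shares the most words."""
        query = set(task.lower().split())
        best, best_score = None, 0.0
        for proc in self.library:
            words = set(proc.task.lower().split())
            union = query | words
            score = len(query & words) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = proc, score
        return best

    def update(self, proc: Procedure, succeeded: bool,
               min_trials: int = 3, min_rate: float = 0.5) -> None:
        """Update: record the outcome; drop procedures that keep failing."""
        if succeeded:
            proc.successes += 1
        else:
            proc.failures += 1
        trials = proc.successes + proc.failures
        if trials >= min_trials and proc.successes / trials < min_rate:
            self.library = [p for p in self.library if p is not proc]


memory = ProceduralMemory()
memory.build("heat an egg and put it on the table",
             ["find egg", "take egg", "heat egg with microwave",
              "put egg on table"])
memory.build("clean a mug and put it in the cabinet",
             ["find mug", "take mug", "clean mug with sink",
              "put mug in cabinet"])

# A new but similar task reuses the closest stored procedure.
hit = memory.retrieve("heat a potato and put it on the table")
print(hit.steps)
```

In this toy setup the query about heating a potato shares more words with the stored egg task than with the mug task, so the egg procedure is returned as a starting plan. Repeated failed reuses passed to `update` would eventually remove it from the library.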

Benchmark

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng et al. · 2025

ReMe manages procedural memory through experience acquisition, experience reuse, and experience refinement, combining multi-faceted distillation, context-adaptive reuse, and utility-based deletion into a single lifecycle. On BFCL-V3 and AppWorld, Qwen3-8B with ReMe (dynamic) achieves 34.94% Avg@4 versus 27.65% for the No Memory baseline, and 55.03% Pass@4 versus 46.20%, suggesting that self-evolving memory can substitute for model scale.

arXiv:2512.10696
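The utility-based deletion mentioned in the ReMe entry can be pictured roughly as follows. This is a hedged sketch, not ReMe's actual rule: the `Experience` fields, the `record_use` and `prune` helpers, and both thresholds are assumptions made for illustration. The idea shown is that an experience is dropped only once it has been retrieved often enough to judge and has rarely helped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    text: str            # distilled lesson stored in memory
    retrieved: int = 0   # times the lesson was injected into a prompt
    helped: int = 0      # times the task then succeeded


def record_use(exp: Experience, task_succeeded: bool) -> None:
    """Track one reuse of an experience and whether the task succeeded."""
    exp.retrieved += 1
    if task_succeeded:
        exp.helped += 1


def prune(pool: List[Experience], min_retrievals: int = 5,
          min_utility: float = 0.4) -> List[Experience]:
    """Keep experiences that are either untested or useful often enough.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    kept = []
    for exp in pool:
        tested = exp.retrieved >= min_retrievals
        if tested and exp.helped / exp.retrieved < min_utility:
            continue  # frequently retrieved but rarely helpful: delete
        kept.append(exp)
    return kept


pool = [
    Experience("check the auth token before calling the API",
               retrieved=6, helped=5),
    Experience("always retry the same call twice", retrieved=6, helped=1),
    Experience("new, not yet tested tip"),
]
kept = prune(pool)
print([exp.text for exp in kept])
```

Requiring a minimum number of retrievals before deletion gives newly acquired experiences a chance to prove useful instead of being discarded on their first failure.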