Episodic Memory

Episodic memory in AI agents — storing, retrieving, and reasoning over past experiences and events.

8 papers

Benchmark

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

· 2026

APEX-EM combines a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor to store and reuse full procedural-episodic traces without changing model weights. On KGQAGen-10k, APEX-EM reaches 89.6% accuracy (95.3% CSR) versus 41.3% without memory and surpasses the GPT-4o w/ SP oracle at 84.9%.

arXiv:2603.29093

Benchmark

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Jie Zhou et al.

· 2026

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL) to jointly evolve memory and policy with retrieval as an explicit action. On SciWorld, PROACTAGENT reaches 73.50% SR versus 55.50% for GRPO+Reflexion, while cutting interaction rounds from 27.52 to 18.38.

arXiv:2604.20572

Benchmark, Agent Memory, Long-Term Memory

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Benjamin Stern, Peter Nadel

· 2026

Drawing on Memory uses dual-trace memory encoding, an evidence scoring gate, and a three-state retrieval protocol to store paired fact and scene traces in Letta’s archival memory. On LongMemEval-S, the approach reaches 73.7% accuracy versus 53.5% for the fact-only C7-control baseline, a gain of 20.2 percentage points concentrated in temporal, update, and multi-session questions.

arXiv:2604.12948
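
To make the idea of paired traces concrete, here is a minimal Python sketch of a record that keeps a fact together with its scene, plus a simple score gate. Everything in it is an illustrative assumption: the names `DualTrace` and `evidence_gate`, the reading of "scene" as the context in which the fact came up, and the threshold rule are hypothetical and are not taken from the paper or from Letta's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualTrace:
    """One stored memory: a fact paired with its scene (hypothetical layout)."""
    fact: str   # the stated piece of information
    scene: str  # assumed meaning: the context in which the fact came up


def evidence_gate(scored_traces, threshold=0.5):
    """Keep traces whose evidence score meets the threshold.

    Both the scoring scale and the threshold value are assumptions
    made for illustration only.
    """
    return [trace for trace, score in scored_traces if score >= threshold]


candidates = [
    (DualTrace("User moved to Lisbon", "Mentioned while planning a trip"), 0.9),
    (DualTrace("User likes tea", "Passing remark"), 0.2),
]
kept = evidence_gate(candidates)
```

With the example scores above, only the first pair passes the gate.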

Benchmark, Agent Memory

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang et al.

· 2026

Experience Compression Spectrum organizes Level 0 Raw Trace, Level 1 Episodic Memory, Level 2 Procedural Skill, and Level 3 Declarative Rule into a unified scaffold-level compression framework. Its mapping of 20+ systems, alongside a <1% cross-citation rate, indicates that the surveyed systems each fix a single compression level and that none performs adaptive cross-level compression.

arXiv:2604.15877
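
The four levels named in this entry form an ordered scale, which can be pictured as an integer-valued enum. The sketch below is only an illustration: the level names and numbers come from the summary, while the class name `CompressionLevel` and the helper `is_more_compressed` are hypothetical.

```python
from enum import IntEnum


class CompressionLevel(IntEnum):
    """Levels 0 to 3 as listed in the summary; the enum itself is illustrative."""
    RAW_TRACE = 0
    EPISODIC_MEMORY = 1
    PROCEDURAL_SKILL = 2
    DECLARATIVE_RULE = 3


def is_more_compressed(a: CompressionLevel, b: CompressionLevel) -> bool:
    """Return True if level `a` sits higher on the scale than level `b`."""
    return a > b


levels = [level.name for level in CompressionLevel]
```

A system that fixes a single level would, in this picture, always store experience at one enum value and never move between values.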

RAG, Benchmark, Agent Memory, Long-Term Memory, Memory Architecture

Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova, Björn Ross, Alexandra Birch

· 2025

This study compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem memory components across semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 ranking of 1.83 for Llama 3.2-3B Instruct and 1.80 for GPT-4o mini while using around 1,000 tokens per query versus over 23,000 for Full Context.

arXiv:2510.23730

RAG, Benchmark, Memory Architecture

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang et al.

· 2025

This approach combines a performance agent, critic agent, semantic memory, episodic memory, and memory retriever to turn label-grounded critiques into reusable supervision without parameter updates. On the Multi-Condition Ranking dataset with Mixtral 8x22B and o4-mini as critic, it reaches 85.6% accuracy, a gain of 24.8 percentage points over the EP_LABEL baseline at 60.8%.

arXiv:2510.19897
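
The gain reported in this entry is the difference between two accuracies, so it is measured in percentage points; the relative improvement over the baseline is a different number. The short Python sketch below works through both using the two accuracies quoted above. The function name `accuracy_gains` is hypothetical.

```python
def accuracy_gains(new_acc: float, base_acc: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_acc - base_acc
    relative = absolute / base_acc * 100
    return round(absolute, 1), round(relative, 1)


# Accuracies quoted in the summary: 85.6% versus the 60.8% baseline.
absolute_pp, relative_pct = accuracy_gains(85.6, 60.8)
print(absolute_pp, relative_pct)  # 24.8 40.8
```

So the 24.8 figure is an absolute difference, and the same result expressed relative to the baseline is roughly 40.8%.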

Benchmark, Long-Term Memory

Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Sangyeop Kim, Yohan Lee et al.

· 2025

PREMem builds long-term dialogue memory by combining Episodic Memory Extraction, Pre-Storage Memory Reasoning, semantic clustering, a persistent memory pool, and an inference phase over enriched memory fragments. PREMem reaches an LLM-as-a-judge score of 71.4 on LongMemEval with gpt-4.1 base, a gain of 15.5 points over HippoRAG 2 and 9.6 points over A-Mem.

arXiv:2509.10852

Benchmark, Memory Architecture

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim et al.

· 2025

WorldMM dynamically coordinates Episodic Memory, Semantic Memory, Visual Memory, an Adaptive Retrieval Agent, and a Response Agent to answer queries over hour- to week-long videos. On five long video QA benchmarks, WorldMM-GPT reaches 69.5% average accuracy, beating M3-Agent (55.1%) by 14.4 points and the best prior memory baseline, HippoRAG (57.0%), by 12.5 points.

arXiv:2512.02425