Category

Episodic Memory

Episodic memory in AI agents — storing, retrieving, and reasoning over past experiences and events.

28 papers

Benchmark

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

2026

APEX-EM combines a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor to store and reuse full procedural-episodic traces without changing model weights. On KGQAGen-10k, APEX-EM reaches 89.6% accuracy (95.3% CSR) versus 41.3% without memory and surpasses the GPT-4o w/ SP oracle at 84.9%.

Benchmark

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Jie Zhou et al.

2026

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL) to jointly evolve memory and policy with retrieval as an explicit action. On SciWorld, PROACTAGENT reaches 73.50% SR versus 55.50% for GRPO+Reflexion, while cutting interaction rounds from 27.52 to 18.38.

Benchmark · Agent Memory · Long-Term Memory

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Benjamin Stern, Peter Nadel

2026

Drawing on Memory uses dual-trace memory encoding, an evidence scoring gate, and a three-state retrieval protocol to store paired fact and scene traces in Letta’s archival memory. On LongMemEval-S, Drawing on Memory reaches 73.7% accuracy versus 53.5% for the fact-only C7-control baseline, a +20.2 percentage point gain concentrated in temporal, update, and multi-session questions.

Benchmark · Agent Memory

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang et al.

2026

Experience Compression Spectrum organizes Level 0 Raw Trace, Level 1 Episodic Memory, Level 2 Procedural Skill, and Level 3 Declarative Rule into a unified scaffold-level compression framework. Experience Compression Spectrum’s mapping of 20+ systems and <1% cross-citation rate shows that all existing agents fix a single compression level and never perform adaptive cross-level compression.

RAG · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova, Björn Ross, Alexandra Birch

2025

Evaluating Long-Term Memory for Long-Context Question Answering compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem memory components across semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 ranking of 1.83 for Llama 3.2-3B Instruct and 1.80 for GPT-4o mini while using around 1,000 tokens per query versus over 23,000 for Full Context.

Benchmark

Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Andrew Kyle Lampinen, Martin Engelcke et al.

2025

Latent learning uses oracle episodic retrieval, within-experience in-context learning, parametric learning, and latent learning benchmarks to study how reinstating full past episodes into context changes generalization. Latent learning shows that oracle retrieval solves latent tests like reversals and codebooks where parametric-only transformers stay near 0% despite strong performance on forward and in-context variants.

RAG · Benchmark · Memory Architecture

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang et al.

2025

Learning from Supervision with Semantic and Episodic Memory combines a performance agent, critic agent, semantic memory, episodic memory, and memory retriever to turn label-grounded critiques into reusable supervision without parameter updates. On the Multi-Condition Ranking dataset with Mixtral 8x22B and o4-mini as critic, Learning from Supervision with Semantic and Episodic Memory reaches 85.6% accuracy, a 24.8 percentage point gain over the EP_LABEL baseline at 60.8%.

Benchmark

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Huichi Zhou, Yihang Chen et al.

arXiv 2025 · 2025

Memento combines a Planner, Executor, Case Memory, Subtask Memory, and Tool Memory inside a memory-based MDP, learning a neural case-selection policy over a growing Case Bank instead of updating LLM weights. On GAIA, Memento achieves 87.88% Pass@3 on validation and 79.40% on the test set. On the DeepResearcher benchmarks, Memento exceeds DeepResearcher’s 51.8% F1 and 60.5% PM by 14.8 F1 points and 19.9 PM points.

Benchmark

Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents

Mathis Pink, Qinyuan Wu et al.

2025

Episodic Memory is the Missing Piece for Long-Term LLM Agents proposes an architecture where in-context memory, external memory, and parametric memory are coordinated via consolidation, encoding, and retrieval to realize long-term, instance-specific, contextual episodic traces. Episodic Memory is the Missing Piece for Long-Term LLM Agents contributes a five-property taxonomy, a three-way memory categorization (in-context, external, parametric), and a roadmap of six research questions instead of benchmark gains.

Benchmark · Long-Term Memory

Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Sangyeop Kim, Yohan Lee et al.

2025

PREMem builds long-term dialogue memory by combining Episodic Memory Extraction, Pre-Storage Memory Reasoning, semantic clustering, a persistent memory pool, and an inference phase over enriched memory fragments. PREMem reaches 71.4 LLM-as-a-judge on LongMemEval with gpt-4.1 base, a +15.5 gain over HippoRAG 2 and +9.6 over A-Mem.

Benchmark · Memory Architecture

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim et al.

2025

WorldMM dynamically coordinates Episodic Memory, Semantic Memory, Visual Memory, an Adaptive Retrieval Agent, and a Response Agent to answer queries over hour- to week-long videos. On five long video QA benchmarks, WorldMM-GPT reaches 69.5% average accuracy, beating M3-Agent’s 55.1% by 14.4 points and the best prior memory baseline HippoRAG’s 57.0% by 12.5 points.

Benchmark

AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Petr Anokhin, Nikita Semenov et al.

2024

AriGraph builds a joint world model by combining semantic memory, episodic memory, semantic search, episodic search, and the Ariadne cognitive architecture into a single evolving graph. On NetHack, AriGraph lets Ariadne reach a score of 593.00 with room-only observations, compared to 341.67 for NetPlay with the same input and 675.33 for NetPlay with full level observations.

Benchmark

Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Hyungho Na, Yunkyeong Seo, Il-chul Moon

2024

EMU combines a semantic memory embedding learned via a deterministic conditional autoencoder with an episodic incentive built from the desirability of states in the episodic buffer to guide cooperative MARL. EMU improves learning speed and final win rates on StarCraft II SMAC and Google Research Football compared to QPLEX, CDS, and EMC, especially on super-hard maps.

Benchmark

Human-inspired Episodic Memory for Infinite Context LLMs

Zafeirios Fountas, Martin A Benfeghoul et al.

2024

EM-LLM organises long token streams into episodic events via surprise-based segmentation, boundary refinement, and a two-stage memory retrieval with similarity and contiguity buffers. On LongBench with LLaMA-3.1-8B, EM-LLM reaches 51.58% average score versus 39.3% for full-context processing and 36.44% for RAG.
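The surprise-based segmentation step can be illustrated with a small sketch. It assumes per-token surprise values (for example, negative log-likelihoods) are already available, and it opens a new event whenever the current surprise exceeds the mean plus `gamma` standard deviations of a trailing window. The function name, `window`, and `gamma` are hypothetical choices for illustration, and the boundary refinement and retrieval stages are not modeled here.

```python
from statistics import mean, pstdev


def segment_by_surprise(surprise, window=4, gamma=1.0):
    """Split a token stream into events at high-surprise positions.

    surprise: list of per-token surprise values (assumed precomputed).
    window, gamma: illustrative hyperparameters, not values from the paper.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    if not surprise:
        return []
    boundaries = []
    for t in range(1, len(surprise)):
        recent = surprise[max(0, t - window):t]
        # Threshold adapts to the local statistics of the trailing window.
        threshold = mean(recent) + gamma * pstdev(recent)
        if surprise[t] > threshold:
            boundaries.append(t)
    events, start = [], 0
    for b in boundaries:
        events.append((start, b))
        start = b
    events.append((start, len(surprise)))
    return events


# A single spike at index 4 opens a second event.
events = segment_by_surprise([1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0])
```

Because the threshold is relative to recent tokens, a spike raises the bar for the positions that follow it, which keeps one surprising token from fragmenting the stream into many short events.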

Benchmark

Larimar: Large Language Models with Episodic Memory Control

Payel Das, Subhajit Chaudhury et al.

2024

Larimar couples a BERT-large encoder, a deterministic hierarchical memory module, a scope detector, and a GPT2-large or GPTJ-6B decoder via a learned projection W_M to perform memory-conditioned generation. On CounterFact and ZsRE, Larimar attains up to 100.0% edit success and 0.97 edit retention rate while being 4–10× faster than ROME and GRACE.

Benchmark

Empowering Working Memory for Large Language Model Agents

Jing Guo, Nan Li et al.

2023

Empowering Working Memory for Large Language Model Agents introduces a Working Memory Hub, Episodic Buffer, Interaction History Window, Central Processor, and External Environment Interface to give LLM agents persistent, structured working and episodic memory. Empowering Working Memory for Large Language Model Agents is a conceptual blueprint rather than a benchmarked system, so it reports no quantitative results against specific baselines.

Benchmark

Episodic Memory Question Answering

Samyak Datta, Sameer Dharur et al.

2022

Episodic Memory Question Answering combines allocentric top-down semantic features, spatiotemporal memory, and a LingUNet-based question-answering model to localize answers on scene floorplans from egocentric RGB-D tours. On the EMQA benchmark built from Matterport3D tours, Episodic Memory Question Answering with temporal features achieves 29.11 IoU and 62.27 recall on top-down maps, improving over SMNetDecoder by 2.19 IoU and 18.41 recall.

Benchmark

Generalizable Episodic Memory for Deep Reinforcement Learning

Hao Hu, Jianing Ye et al.

arXiv 2021 · 2021

Generalizable Episodic Memory (GEM) combines a parametric network Mθ, implicit memory-based planning, a twin back-propagation process, and conservative single-step estimation to generalize episodic returns across trajectories. GEM achieves higher average returns than TD3, SAC, DDPG, and TD3+SIL on MuJoCo tasks such as Ant-v2, HalfCheetah-v2, and Humanoid-v2 within 1M environment steps.

Benchmark

Offline Reinforcement Learning with Value-based Episodic Memory

Xiaoteng Ma, Yiqin Yang et al.

arXiv 2021 · 2021

Value-based Episodic Memory (VEM) uses Expectile V-Learning (EVL), Implicit Memory-based Planning, and Generalized Advantage-weighted Learning to learn conservative V-values and plan along offline trajectories. On the D4RL benchmark, VEM attains 87.5 on antmaze-umaze and 128.3 on adroit-hammer-expert, improving over BAIL and CQL on most AntMaze and Adroit tasks.

Benchmark

Solving Continuous Control with Episodic Memory

Igor Kuznetsov, Andrey Filchenkov

arXiv 2021 · 2021

Episodic Memory Actor Critic (EMAC) augments DDPG with a Memory Module, Episodic-based Experience Replay Prioritization, and a modified critic objective that blends Bellman targets with retrieved Monte Carlo returns. On OpenAI Gym continuous control, EMAC reaches 2236.88 ± 808 average return on Walker2d-v3, beating TD3 (1008.32) by +1228.56 and SAC (1787.28) by +449.60 in the 200,000-step small-data regime.
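One way to read "blends Bellman targets with retrieved Monte Carlo returns" is as a weighted sum of two squared errors for a single transition. The sketch below assumes that form; the function names, the mixing weight `alpha`, and the default `gamma` are hypothetical and not taken from the paper, and the memory lookup that produces `mc_return` is left out.

```python
def bellman_target(reward, next_q, done, gamma=0.99):
    """One-step bootstrapped target; no bootstrap on terminal transitions."""
    return reward + (0.0 if done else gamma * next_q)


def blended_critic_loss(q, reward, next_q, done, mc_return,
                        alpha=0.1, gamma=0.99):
    """Squared-error critic loss for one transition.

    q: current critic estimate Q(s, a).
    mc_return: Monte Carlo return retrieved from episodic memory (assumed given).
    alpha: illustrative weight on the memory term; alpha=0 recovers plain TD.
    """
    td_error = (q - bellman_target(reward, next_q, done, gamma)) ** 2
    mc_error = (q - mc_return) ** 2
    return (1.0 - alpha) * td_error + alpha * mc_error


# Terminal step: TD error is 0, memory error is (1 - 3)^2 = 4, so loss is 2.0.
loss = blended_critic_loss(q=1.0, reward=1.0, next_q=0.0, done=True,
                           mc_return=3.0, alpha=0.5)
```

The memory term pulls the critic toward returns that were actually observed, which matters most early in training when bootstrapped targets are still unreliable.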

Benchmark

Episodic Memory in Lifelong Language Learning

Cyprien de Masson d'Autume, Sebastian Ruder et al.

arXiv 2019 · 2019

Episodic Memory in Lifelong Language Learning combines an example encoder, task decoder, and episodic memory for sparse experience replay and Memory-based Parameter Adaptation (MBPA++). Episodic Memory in Lifelong Language Learning attains 70.6 average classification accuracy vs 66.9 for A-GEM and 62.4 QA F1 vs 57.9 for REPLAY on concatenated multi-dataset streams.

Benchmark

Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

Moonsu Han, Minki Kang et al.

arXiv 2019 · 2019

Episodic Memory Reader combines a Data Encoder, Memory Encoder (EMR-Independent, EMR-biGRU, EMR-Transformer), Value Network, external memory, and QA solver to learn which streaming items to keep. On TriviaQA, Episodic Memory Reader with EMR-biGRU achieves 57.57 F1 versus 50.10 for LIFO, and on TVQA Episodic Memory Reader reaches about 65% accuracy with 60 memory entries versus roughly 55% for LIFO.

Benchmark

Generalization of Reinforcement Learners with Working and Episodic Memory

Meire Fortunato, Melissa Tan et al.

arXiv 2019 · 2019

Memory Recall Agent (MRA) integrates a pixel‑input convolutional residual network, an LSTM working memory, a slot‑based episodic memory, an auxiliary contrastive loss, and jumpy backpropagation into a single reinforcement learning agent. Across the Memory Tasks Suite with train/holdout splits, Memory Recall Agent (MRA) attains the highest average human‑normalized performance compared to LSTM‑only IMPALA and other ablations, especially on tasks requiring episodic recall.

Benchmark

On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach et al.

arXiv 2019 · 2019

On Tiny Episodic Memories in Continual Learning combines Experience Replay, Reservoir Sampling, Ring Buffer, k-Means, and Mean of Features memory writing to jointly train on current-task data and a tiny episodic memory. On Split CIFAR with only 1 example per class, On Tiny Episodic Memories in Continual Learning reaches about 0.56 average accuracy, a +15.6 percentage point gain over FINETUNE and +15 percentage points over EWC.
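Of the memory-writing strategies listed, reservoir sampling is the simplest to show in code. The sketch below is a minimal version under stated assumptions: the class and function names are hypothetical, examples are opaque Python objects, and the training step itself is omitted. Each incoming example ends up in the fixed-size memory with equal probability, without knowing the stream length in advance.

```python
import random


class ReservoirMemory:
    """Fixed-size episodic memory filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of examples offered so far
        self.rng = random.Random(seed)

    def write(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def replay_batch(current_batch, memory, k):
    """Join current-task examples with up to k replayed ones, then store them."""
    batch = list(current_batch) + memory.sample(k)
    for example in current_batch:
        memory.write(example)
    return batch


memory = ReservoirMemory(capacity=5, seed=1)
for i in range(100):
    memory.write(i)
batch = replay_batch([100, 101], memory, k=3)
```

Sampling the replayed examples before writing the current batch keeps the replay portion drawn from past data only.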

Benchmark

Deep Episodic Memory: Encoding, Recalling, and Predicting Episodic Experiences for Robot Action Execution

Jonas Rothfuss, Fabio Ferreira et al.

arXiv 2018 · 2018

Deep Episodic Memory uses an encoder network E, reconstruction-decoder Dr, prediction-decoder Dp, latent vector V, and a matching and retrieval mechanism to turn raw video into episodic encodings that can be reconstructed and predicted. On ActivityNet, Deep Episodic Memory with PCA achieves 45.55% first-match precision versus 32.31% for ResNet-50 Fisher Vectors, a +13.24 percentage point gain.

Benchmark

Episodic Memory Deep Q-Networks

Zichuan Lin, Tianqi Zhao et al.

arXiv 2018 · 2018

Episodic Memory Deep Q-Networks (EMDQN) augments Qθ(s, a) with an inference target S, an episodic memory target H, and a memory table built via random projection and kd-tree lookup. On 57 Atari games at 40M frames, EMDQN achieves a 528.4% mean human-normalized score versus 151.2% for DQN and 144.8% for NEC.

Benchmark

Gradient Episodic Memory for Continual Learning

David Lopez-Paz, Marc'Aurelio Ranzato

arXiv 2017 · 2017

Gradient Episodic Memory stores task examples in episodic memory Mt, constrains updates via inequality constraints on past-task losses, and solves a small quadratic program (GEM QP) to project gradients. Gradient Episodic Memory achieves 0.654 ACC on Incremental CIFAR100 with 5,120 memory slots, compared to 0.508 ACC for iCARL.
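The gradient projection can be sketched for the special case of a single constraint, where the quadratic program has a closed-form solution (the simplification later adopted by A-GEM). This is not the full GEM QP, which handles one constraint per past task and needs a QP solver. Gradients are plain Python lists here, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_gradient(g, g_ref):
    """Project g so that it does not increase the loss on memory examples.

    g: proposed gradient on the current task.
    g_ref: gradient computed on examples from episodic memory.
    If the two gradients do not conflict (inner product >= 0), g is unchanged;
    otherwise the component of g along g_ref is removed.
    """
    inner = dot(g, g_ref)
    if inner >= 0:
        return list(g)
    scale = inner / dot(g_ref, g_ref)
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]


# Conflicting case: the component opposing g_ref is removed.
projected = project_gradient([1.0, -1.0], [0.0, 1.0])
```

After projection the inner product with the memory gradient is zero, so to first order the update no longer raises the loss on the stored examples.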