Deep Episodic Memory: Encoding, Recalling, and Predicting Episodic Experiences for Robot Action Execution

Authors: Jonas Rothfuss, Fabio Ferreira, Eren Erdal Aksoy et al.

arXiv 2018

TL;DR

Deep Episodic Memory pairs a convLSTM encoder with dual decoders to learn subsymbolic episodic encodings, reaching 45.55% first-match precision on ActivityNet and beating ResNet-50 Fisher Vectors by 13.24 points.


THE PROBLEM

Episodic memories need subsymbolic video encodings without labels

Deep Episodic Memory must learn from large video corpora, such as the 86,017 training clips of 20BN-something-something, without any class labels during training.

Without robust episodic encodings, robots cannot recall similar past actions or predict consequences, limiting planning and case-based reasoning in real-world manipulation.

HOW IT WORKS

Deep Episodic Memory — composite encoder-decoder with future prediction

Deep Episodic Memory combines an encoder network E, a reconstruction-decoder Dr, a prediction-decoder Dp, a latent vector V, and a matching-and-retrieval mechanism into one unsupervised system.

You can think of Deep Episodic Memory like a hippocampus that compresses video into a code, then two separate “projectors” replay the past and imagine the future.

This composite design lets Deep Episodic Memory reconstruct episodes, predict future frames, and compare actions in latent space, which a plain context window or classifier cannot.
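As a concrete illustration, here is a minimal PyTorch sketch of the composite layout. A per-frame CNN plus LSTM stands in for the paper's convLSTM stack, and the decoders emit feature sequences rather than pixels; layer sizes, names, and these simplifications are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CompositeEpisodicAE(nn.Module):
    """One encoder, two decoders: Dr replays the input episode,
    Dp predicts its continuation. A per-frame CNN + LSTM stands in
    for the paper's convLSTM stack; sizes are illustrative."""

    def __init__(self, latent=2000):
        super().__init__()
        self.cnn = nn.Sequential(              # per-frame feature extractor
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())          # -> 1024 dims
        self.enc = nn.LSTM(1024, latent // 2, batch_first=True)
        self.dec_r = nn.LSTM(latent, 1024, batch_first=True)  # reconstruction
        self.dec_p = nn.LSTM(latent, 1024, batch_first=True)  # prediction

    def encode(self, x):                       # x: (B, T, 3, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, (h, c) = self.enc(feats)
        # V concatenates the final hidden and cell states, as in the paper
        return torch.cat([h[-1], c[-1]], dim=1)

    def forward(self, x, t_future=5):
        v = self.encode(x)
        rep = lambda n: v.unsqueeze(1).expand(-1, n, -1)
        y_r, _ = self.dec_r(rep(x.shape[1]))   # replay the episode
        y_p, _ = self.dec_p(rep(t_future))     # imagine what follows
        # feature sequences; a deconv head would map them back to frames
        return v, y_r, y_p
```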

DIAGRAM

Episodic encoding and retrieval flow

This diagram shows how Deep Episodic Memory encodes a query video, retrieves similar episodes via cosine similarity in latent space, and returns matched episodes.

DIAGRAM

Training pipeline on ActivityNet and 20BN

This diagram shows how Deep Episodic Memory is trained on ActivityNet and 20BN with reconstruction and future prediction losses.

PROCESS

How Deep Episodic Memory Handles an Action Episode

  1. The Neural Network Model

    Deep Episodic Memory uses an encoder network E with conv and convLSTM layers to process the input frames and produce the latent vector V = hk ∥ ck, the concatenated hidden and cell states of the last convLSTM layer.

  2. Matching Visual Experiences in the Latent Space

    Deep Episodic Memory stores latent vectors Vi and compares a query Vq against them with cosine similarity to retrieve the n most similar episodes, as sketched in the code after this list.

  3. Frame Reconstruction and Future Frame Prediction

    Deep Episodic Memory forwards V to the reconstruction-decoder Dr to generate Yr and to the prediction-decoder Dp to generate Yp, training both with a combined mean-squared-error (Lmse) and gradient-difference (Lgd) loss.

  4. Robot Manipulation Learned From Episodic Memory

    Deep Episodic Memory encodes human demonstrations, retrieves similar manipulation episodes for the ARMAR-IIIa robot, and transfers the motion via dynamic movement primitives (a minimal DMP sketch appears under Why It Matters below).
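The retrieval step in item 2 is essentially nearest-neighbor search under cosine similarity after PCA. A minimal sketch with numpy and scikit-learn; only the 200 PCA components and the cosine metric come from the paper, while the memory layout and function names are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_memory(latents, n_components=200):
    """Fit PCA (200 components, as in the paper) on stored episode
    encodings V_i and L2-normalize so dot products are cosines."""
    pca = PCA(n_components=n_components).fit(latents)
    mem = pca.transform(latents)
    mem /= np.linalg.norm(mem, axis=1, keepdims=True)
    return pca, mem

def recall(pca, mem, v_query, n=5):
    """Return the n most cosine-similar stored episodes for a query V_q."""
    q = pca.transform(v_query[None])[0]
    q /= np.linalg.norm(q)
    sims = mem @ q                         # cosine similarity per episode
    order = np.argsort(-sims)[:n]
    return order, sims[order]

# Hypothetical usage: 10,000 stored episodes with 2000-dim latents
store = np.random.randn(10_000, 2000).astype(np.float32)
pca, mem = build_memory(store)
idx, sims = recall(pca, mem, store[0])     # episode 0 should top its own list
```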

KEY CONTRIBUTIONS

Key Contributions

  • New deep network for action frames

    Deep Episodic Memory introduces the encoder network E and a latent vector V of dimension 2000 to encode action frames into a low-dimensional subsymbolic space.

  • Reconstruction and future prediction

    Deep Episodic Memory uses the reconstruction-decoder Dr and prediction-decoder Dp with a combined Lmse and Lgd loss (η = 0.4) to reconstruct and predict video frames; a loss sketch follows this list.

  • Matching and retrieving visual episodes

    Deep Episodic Memory applies cosine similarity to latent vectors after PCA with 200 components, reaching 45.55% first-match precision on ActivityNet and 11.81% on 20BN.
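The combined loss from contribution 02 can be written out as below. A PyTorch sketch: the gradient-difference term follows the standard Mathieu-style formulation, and the exact way η splits the two terms is an assumption apart from the quoted η = 0.4.

```python
import torch
import torch.nn.functional as F

def gradient_difference_loss(y, y_hat):
    """Gradient-difference loss: penalize mismatched image gradients
    so decoded frames stay sharp instead of blurry."""
    dy_t = (y[..., 1:, :] - y[..., :-1, :]).abs()
    dy_p = (y_hat[..., 1:, :] - y_hat[..., :-1, :]).abs()
    dx_t = (y[..., :, 1:] - y[..., :, :-1]).abs()
    dx_p = (y_hat[..., :, 1:] - y_hat[..., :, :-1]).abs()
    return (dy_t - dy_p).abs().mean() + (dx_t - dx_p).abs().mean()

def composite_loss(y_r, x, y_p, x_future, eta=0.4):
    """Combined objective over reconstruction (Yr vs. input) and
    prediction (Yp vs. future frames); the (1 - eta)/eta split between
    Lmse and Lgd is an assumed weighting."""
    l_mse = F.mse_loss(y_r, x) + F.mse_loss(y_p, x_future)
    l_gd = gradient_difference_loss(x, y_r) + gradient_difference_loss(x_future, y_p)
    return (1 - eta) * l_mse + eta * l_gd
```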

RESULTS

By the Numbers

  • First-match precision (ActivityNet): 45.55% (+13.24 points over ResNet-50 FV)

  • mAP (ActivityNet): 28.18% (+4.95 points over ResNet-50 FV)

  • First-match precision (20BN): 11.81% (+5.73 points over ResNet-50 FV)

  • mAP (20BN): 8.32% (+4.34 points over ResNet-50 FV)

On ActivityNet and 20BN-something-something, where action retrieval is evaluated like document retrieval, Deep Episodic Memory shows that unsupervised episodic encodings can surpass Fisher Vector and LSTM baselines in both first-match precision and mean average precision.
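Treating retrieval as document retrieval means a stored episode counts as relevant when it shares the query's class. A sketch of both metrics under that reading; the paper's exact protocol (e.g. ranking depth) may differ:

```python
import numpy as np

# Hypothetical inputs: labels[i] is episode i's class; ranked[q] lists
# the stored-episode indices returned for query q (query itself excluded).

def precision_at_1(labels, ranked):
    """First-match precision: does the top retrieved episode
    share the query's class label?"""
    return float(np.mean([labels[r[0]] == labels[q] for q, r in enumerate(ranked)]))

def mean_average_precision(labels, ranked):
    """mAP: average precision per query, where episodes of the
    query's class are the relevant documents."""
    aps = []
    for q, r in enumerate(ranked):
        rel = np.array([labels[i] == labels[q] for i in r], dtype=float)
        if rel.sum() == 0:
            aps.append(0.0)
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec * rel).sum() / rel.sum())
    return float(np.mean(aps))
```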

BENCHMARK

Matching and retrieving visual episodes on ActivityNet

Precision of the first match for different encodings on ActivityNet.

KEY INSIGHT

The Counterintuitive Finding

Deep Episodic Memory trained without labels reaches 45.55% precision on ActivityNet, exceeding supervised feature pipelines like ResNet-50 Fisher Vectors at 32.31%.

This is surprising because the Fisher Vector baseline builds on a strong ImageNet-pretrained CNN, yet subsymbolic episodic encodings plus PCA capture action similarity more effectively than these hand-crafted descriptors.

WHY IT MATTERS

What this unlocks for the field

Deep Episodic Memory gives robots a unified way to encode, recall, and predict visual episodes using a single latent vector V for each experience.

Builders can now implement case-based reasoning over raw video, retrieving similar actions and transferring motion to new objects without dense labels or thousands of reinforcement-learning trials; the sketch below illustrates the motion-transfer primitive.
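For the motion-transfer step, retrieved trajectories are handed to dynamic movement primitives. A minimal 1-D discrete DMP sketch, not the ARMAR-IIIa implementation; the gains, Gaussian basis functions, and phase decay rate are illustrative assumptions:

```python
import numpy as np

def dmp_rollout(x0, g, weights, n_steps=200, alpha=25.0, beta=6.25, tau=1.0):
    """Minimal 1-D discrete DMP: a critically damped spring pulled
    toward goal g, shaped by a learned forcing term over a decaying
    phase variable s. Weights would be fit to a demonstrated motion."""
    dt = tau / n_steps
    centers = np.linspace(0, 1, len(weights))      # basis centers in phase
    widths = np.full(len(weights), len(weights) ** 2)
    x, v, s = x0, 0.0, 1.0                         # position, velocity, phase
    path = [x]
    for _ in range(n_steps):
        psi = np.exp(-widths * (s - centers) ** 2) # basis activations
        f = s * (g - x0) * (psi @ weights) / (psi.sum() + 1e-8)
        a = alpha * (beta * (g - x) - v) + f       # transformation system
        v += a * dt
        x += v * dt
        s += -2.0 * s * dt                         # canonical system decay
        path.append(x)
    return np.array(path)

# Hypothetical transfer: replay a retrieved reach motion toward a new goal;
# zero weights reduce to a plain spring that still converges to g.
traj = dmp_rollout(x0=0.0, g=0.3, weights=np.zeros(10))
```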


