Generalizable Episodic Memory for Deep Reinforcement Learning

Authors: Hao Hu, Jianing Ye, Guangxiang Zhu et al.

arXiv 2021

TL;DR

Generalizable Episodic Memory (GEM) trains a parametric memory network with implicit memory-based planning and a twin back-propagation process, boosting MuJoCo continuous-control returns beyond TD3 and SAC within 1M steps.


THE PROBLEM

Continuous domains where a state is never visited twice

Episodic memory methods usually update values only for exactly re-encountered states, which fails in continuous domains where the same state is essentially never visited twice.

In continuous control tasks like MuJoCo, this prevents effective trajectory aggregation, so agents cannot reuse past successful strategies and suffer from high sample complexity and weak generalization.
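To make the failure concrete, here is a minimal sketch (ours, not the paper's) of an exact-match episodic table. With real-valued state vectors as keys, a lookup on a new state essentially never hits, so returns never aggregate across trajectories; the update and lookup helpers below are hypothetical.

```python
import numpy as np

memory = {}  # hypothetical episodic table: exact state bytes -> best return

def update(state, ret):
    # Overwrites only when this exact float vector was seen before.
    key = state.tobytes()
    memory[key] = max(memory.get(key, float("-inf")), ret)

def lookup(state):
    # Hits only on an exact re-encounter of the full state vector.
    return memory.get(state.tobytes())

rng = np.random.default_rng(0)
update(rng.normal(size=17), ret=100.0)  # one successful visit
print(lookup(rng.normal(size=17)))      # None: a nearby state never matches
```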

HOW IT WORKS

Generalizable Episodic Memory with twin implicit planning

GEM combines a parametric memory network Mθ, implicit memory-based planning, a twin back-propagation process, and conservative single-step estimation to build a generalizable episodic memory.

You can view GEM as a learned virtual memory system: Mθ is the RAM, the tabular memory M is the disk, and twin back-propagation is the planning coprocessor.

This key mechanism of using twin networks for trajectory-wise double estimation lets GEM safely search over combinatorial rollouts, something an exact-match lookup table or a vanilla TD target cannot do.
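As a rough illustration, the backward pass over one stored trajectory could look like the sketch below. The callables q1, q2, and policy, and the exact cross selection-evaluation rule, are our assumptions for exposition rather than the authors' implementation: one value stream decides whether to keep rolling out or to bootstrap, and the chosen quantity is read from the twin stream.

```python
import numpy as np

def twin_backprop(rewards, next_states, q1, q2, policy, gamma=0.99):
    # Hedged sketch of twin back-propagation along one stored trajectory.
    # q1/q2 stand in for the twin critics, policy for the actor.
    T = len(rewards)
    R1, R2 = np.empty(T), np.empty(T)
    for t in reversed(range(T)):
        a = policy(next_states[t])
        v1, v2 = q1(next_states[t], a), q2(next_states[t], a)
        nxt1 = R1[t + 1] if t + 1 < T else -np.inf
        nxt2 = R2[t + 1] if t + 1 < T else -np.inf
        # Select "keep rolling out" vs "bootstrap" with one stream,
        # take the chosen value from the other (double estimation).
        R1[t] = rewards[t] + gamma * (nxt2 if nxt1 > v1 else v2)
        R2[t] = rewards[t] + gamma * (nxt1 if nxt2 > v2 else v1)
    return R1, R2  # enhanced returns R_t^(1), R_t^(2), written back to M
```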

DIAGRAM

Implicit memory based planning along trajectories

This diagram shows how GEM performs implicit memory-based planning and twin back-propagation along stored trajectories to compute the enhanced returns R_t.

DIAGRAM

Training loop and memory update schedule

This diagram shows GEM's overall training loop, including environment interaction, periodic memory updates, and actor-critic optimization.

PROCESS

How Generalizable Episodic Memory Handles a Control Episode

  1. 01

    Collect transitions in tabular memory M

    GEM interacts with the environment, storing each transition tuple (s, a, r, s′) into the tabular memory M for later planning.

  2. 02

    Update memory with twin back-propagation

    GEM runs Update Memory, using implicit memory-based planning and the twin back-propagation process to compute the enhanced returns R_t^(1,2).

  3. 03

    Train parametric network Mθ with regression

    GEM trains the parametric memory Mθ by minimizing a squared or asymmetric loss between Q_θ^(1,2)(s_t, a_t) and the enhanced returns from memory.

  4. 04

    Update actor by deterministic policy gradient

    GEM updates the actor π_φ using the deterministic policy gradient through Q_θ1, improving the policy based on the generalizable episodic values; a sketch of the full loop follows this list.
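A minimal sketch of the four steps as one loop, assuming a classic Gym-style env and hypothetical replay helpers (refresh_targets, sample_with_targets); the losses follow standard TD3-style updates, not the authors' exact code, and twin_backprop refers to the earlier sketch.

```python
import torch
import torch.nn.functional as F

def train(env, pi, q1, q2, opt_actor, opt_critic, replay,
          steps=1_000_000, update_memory_every=1000):
    state = env.reset()
    for step in range(steps):
        # 1) Collect a transition into the tabular memory M (replay here).
        action = pi(torch.as_tensor(state, dtype=torch.float32))
        next_state, reward, done, _ = env.step(action.detach().numpy())
        replay.add(state, action.detach(), reward, next_state, done)
        state = env.reset() if done else next_state

        # 2) Periodically recompute enhanced returns with twin back-prop.
        if step % update_memory_every == 0:
            replay.refresh_targets(twin_backprop)  # hypothetical helper

        # 3) Regress the twin critics onto the stored enhanced returns.
        s, a, R1, R2 = replay.sample_with_targets(batch_size=256)
        critic_loss = F.mse_loss(q1(s, a), R1) + F.mse_loss(q2(s, a), R2)
        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

        # 4) Deterministic policy gradient through the first critic.
        actor_loss = -q1(s, pi(s)).mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```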

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Generalizable Episodic Memory framework (GEM)

    GEM proposes a parametric memory Mθ plus implicit memory-based planning and twin back-propagation to aggregate returns across trajectories in continuous domains.

  • 02

    Twin back-propagation process (TBP)

    GEM introduces a twin back-propagation process that uses a double estimator over rollout lengths h to avoid trajectory-induced overestimation.

  • 03

    Theoretical analysis of GEM

    GEM is proven non-overestimating under unbiased Q estimates, convergent to Q* in deterministic MDPs, and bounded within 2µ/(1−γ) in near-deterministic MDPs (restated below).
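For reference, the near-deterministic bound from item 03 can be written as follows; this is our transcription of the summary statement above, not the paper's verbatim theorem, with µ bounding the deviation from a deterministic model:

```latex
% Transcription of the summary bound, not the paper's verbatim theorem.
\[
  \bigl\| Q_{\mathrm{GEM}} - Q^{*} \bigr\|_{\infty} \;\le\; \frac{2\mu}{1-\gamma}
\]
```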

RESULTS

By the Numbers

Average Return, Ant-v2

≈6000 score

≈+1500 over TD3

Average Return, HalfCheetah-v2

≈12000 score

vs TD3 and SAC within 1M steps

Average Return, Humanoid-v2

≈6000 score

higher stability and return than TD3+SIL

Atari sample budget

10M steps

GEM's curves dominate DQN, DDQN, and Dueling DQN

The MuJoCo benchmark suite with Ant-v2, HalfCheetah-v2, Humanoid-v2, Swimmer-v2, Walker2d-v2, and Hopper-v2 tests continuous-control performance and sample efficiency. The main result curves show GEM reaching higher returns than TD3, SAC, DDPG, and TD3+SIL within 1M steps, and better learning curves than DQN variants on six Atari games.


BENCHMARK

Learning curves on MuJoCo tasks compared with baseline algorithms

Average return at 1M environment steps on representative MuJoCo tasks.

KEY INSIGHT

The Counterintuitive Finding

GEM’s twin back-propagation process avoids overestimation even though it maximizes over rollout lengths, which usually increases bias in value methods.

This is surprising because taking maxima over many counterfactual trajectories should amplify noise, yet GEM’s double estimator structure keeps the expected value below the true maximum.
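A tiny numerical check (ours, not from the paper) of the underlying double-estimation effect: all true values are zero, yet the max over one noisy estimator is strongly biased upward, while selecting with one estimator and evaluating with an independent one is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_trials = 10, 100_000  # e.g. candidate rollout lengths h
e1 = rng.normal(size=(n_trials, n_candidates))  # noisy estimator 1
e2 = rng.normal(size=(n_trials, n_candidates))  # independent estimator 2

single = e1.max(axis=1)                              # max of noisy values
double = e2[np.arange(n_trials), e1.argmax(axis=1)]  # select with 1, evaluate with 2

print(f"single-estimator bias: {single.mean():+.3f}")  # ~ +1.54, overestimates
print(f"double-estimator bias: {double.mean():+.3f}")  # ~ +0.00, no overestimation
```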

WHY IT MATTERS

What this unlocks for the field

GEM unlocks generalizable episodic control in high-dimensional continuous spaces, where exact state re-encounters are effectively impossible.

Builders can now combine fast episodic reuse with safe multi step planning, enabling continuous control agents that learn from sparse successes without brittle tabular memories.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
