Solving Continuous Control with Episodic Memory

Authors: Igor Kuznetsov, Andrey Filchenkov

arXiv 2021

TL;DR

Episodic Memory Actor Critic (EMAC) adds a Monte Carlo episodic memory term to the critic loss and achieves 2236.88 average return on Walker2d-v3 vs 1008.32 for TD3 (+1228.56).


THE PROBLEM

Continuous control agents waste rare high-return experiences in large action spaces

Episodic memory methods previously improved sample efficiency only in discrete action problems with few actions and many repeating states.

In high-dimensional continuous control, DDPG and related actor-critic systems struggle to reuse rare but promising trajectories, and additionally suffer from Q-value overestimation, which slows learning.

HOW IT WORKS

Episodic Memory Actor Critic objective and replay design

EMAC introduces a Memory Module, Episodic-based Experience Replay Prioritization, and a modified critic objective on top of DDPG to inject Monte Carlo returns into training.

You can think of the Memory Module as a hippocampus-like cache that stores compressed state-action keys alongside their long-term returns, while the replay buffer acts like working memory.

By blending Bellman targets with retrieved episodic returns, EMAC gives the critic a pessimistic second opinion that a standard critic, trained only on recent transitions, cannot form on its own.
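The "retrieved episodic returns" come from a nearest-neighbor lookup over projected keys. Here is a minimal numpy sketch of that lookup; the dimensions, the neighbor count, and the inverse-distance weighting are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact settings).
STATE_DIM, ACTION_DIM, KEY_DIM, K = 17, 6, 4, 5

# Fixed random projection: key = P @ concat(state, action).
P = rng.normal(size=(KEY_DIM, STATE_DIM + ACTION_DIM))

# Toy memory of previously stored keys and their Monte Carlo returns.
memory_keys = rng.normal(size=(1000, KEY_DIM))
memory_returns = rng.normal(size=1000)

def episodic_estimate(state, action, k=K):
    """Distance-weighted average of the k nearest stored returns."""
    key = P @ np.concatenate([state, action])
    dists = np.linalg.norm(memory_keys - key, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-6)  # closer neighbors weigh more
    return np.average(memory_returns[nearest], weights=weights)
```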

DIAGRAM

Memory lookup and Q estimate fusion pipeline

This diagram shows how EMAC projects state-action pairs, retrieves nearest episodic returns, and fuses them with Bellman targets in the critic objective.

DIAGRAM

Training and evaluation pipeline on OpenAI Gym

This diagram shows how EMAC interacts with OpenAI Gym environments, fills the replay buffer and episodic memory, and runs evaluation episodes every 1000 steps across 200000 total training steps.

PROCESS

How EMAC Handles a Training Episode

  1. Memory Module

    EMAC uses the Memory Module to store random-projection keys of concatenated state-action pairs together with their true discounted Monte Carlo returns at the end of each episode (see the first sketch after this list).

  2. Episodic-based Experience Replay Prioritization

    EMAC assigns sampling priorities P(i) from stored Monte Carlo returns raised to an exponent β, increasing reuse of high-return transitions during off-policy updates (see the second sketch after this list).

  3. Alleviating Q-value Overestimation

    EMAC blends Bellman targets and episodic Monte Carlo estimates in the critic loss JQ with coefficient α, penalizing optimistic critic values with the pessimistic episodic returns (see the third sketch after this list).

  4. EMAC

    EMAC runs DDPG-style actor-critic updates with the modified critic objective and prioritized replay, periodically evaluating on OpenAI Gym domains for 200000 time steps (see the final sketch after this list).
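Step 1 in a minimal Python sketch: when an episode finishes, a backward pass turns the reward sequence into discounted Monte Carlo returns, and each step's concatenated state-action vector is projected into a compact key (the matrix P from the earlier lookup sketch). The discount factor and list-based storage are assumptions of this sketch.

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor

def episode_to_memory(states, actions, rewards, P):
    """Compute discounted Monte Carlo returns G_t = r_t + GAMMA * G_{t+1}
    for every step of a finished episode and pair each with a
    random-projection key of the concatenated (state, action) vector."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # backward pass over the episode
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    keys = [P @ np.concatenate([s, a]) for s, a in zip(states, actions)]
    return list(zip(keys, returns))  # entries to append to the memory table
```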
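Step 2, the return-based prioritization: transitions are sampled in proportion to their stored Monte Carlo returns raised to β. The min-shift below, which keeps priorities positive when returns are negative, is an assumption of this sketch.

```python
import numpy as np

def sample_batch_indices(mc_returns, batch_size, beta=0.5,
                         rng=np.random.default_rng()):
    """Sample replay indices with P(i) proportional to a shifted
    Monte Carlo return raised to the exponent beta."""
    priorities = (mc_returns - mc_returns.min() + 1e-6) ** beta
    probs = priorities / priorities.sum()
    return rng.choice(len(mc_returns), size=batch_size, p=probs)
```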
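Step 3, one plausible reading of the blended critic objective JQ: a standard squared TD error plus an α-weighted penalty pulling the critic toward the episodic estimate. The α and γ defaults are placeholders, and terminal-state masking is omitted for brevity.

```python
import numpy as np

def critic_loss(q_pred, rewards, q_next, q_mc, alpha=0.1, gamma=0.99):
    """Squared TD error plus an alpha-weighted penalty that pulls the
    critic toward the pessimistic episodic Monte Carlo estimate."""
    td_target = rewards + gamma * q_next  # one-step Bellman target
    td_term = (q_pred - td_target) ** 2
    episodic_term = (q_pred - q_mc) ** 2
    return np.mean(td_term + alpha * episodic_term)
```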
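Step 4, the overall loop tying the pieces together. Every method on the env, agent, memory, and replay objects below is a hypothetical placeholder for a DDPG-style implementation, and env.step is simplified to a three-value return.

```python
TOTAL_STEPS, EVAL_EVERY = 200_000, 1_000

def train(env, agent, memory, replay):
    """DDPG-style loop with episodic memory writes at episode boundaries
    and periodic evaluation. All object interfaces are hypothetical."""
    state, step, episode = env.reset(), 0, []
    while step < TOTAL_STEPS:
        action = agent.act(state)
        next_state, reward, done = env.step(action)  # simplified signature
        episode.append((state, action, reward))
        replay.add(state, action, reward, next_state, done)
        agent.update(replay, memory)  # critic uses the blended loss above
        if done:
            memory.write_episode(episode)  # MC returns computed at episode end
            episode, state = [], env.reset()
        else:
            state = next_state
        step += 1
        if step % EVAL_EVERY == 0:
            agent.evaluate(env)
```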

KEY CONTRIBUTIONS

Key Contributions

  • Episodic Memory Actor Critic

    EMAC augments DDPG with a Memory Module that stores projected state action keys and Monte Carlo returns, modifying the critic loss with an α weighted episodic term.

  • Alleviating Q-value Overestimation

    EMAC shows that adding episodic Monte Carlo returns as a second estimate reduces Q value overestimation compared to TD3, yielding more realistic critic predictions during training.

  • Episodic-based Experience Replay Prioritization

    EMAC introduces replay prioritization based on episodic returns with exponent β = 0.5, improving average returns over the non-prioritized EMAC NoPr variant on multiple environments.

RESULTS

By the Numbers

Walker2d-v3 average return: 2236.88 (+1228.56 over TD3)

Hopper-v3 average return: 1969.16 (+979.2 over TD3)

Swimmer-v3 average return: 80.53 (+41.62 over TD3)

InvertedDoublePendulum-v2 average return: 9332.56 (+26.75 over TD3)

On OpenAI Gym continuous control with 200000 time steps, EMAC is compared against DDPG, TD3, SAC, and EMAC NoPr. These results show that EMAC consistently achieves higher average returns, especially on Walker2d-v3 and Hopper-v3, in the small data regime.

BENCHMARK


Average return on Walker2d-v3 over 10 trials of 200000 time steps each.

KEY INSIGHT

The Counterintuitive Finding

EMAC uses a pessimistic episodic Monte Carlo estimate, yet still achieves 2236.88 average return on Walker2d-v3, beating SAC at 1787.28.

This is surprising because one might expect pessimistic targets to slow learning, but EMAC shows they instead reduce harmful Q value overestimation and accelerate convergence.

WHY IT MATTERS

What this unlocks for the field

EMAC demonstrates that episodic memory can be integrated into continuous control actor critic algorithms to improve sample efficiency and stabilize Q estimates.

Builders can now design continuous control agents that reuse rare high-return trajectories via episodic memory, without changing the online policy architecture or relying solely on TD targets.
