Solving Continuous Control with Episodic Memory

Authors: Igor Kuznetsov, Andrey Filchenkov

arXiv 2021

TL;DR

Episodic Memory Actor Critic (EMAC) adds a Monte Carlo episodic memory term to the critic loss and achieves 2236.88 average return on Walker2d-v3 vs 1008.32 for TD3 (+1228.56).


THE PROBLEM

Continuous control agents waste rare high-return experiences in large action spaces

Episodic memory methods previously improved sample efficiency only in discrete action problems with few actions and many repeating states.

In high-dimensional continuous control, DDPG and related actor-critic systems struggle to reuse rare but promising trajectories, and additionally suffer from Q-value overestimation, which slows learning.

HOW IT WORKS

Episodic Memory Actor Critic objective and replay design

EMAC introduces a Memory Module, Episodic-based Experience Replay Prioritization, and a modified critic objective on top of DDPG to inject Monte Carlo returns into training.

You can think of the Memory Module as a hippocampus-like cache that stores compressed state-action keys alongside their long-term returns, while the replay buffer acts like working memory.

By blending Bellman targets with retrieved episodic returns, EMAC gives the critic a pessimistic second opinion that a standard critic, trained only on recent transitions, cannot form on its own.
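The "retrieved episodic returns" come from a nearest-neighbor lookup over projected keys. Here is a minimal numpy sketch of that lookup; the dimensions, the neighbor count, and the inverse-distance weighting are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact settings).
STATE_DIM, ACTION_DIM, KEY_DIM, K = 17, 6, 4, 5

# Fixed random projection: key = P @ concat(state, action).
P = rng.normal(size=(KEY_DIM, STATE_DIM + ACTION_DIM))

# Toy memory of previously stored keys and their Monte Carlo returns.
memory_keys = rng.normal(size=(1000, KEY_DIM))
memory_returns = rng.normal(size=1000)

def episodic_estimate(state, action, k=K):
    """Distance-weighted average of the k nearest stored returns."""
    key = P @ np.concatenate([state, action])
    dists = np.linalg.norm(memory_keys - key, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-6)  # closer neighbors weigh more
    return np.average(memory_returns[nearest], weights=weights)
```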

DIAGRAM

Memory lookup and Q estimate fusion pipeline

This diagram shows how EMAC projects state-action pairs, retrieves nearest episodic returns, and fuses them with Bellman targets in the critic objective.

DIAGRAM

Training and evaluation pipeline on OpenAI Gym

This diagram shows how EMAC interacts with OpenAI Gym environments, fills the replay buffer and episodic memory, and runs evaluation episodes every 1000 steps across 200000 total training steps.

PROCESS

How EMAC Handles a Training Episode

  1. Memory Module

    EMAC uses the Memory Module to store random-projection keys of concatenated state-action pairs together with their true discounted Monte Carlo returns at the end of each episode (see the first sketch after this list).

  2. Episodic-based Experience Replay Prioritization

    EMAC assigns sampling priorities P(i) from stored Monte Carlo returns raised to an exponent β, increasing reuse of high-return transitions during off-policy updates (see the second sketch after this list).

  3. Alleviating Q-value Overestimation

    EMAC blends Bellman targets and episodic Monte Carlo estimates in the critic loss JQ with coefficient α, penalizing optimistic critic values with the pessimistic episodic returns (see the third sketch after this list).

  4. EMAC

    EMAC runs DDPG-style actor-critic updates with the modified critic objective and prioritized replay, periodically evaluating on OpenAI Gym domains for 200000 time steps (see the final sketch after this list).
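Step 1 in a minimal Python sketch: when an episode finishes, a backward pass turns the reward sequence into discounted Monte Carlo returns, and each step's concatenated state-action vector is projected into a compact key (the matrix P from the earlier lookup sketch). The discount factor and list-based storage are assumptions of this sketch.

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor

def episode_to_memory(states, actions, rewards, P):
    """Compute discounted Monte Carlo returns G_t = r_t + GAMMA * G_{t+1}
    for every step of a finished episode and pair each with a
    random-projection key of the concatenated (state, action) vector."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # backward pass over the episode
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    keys = [P @ np.concatenate([s, a]) for s, a in zip(states, actions)]
    return list(zip(keys, returns))  # entries to append to the memory table
```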
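Step 2, the return-based prioritization: transitions are sampled in proportion to their stored Monte Carlo returns raised to β. The min-shift below, which keeps priorities positive when returns are negative, is an assumption of this sketch.

```python
import numpy as np

def sample_batch_indices(mc_returns, batch_size, beta=0.5,
                         rng=np.random.default_rng()):
    """Sample replay indices with P(i) proportional to a shifted
    Monte Carlo return raised to the exponent beta."""
    priorities = (mc_returns - mc_returns.min() + 1e-6) ** beta
    probs = priorities / priorities.sum()
    return rng.choice(len(mc_returns), size=batch_size, p=probs)
```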
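Step 3, one plausible reading of the blended critic objective JQ: a standard squared TD error plus an α-weighted penalty pulling the critic toward the episodic estimate. The α and γ defaults are placeholders, and terminal-state masking is omitted for brevity.

```python
import numpy as np

def critic_loss(q_pred, rewards, q_next, q_mc, alpha=0.1, gamma=0.99):
    """Squared TD error plus an alpha-weighted penalty that pulls the
    critic toward the pessimistic episodic Monte Carlo estimate."""
    td_target = rewards + gamma * q_next  # one-step Bellman target
    td_term = (q_pred - td_target) ** 2
    episodic_term = (q_pred - q_mc) ** 2
    return np.mean(td_term + alpha * episodic_term)
```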
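Step 4, the overall loop tying the pieces together. Every method on the env, agent, memory, and replay objects below is a hypothetical placeholder for a DDPG-style implementation, and env.step is simplified to a three-value return.

```python
TOTAL_STEPS, EVAL_EVERY = 200_000, 1_000

def train(env, agent, memory, replay):
    """DDPG-style loop with episodic memory writes at episode boundaries
    and periodic evaluation. All object interfaces are hypothetical."""
    state, step, episode = env.reset(), 0, []
    while step < TOTAL_STEPS:
        action = agent.act(state)
        next_state, reward, done = env.step(action)  # simplified signature
        episode.append((state, action, reward))
        replay.add(state, action, reward, next_state, done)
        agent.update(replay, memory)  # critic uses the blended loss above
        if done:
            memory.write_episode(episode)  # MC returns computed at episode end
            episode, state = [], env.reset()
        else:
            state = next_state
        step += 1
        if step % EVAL_EVERY == 0:
            agent.evaluate(env)
```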

KEY CONTRIBUTIONS

Key Contributions

  • Episodic Memory Actor Critic

    EMAC augments DDPG with a Memory Module that stores projected state action keys and Monte Carlo returns, modifying the critic loss with an α weighted episodic term.

  • Alleviating Q-value Overestimation

    EMAC shows that adding episodic Monte Carlo returns as a second estimate reduces Q value overestimation compared to TD3, yielding more realistic critic predictions during training.

  • Episodic-based Experience Replay Prioritization

    EMAC introduces replay prioritization based on episodic returns with exponent β = 0.5, improving average returns over the non-prioritized EMAC NoPr variant on multiple environments.

RESULTS

By the Numbers

Walker2d-v3 average return: 2236.88 (+1228.56 over TD3)

Hopper-v3 average return: 1969.16 (+979.2 over TD3)

Swimmer-v3 average return: 80.53 (+41.62 over TD3)

InvertedDoublePendulum-v2 average return: 9332.56 (+26.75 over TD3)

On OpenAI Gym continuous control with 200000 time steps, EMAC is compared against DDPG, TD3, SAC, and EMAC NoPr. These results show that EMAC consistently achieves higher average returns, especially on Walker2d-v3 and Hopper-v3, in the small data regime.

BENCHMARK


Average return on Walker2d-v3 over 10 trials of 200000 time steps each.

KEY INSIGHT

The Counterintuitive Finding

EMAC uses a pessimistic episodic Monte Carlo estimate, yet still achieves 2236.88 average return on Walker2d-v3, beating SAC at 1787.28.

This is surprising because one might expect pessimistic targets to slow learning, but EMAC shows they instead reduce harmful Q value overestimation and accelerate convergence.

WHY IT MATTERS

What this unlocks for the field

EMAC demonstrates that episodic memory can be integrated into continuous control actor critic algorithms to improve sample efficiency and stabilize Q estimates.

Builders can now design continuous control agents that reuse rare high-return trajectories via episodic memory, without changing the online policy architecture or relying solely on TD targets.
