Episodic Memory Deep Q-Networks

Authors: Zichuan Lin, Tianqi Zhao, Guangwen Yang, Lintao Zhang

arXiv 2018

TL;DR

Episodic Memory Deep Q-Networks (EMDQN) adds a best-return episodic memory target to DQN’s loss, reaching a 528.4% mean human-normalized score at 40M frames versus 151.2% for DQN at the same budget.


THE PROBLEM

Sample-inefficient deep RL needs hundreds of millions of frames

DQN requires hundreds of millions of interactions with the environment to learn a good policy and generalize to unseen states, leading to low data efficiency.

On Atari and robotics control tasks, this sample inefficiency makes training expensive and slow, and forces DQN to use a small learning rate, slowing reward propagation and policy improvement.

HOW IT WORKS

Episodic Memory Deep Q-Networks — dual targets from striatum and hippocampus

EMDQN combines a parametric Qθ(s, a) network, trained against the usual inference target S, with an episodic memory target H read from a memory table built with random projection and a kd-tree.

You can think of S as a slow-learning striatum and H as a fast hippocampus that remembers the best returns and nudges Qθ toward those memories.

By learning from both S and H through the loss L = α(Qθ − S)² + β(Qθ − H)², EMDQN propagates rewards faster and regularizes Q-values in ways plain DQN’s one-step bootstrapping cannot.
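To make the dual-target update concrete, here is a minimal PyTorch sketch; the function name emdqn_loss, the tensor shapes, and the weights α = 1.0, β = 0.1 are illustrative assumptions rather than the paper’s exact implementation:

```python
import torch
import torch.nn.functional as F

def emdqn_loss(q_all, actions, td_target, em_target, alpha=1.0, beta=0.1):
    """L = alpha * (Q - S)^2 + beta * (Q - H)^2 over a sampled batch."""
    # Q_theta(s, a) for the actions actually taken.
    q_sa = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    td_term = F.mse_loss(q_sa, td_target)  # pull toward the bootstrapped target S
    em_term = F.mse_loss(q_sa, em_target)  # pull toward the best-return memory H
    return alpha * td_term + beta * em_term

# Toy batch: 4 transitions, 6 actions, random targets.
q_all = torch.randn(4, 6, requires_grad=True)
actions = torch.randint(0, 6, (4,))
loss = emdqn_loss(q_all, actions, torch.randn(4), torch.randn(4))
loss.backward()
```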

DIAGRAM

Episodic memory write and update within an episode

This diagram shows how EMDQN caches transitions, computes Monte Carlo returns, and updates the episodic memory table H in reverse order at episode end.
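As a rough sketch of that write path, assuming a plain Python dict stands in for the memory table H (the paper’s kd-tree lookup is omitted):

```python
def update_memory(table, episode, gamma=0.99):
    """Walk the cached episode backwards, accumulating the discounted
    Monte Carlo return G and keeping only the best return per slot."""
    g = 0.0
    for key, action, reward in reversed(episode):
        g = reward + gamma * g                    # return from this step onward
        slot = (key, action)
        if slot not in table or g > table[slot]:  # best-return rule: keep the max
            table[slot] = g
    return table

# episode holds (phi(s), a, r) tuples cached during the rollout.
H = update_memory({}, [("k0", 1, 0.0), ("k1", 0, 1.0)])
print(H)  # {('k1', 0): 1.0, ('k0', 1): 0.99}
```

The max rule is what makes H a best-return memory: in near-deterministic environments, the highest return ever observed from a state-action pair is a target the agent can actually reach again.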

DIAGRAM

Training and evaluation pipeline on Atari 2600

This diagram shows how EMDQN is trained for 40M and 200M frames on ALE and evaluated with human-normalized scores.

PROCESS

How Episodic Memory Deep Q-Networks Handles an Atari Training Run

  1. Interact with environment

    EMDQN observes state s, selects actions from Qθ(s, a), and collects transitions (s, a, r, s') while rolling out episodes on ALE.

  2. Write episodic memory

EMDQN projects states with a random projection φ, caches (φ(s), a, r) along each episode, and at episode end updates the memory table H in reverse order (see the projection sketch after this list).

  3. Update Q network

EMDQN samples mini-batches from the replay buffer D and minimizes L = α(Qθ − S)² + β(Qθ − H)², backpropagating gradients from both targets.

  4. Periodic evaluation

    EMDQN runs 30 evaluation episodes per epoch, computes human-normalized scores over 57 Atari games, and compares against DQN, NEC, and MFEC.
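The random projection φ referenced in step 02 can be sketched in a few lines of NumPy; the key dimension, the Gaussian matrix, and the rounding used to form a hashable key are illustrative assumptions (the paper pairs the projection with a kd-tree for lookup):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, KEY_DIM = 84 * 84 * 4, 4            # stacked Atari frames -> tiny key
A = rng.standard_normal((KEY_DIM, STATE_DIM))  # fixed Gaussian projection, drawn once

def phi(state):
    """Project a flattened state to a low-dimensional key; by the
    Johnson-Lindenstrauss lemma, distances are roughly preserved."""
    return tuple(np.round(A @ state, 2))       # rounded tuple -> hashable table key

key = phi(rng.standard_normal(STATE_DIM))
```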

KEY CONTRIBUTIONS

Key Contributions

  • Episodic Memory Deep Q-Networks

    EMDQN introduces a dual-target loss combining inference target S and episodic memory target H, enabling faster reward propagation and reduced variance compared to plain Qθ(s, a) updates.

  • Biologically inspired dual system

EMDQN explicitly models a striatum-like parametric decision system and a hippocampus-like non-parametric memory table, tuned via λ = β/α to interpolate between plain DQN and episodic control (see the sketch after this list).

  • Sample-efficient Atari performance

EMDQN reaches a 528.4% mean human-normalized score at 40M frames, surpassing both DQN at 200M frames (227.9%) and NEC at 40M frames (144.8%).
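A scalar sketch of how λ = β/α trades off the two targets; the values are illustrative, not from the paper:

```python
def dual_target_loss(q, s_target, h_target, alpha, beta):
    """Scalar version of L = alpha*(Q - S)^2 + beta*(Q - H)^2."""
    return alpha * (q - s_target) ** 2 + beta * (q - h_target) ** 2

# lambda = beta / alpha = 0 recovers plain DQN's TD loss ...
assert dual_target_loss(1.0, 0.5, 2.0, alpha=1.0, beta=0.0) == (1.0 - 0.5) ** 2
# ... while a large lambda lets the episodic memory target H dominate.
print(dual_target_loss(1.0, 0.5, 2.0, alpha=1.0, beta=10.0))  # 10.25
```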

RESULTS

By the Numbers

Mean human-normalized score

528.4%

+377.2 pp over DQN (40M)

Median human-normalized score

92.8%

+40.1 pp over DQN (40M)

Mean score vs. NEC

528.4%

+383.6 pp over NEC (40M)

Mean score vs. DQN at 200M

528.4%

+300.5 pp over DQN (200M)

On the 57-game Atari 2600 benchmark from ALE, EMDQN is trained for 40M frames and evaluated with human-normalized scores. The 528.4% mean score shows that EMDQN can match and exceed DQN and NEC while using only one-fifth of DQN’s 200M-frame budget.
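For reference, the human-normalized score behind these numbers follows the standard ALE convention of scaling between a random agent (0%) and a human expert (100%); a minimal sketch, with made-up game scores:

```python
def human_normalized(agent, random, human):
    """100% is human-level play, 0% is random play."""
    return 100.0 * (agent - random) / (human - random)

# Illustrative numbers only: an agent scoring 400 on a game where random
# play scores 100 and a human scores 300 sits at 150% human-normalized.
print(human_normalized(agent=400, random=100, human=300))  # 150.0
```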


BENCHMARK

Human-normalized scores at 40M and 200M frames across 57 Atari games

Mean human-normalized score across 57 Atari games for EMDQN and baseline agents.

KEY INSIGHT

The Counterintuitive Finding

EMDQN trained on only 40M frames reaches a 528.4% mean human-normalized score, while DQN trained on 200M frames reaches just 227.9%.

This is surprising because a simple DQN-style architecture, augmented only with an episodic memory target H, beats more data-hungry baselines despite using one-fifth of their environment interactions.

WHY IT MATTERS

What this unlocks for the field

EMDQN shows that combining a parametric Q network with a best-return episodic memory table can dramatically improve sample efficiency in near-deterministic environments.

Builders can now design RL agents that latch onto high-reward behaviors quickly, propagate rewards over long horizons, and mitigate Q-value overestimation without complex architectural changes.
