Offline Reinforcement Learning with Value-based Episodic Memory

Authors: Xiaoteng Ma, Yiqin Yang, Hao Hu et al.

arXiv 2021

TL;DR

Value-based Episodic Memory (VEM) combines Expectile V-Learning and implicit episodic planning to reach 87.5 on antmaze-umaze and 128.3 on adroit-hammer-expert, surpassing prior offline RL baselines.



THE PROBLEM

Offline RL suffers from extrapolation error on unseen actions

Offline reinforcement learning must evaluate actions outside the dataset’s support, where value networks suffer from extrapolation error that is magnified through bootstrapping.

In safety-critical or sparse-reward tasks, this causes severe estimation errors and unreliable policies, forcing Q-based offline RL to add heavy regularization or constraints.

HOW IT WORKS

Value-based Episodic Memory with Expectile V-Learning

Expectile V-Learning, Implicit Memory-based Planning, and Generalized Advantage-weighted Learning form the core of Value-based Episodic Memory (VEM), keeping learning within dataset support.

You can think of EVL as adjusting a slider between behavior cloning and optimal value learning, while episodic planning acts like a trajectory-aware cache that reuses the best returns.

This combination lets VEM avoid unseen actions, accelerate value propagation along offline trajectories, and provide stable advantage estimates that a standard offline Q-learner, which must bootstrap through out-of-distribution actions, cannot.
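To make the slider concrete, here is a minimal sketch of an expectile-style value loss in PyTorch. The function name, batch keys, and network interface are assumptions for illustration; the paper's exact objective is its Equation 4.

```python
import torch

def expectile_v_loss(v_net, batch, tau: float, gamma: float = 0.99):
    """Sketch of an Expectile V-Learning style loss (cf. Equation 4 in the paper).

    The target r + gamma * V(s') is built purely from dataset transitions, so no
    out-of-distribution action is ever queried. The asymmetric weight is the
    slider: tau = 0.5 behaves like fitting the behavior value, while tau -> 1
    pushes the fit toward optimistic, near-optimal values.
    """
    s, r, s_next, done = batch["obs"], batch["rew"], batch["next_obs"], batch["done"]
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next).squeeze(-1)
    diff = target - v_net(s).squeeze(-1)             # TD residual, dataset support only
    weight = torch.abs(tau - (diff < 0).float())     # tau if diff >= 0, else (1 - tau)
    return (weight * diff.pow(2)).mean()
```

Because the target bootstraps from V(s′) on dataset transitions rather than a max over actions of a Q-function, the update never evaluates actions the dataset does not contain.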

DIAGRAM

Offline Expectile V-Learning and Planning Flow

This diagram shows how VEM applies Expectile V-Learning and implicit memory-based planning along offline trajectories to compute enhanced returns and advantages.

DIAGRAM

D4RL Evaluation and Ablation Pipeline

This diagram shows how VEM is evaluated on D4RL AntMaze, Adroit, and MuJoCo tasks and compared to BCQ, CQL, AWR, and BAIL with ablations.

PROCESS

How Value-based Episodic Memory Handles an Offline Training Session

  1. 01

    Expectile V-Learning

    VEM applies Expectile V-Learning to minimize the loss in Equation 4, updating the value network using the gradient expectile operator T_τ^μ.

  2. 02

    Implicit Memory-based Planning

    VEM runs implicit memory-based planning along offline trajectories, recursively computing R̂_t via Equation 5 or 6 using the current expectile V-values.

  3. 03

    Generalized Advantage-weighted Learning

    VEM calculates advantages Â(s_t, a_t) = R̂_t − V̂(s_t) and performs generalized advantage-weighted regression, as in Equation 7, to update the policy.

  4. 04

    Value-based Episodic Memory Update

    VEM repeats EVL, planning, and advantage-weighted learning, forming the full Value-based Episodic Memory loop described in Algorithm 1; a minimal code sketch of this loop follows the list.
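The loop above can be condensed into a short sketch. This is not the authors' reference implementation: the trajectory layout, the policy object with a log_prob method, the temperature β, and the weight clipping are assumptions; the exact recursions and objective are Equations 5-7 and Algorithm 1 in the paper.

```python
import torch

def plan_returns(rewards, next_values, gamma: float = 0.99):
    """Implicit memory-based planning along one offline trajectory (cf. Equations 5-6).

    rewards[t] is r_t and next_values[t] is V_hat(s_{t+1}), both 1-D tensors over a
    single trajectory. Working backward, R_hat[t] = r_t + gamma * max(R_hat[t+1],
    V_hat(s_{t+1})), so the planned return only reuses what the trajectory itself
    achieved or the current value estimate, never an unseen action.
    """
    T = len(rewards)
    r_hat = torch.zeros(T)
    for t in reversed(range(T)):
        if t == T - 1:
            # Bootstrap at the last step (use 0 instead if the trajectory terminates).
            r_hat[t] = rewards[t] + gamma * next_values[t]
        else:
            r_hat[t] = rewards[t] + gamma * torch.maximum(r_hat[t + 1], next_values[t])
    return r_hat


def advantage_weighted_loss(policy, obs, act, r_hat, v_hat, beta: float = 1.0):
    """Generalized advantage-weighted regression (cf. Equation 7): clone dataset
    actions weighted by exp(A_hat / beta), where A_hat = R_hat - V_hat."""
    adv = (r_hat - v_hat).detach()
    weight = torch.clamp(torch.exp(adv / beta), max=20.0)  # clipping is an assumption, for stability
    return -(weight * policy.log_prob(obs, act)).mean()

# One VEM-style iteration (sketch): take a gradient step on the expectile value loss,
# re-run plan_returns on each trajectory with the fresh V-values, then update the
# policy with advantage_weighted_loss.
```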

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Expectile V-Learning and VEM framework

    VEM introduces Expectile V-Learning to interpolate between behavior cloning and optimal value learning, with a contraction rate γ_τ = 1 − 2α(1 − γ)·min{τ, 1 − τ} and provable convergence (see the short numerical illustration after this list).

  • 02

    Value-based episodic memory planning

    VEM integrates implicit memory-based planning that strictly plans within offline trajectories, using R̂_t = r_t + γ·max(R̂_{t+1}, V̂(s_{t+1})) to accelerate value propagation.

  • 03

    State-of-the-art D4RL performance

    VEM achieves 87.5 on antmaze-umaze, 78.0 on antmaze-medium-play, and 128.3 on adroit-hammer-expert, surpassing BAIL, BCQ, CQL, and AWR on most AntMaze and Adroit tasks.
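As a quick numerical read on the contraction rate from the first contribution (illustrative α, γ, and τ values, not numbers from the paper):

```python
# Illustrative step size alpha and discount gamma; not values reported in the paper.
alpha, gamma = 0.5, 0.99

for tau in (0.5, 0.8, 0.95):
    gamma_tau = 1 - 2 * alpha * (1 - gamma) * min(tau, 1 - tau)
    print(f"tau={tau:.2f} -> contraction rate gamma_tau={gamma_tau:.4f}")

# tau = 0.5 maximizes min{tau, 1 - tau} and contracts fastest (0.9950 here); pushing
# tau toward 0 or 1 slows convergence, part of the trade-off the expectile slider exposes.
```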

RESULTS

By the Numbers

Normalized score, antmaze-umaze-fixed

87.5

+9.5 over BCQ

Normalized score, antmaze-medium-play

78.0

+78.0 over BCQ

Normalized score, adroit-hammer-expert

128.3

+4.8 over BAIL

Normalized score, adroit-relocate-expert

109.8

+15.4 over BAIL

These normalized scores come from the D4RL benchmark, where 0 corresponds to a random policy and 100 to an expert. The results show that VEM handles sparse-reward navigation (AntMaze) and high-dimensional dexterous control (Adroit) better than BCQ, CQL, AWR, and BAIL using only offline data.


BENCHMARK

D4RL AntMaze umaze fixed normalized scores

Normalized score (0 random, 100 expert) on antmaze-umaze-fixed from the D4RL benchmark.

BENCHMARK

D4RL Adroit hammer expert normalized scores

Normalized score (0 random, 100 expert) on adroit-hammer-expert from the D4RL benchmark.

KEY INSIGHT

The Counterintuitive Finding

In a random MDP, VEM with a properly chosen expectile τ can match near-optimal values despite noisy offline operators, while τ too close to 1 causes overestimation.

This is surprising because offline RL intuition suggests always pushing toward the Bellman optimality operator, but VEM shows that a more conservative τ often yields better fixed-point accuracy.
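A small numerical illustration of that overestimation effect, using made-up noisy return estimates rather than the paper's random-MDP construction: the τ-expectile stays near the mean for moderate τ but chases the largest noisy samples as τ approaches 1.

```python
import numpy as np

def expectile(samples, tau, iters=200):
    """Fixed-point iteration for the tau-expectile of a finite sample:
    solve sum_i w_i * (x_i - m) = 0 with w_i = tau if x_i > m else (1 - tau)."""
    m = samples.mean()
    for _ in range(iters):
        w = np.where(samples > m, tau, 1.0 - tau)
        m = np.sum(w * samples) / np.sum(w)
    return m

rng = np.random.default_rng(0)
noisy_returns = 1.0 + 0.5 * rng.standard_normal(1000)   # true value 1.0 plus noise

for tau in (0.5, 0.8, 0.95, 0.99):
    print(f"tau={tau:.2f}  expectile={expectile(noisy_returns, tau):.3f}")

# tau = 0.5 recovers the sample mean (~1.0); as tau -> 1 the expectile chases the
# largest noisy samples, the overestimation the paper warns about when tau is set
# too aggressively.
```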

WHY IT MATTERS

What this unlocks for the field

VEM unlocks stable offline value learning that never leaves dataset support, yet still approaches optimal policies through expectile tuning and episodic planning.

Builders can now tackle long-horizon, sparse-reward tasks like AntMaze and Adroit using only logged trajectories, without training behavior models or dynamics models or hand-tuning heavy Q-function regularizers.


