Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Authors: Thomas Schmied, Fabian Paischer, Vihang Patil et al.

2024

TL;DR

Retrieval-Augmented Decision Transformer (RA-DT) adds a vector-index external memory with cross-attention over retrieved sub-trajectories, enabling near-optimal in-context performance on Dark-Room 10×10 with only a 50-step context instead of full episodes.



THE PROBLEM

In-context RL breaks on long sparse episodes with full-episode context

Existing in-context RL methods require keeping entire episodes in context, which is infeasible when episodes contain thousands of interaction steps and sparse rewards.

This limitation affects Decision Transformer-style agents and Algorithm Distillation, making them hard to apply to Atari-like or real-world tasks where long episodes and sparse rewards are ubiquitous.

HOW IT WORKS

Retrieval-Augmented Decision Transformer — external memory for in-context RL

RA-DT’s core mechanism combines a vector index, embedding model g, maximum inner product search, experience reweighting, and cross-attention layers on top of a Decision Transformer backbone.

You can think of the vector index as a searchable disk of past trajectories, while the Decision Transformer acts like fast RAM that pulls in only the most relevant sub-trajectories.

This design lets RA-DT condition on targeted, high-utility experiences from arbitrarily long histories instead of being constrained by a fixed context window of entire episodes.
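The external-memory idea can be illustrated with a minimal sketch, assuming a flat in-memory index and a stand-in linear embedding model g (the paper uses a pre-trained DT or a frozen LM via FrozenHopfield). Keys are L2-normalized so maximum inner product search reduces to cosine similarity; all names and dimensions here are illustrative.

```python
import numpy as np

# Minimal sketch of RA-DT's external memory, assuming a flat in-memory index
# and a stand-in linear embedding model g. Keys are unit-norm, so maximum
# inner product search behaves like cosine similarity.

rng = np.random.default_rng(0)
traj_dim, embed_dim = 128, 64   # flattened sub-trajectory features / key size

W_g = rng.normal(size=(traj_dim, embed_dim)) / np.sqrt(traj_dim)  # stand-in for g

def embed(sub_traj: np.ndarray) -> np.ndarray:
    """Map a flattened sub-trajectory to a unit-norm retrieval key."""
    e = sub_traj @ W_g
    return e / np.linalg.norm(e)

# Index: keys are embeddings, values are the raw sub-trajectories themselves.
values = rng.normal(size=(1000, traj_dim))
keys = np.stack([embed(t) for t in values])

def retrieve_top_l(query_traj: np.ndarray, l: int = 16) -> np.ndarray:
    """Indices of the l stored sub-trajectories with the highest inner product."""
    scores = keys @ embed(query_traj)
    return np.argsort(scores)[::-1][:l]

top = retrieve_top_l(values[0], l=5)   # querying with a stored trajectory
```

A production version would swap the brute-force dot product for an approximate-nearest-neighbor index; the key-value split (embedding as key, raw sub-trajectory as value) is the part that matters.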

DIAGRAM

Inference Flow of RA-DT in In-context RL

This diagram shows how RA-DT uses the current sub-trajectory to query the vector index, reweights retrieved experiences, and predicts the next action during inference.

DIAGRAM

Training and Evaluation Pipeline for RA-DT

This diagram summarizes how RA-DT is trained on offline trajectories and later evaluated with in-context trials while populating the retrieval buffer online.

PROCESS

How RA-DT Handles an In-context RL Trial

  1. Vector index for retrieval augmentation

    RA-DT first uses the embedding model g to map sub-trajectories into a vector space and builds key-value pairs for the vector index.

  2. Searching for similar experiences

    Given the current context, RA-DT runs maximum inner product search over the vector index to retrieve the top l most similar sub-trajectories.

  3. Reweighting retrieved experiences

    RA-DT reweights retrieved sub-trajectories using relevance and utility scores, selecting the top k items that best match the current task and return profile.

  4. Incorporating retrieved experiences

    RA-DT feeds the selected sub-trajectories through cross-attention layers interleaved with self-attention to condition action prediction on both the current context and the retrieved memory.
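Steps 2-4 above can be sketched as follows. This is a hypothetical, simplified version: the relevance and utility scores, dimensions, and single-head attention (with projections omitted) are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical sketch: score l retrieved sub-trajectory embeddings by
# relevance x utility, keep the top k, and fuse them into the current
# context with one cross-attention layer plus a residual connection.

rng = np.random.default_rng(1)
d, l, k, T = 32, 8, 3, 10

retrieved = rng.normal(size=(l, d))   # embeddings of the l retrieved experiences
relevance = rng.uniform(size=l)       # e.g. similarity to the current context
utility = rng.uniform(size=l)         # e.g. achieved return of that experience

# Step 3: combined reweighting score, keep the k best items as memory.
score = relevance * utility
memory = retrieved[np.argsort(score)[::-1][:k]]   # (k, d)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query, mem):
    """Context tokens attend to retrieved memory (projections omitted for brevity)."""
    attn = softmax(query @ mem.T / np.sqrt(d))    # (T, k) attention weights
    return attn @ mem                             # (T, d) retrieved summary

context = rng.normal(size=(T, d))                   # current in-context tokens
fused = context + cross_attention(context, memory)  # residual connection
```

In the full model this cross-attention is interleaved with the Decision Transformer's self-attention layers, so action prediction conditions on both streams at every depth.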

KEY CONTRIBUTIONS

Key Contributions

  • Retrieval-augmented Decision Transformer

    RA-DT augments the Decision Transformer with a vector index, embedding model g, and cross-attention layers to retrieve and fuse relevant sub-trajectories for in-context RL.

  • Domain-agnostic trajectory embedding model

    RA-DT uses the FrozenHopfield mechanism with BERT as embedding model g, showing domain-agnostic retrieval can match domain-specific DT embeddings on grid-worlds.

  • Released datasets for in-context decision making

    RA-DT is evaluated on Dark-Room, Dark Key-Door, Maze-Runner, Meta-World, DMControl, and Procgen, and releases datasets for four environments to support future work.
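The FrozenHopfield idea behind the domain-agnostic embedding model can be sketched as below, under loose assumptions: a fixed random projection maps observations into a frozen token-embedding space, and a softmax over similarities retrieves a convex combination of token embeddings (which RA-DT then passes through BERT). The vocabulary size, dimensions, and temperature are toy stand-ins.

```python
import numpy as np

# Hypothetical FrozenHopfield sketch: project an observation into a frozen
# token-embedding space with a fixed random matrix, then soft-retrieve a
# convex combination of token embeddings via a softmax over similarities.
# Nothing here is trained.

rng = np.random.default_rng(2)
obs_dim, emb_dim, vocab = 16, 48, 100
beta = 10.0   # inverse temperature of the Hopfield retrieval

E = rng.normal(size=(vocab, emb_dim))                       # frozen token embeddings
P = rng.normal(size=(emb_dim, obs_dim)) / np.sqrt(obs_dim)  # fixed random projection

def frozen_hopfield(obs: np.ndarray) -> np.ndarray:
    """Soft-retrieve a token embedding for one observation; no learning involved."""
    sims = beta * (E @ (P @ obs))          # similarity to every token embedding
    p = np.exp(sims - sims.max())
    p /= p.sum()                           # softmax attention over the vocabulary
    return p @ E                           # convex combination, shape (emb_dim,)

token = frozen_hopfield(rng.normal(size=obs_dim))
```

Because the output lies in the span of the frozen embeddings, a text-only model like BERT can process it directly, which is what makes the retrieval keys domain-agnostic.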

RESULTS

By the Numbers

  • Mean reward, Dark-Room 10×10: near-optimal over 40 in-context trials; RA-DT uses a context length of 50 vs 100 for Algorithm Distillation.

  • Mean reward, Dark-Room 20×20: higher final reward; RA-DT improves over Decision-Pretrained Transformer with a shorter context.

  • Mean reward, Maze-Runner: 0.4 on test mazes vs 0.65 on train mazes, highlighting a generalization gap.

  • Training speed, Dark-Room 40×20: up to 7.0× faster than baselines, due to the shorter context length.

These results come from Dark-Room, Dark Key-Door, and Maze-Runner benchmarks that stress in-context RL with long, sparse-reward episodes, showing RA-DT maintains performance while drastically reducing context length and training time.


BENCHMARK

ICL performance on Dark-Room 10x10 at end of training

Mean reward over 40 in-context trials on 20 evaluation goals in Dark-Room 10x10.

KEY INSIGHT

The Counterintuitive Finding

Domain-agnostic RA-DT using FrozenHopfield plus BERT matches or even exceeds the domain-specific embedding model on Dark Key-Door 40×20.

This is surprising because one would expect a DT trained on the same domain to produce strictly better retrieval keys than a frozen language model trained only on text.

WHY IT MATTERS

What this unlocks for the field

RA-DT unlocks retrieval-augmented in-context RL where agents can leverage arbitrarily long histories without inflating the Transformer context window.

Builders can now plug generic language encoders into RL pipelines as external memory, enabling rapid task adaptation without expert demonstrations or parameter updates.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
