Evaluating Long-Term Memory in 3D Mazes

Authors: Jurgis Pasukonis, Timothy Lillicrap, Danijar Hafner

2022

TL;DR

Memory Maze evaluates long-term memory with randomized 3D mazes, on which Dreamer trained with truncated backpropagation through time (TBTT) reaches a return of 33.2 on the 9x9 maze versus 23.4 for IMPALA.



THE PROBLEM

RL agents lack long-term memory in partially observed 3D mazes

Memory Maze is motivated by the fact that many current algorithms remain limited to mostly fully observed environments and struggle in partially observed scenarios where the agent must integrate and retain information over many time steps.

This limitation means agents with a first-person view fail to remember object positions, maze layouts, and their own location, which directly harms navigation and planning performance.

HOW IT WORKS

Memory Maze — a 3D benchmark for long-term memory

Memory Maze centers on four Memory Maze tasks, a diverse offline dataset, and an offline probing protocol that together isolate long-term memory from exploration and credit assignment.

You can think of Memory Maze like a 3D scavenger hunt where the agent’s recurrent state is its RAM, the offline dataset is a recorded hard drive, and probing acts as a debugger reading internal registers.

This design lets Memory Maze test whether agents like Dreamer and IMPALA actually build internal maps and object memories, rather than just exploiting a short context window of recent frames.

DIAGRAM

Episode flow in Memory Maze online environment

This diagram shows how a Memory Maze episode unfolds as the agent repeatedly navigates to prompted colored objects using long-term memory.

DIAGRAM

Offline dataset and probing evaluation pipeline

This diagram shows how Memory Maze generates offline trajectories and evaluates representations via supervised probes on walls and object locations.

PROCESS

How Memory Maze handles a 3D navigation episode

  1. Environment

    Memory Maze initializes a randomized 3D maze at one of the four task sizes, placing 3 to 6 colored objects and fixing the wall layout for the episode.

  2. Human Performance

    Memory Maze records a human player navigating with first-person observations, showing rewards rising within each episode as the player memorizes object positions and maze connectivity.

  3. Offline Dataset

    Memory Maze uses an MPC planner with breadth-first search to generate 30k trajectories per maze size, storing images, actions, rewards, maze_layout, agent_pos, and targets_vec.

  4. Offline Probing

    Memory Maze trains sequence models such as RSSM and VAE GRU on the offline dataset, then fits probe networks to predict maze_layout and targets_vec from frozen 2048-dimensional states.
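The offline dataset described in the steps above can be pictured as a dictionary of time-aligned arrays. The key names (maze_layout, agent_pos, targets_vec) follow the steps; the shapes and dtypes below are illustrative assumptions, not the released format.

```python
import numpy as np

# Illustrative sketch of one offline trajectory. Only the key names come
# from the dataset description; sizes and dtypes are assumptions.
T = 1000  # assumed steps per trajectory
trajectory = {
    "image": np.zeros((T, 64, 64, 3), dtype=np.uint8),   # first-person RGB
    "action": np.zeros((T, 6), dtype=np.float32),        # agent controls
    "reward": np.zeros((T,), dtype=np.float32),          # +1 per object reached
    "maze_layout": np.zeros((T, 9, 9), dtype=np.uint8),  # wall grid (probe target)
    "agent_pos": np.zeros((T, 2), dtype=np.float32),     # ground-truth position
    "targets_vec": np.zeros((T, 2), dtype=np.float32),   # vector to current target
}

# Probing pairs each frame with its semantic labels, so every key is
# aligned along the time axis:
for key, value in trajectory.items():
    assert value.shape[0] == T
```

Storing ground-truth labels alongside images is what makes supervised probing possible without querying the simulator again.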

KEY CONTRIBUTIONS

Key Contributions

  • Environment

    Memory Maze introduces four MuJoCo-based 3D navigation tasks (9x9 to 15x15) where episode return equals the number of correctly reached colored objects, with oracle scores up to 87.7.

  • Offline Dataset

    Memory Maze releases two 30M-step datasets of 30k trajectories each, including semantic keys such as maze_layout, agent_pos, targets_pos, and targets_vec for offline RL and probing.

  • Offline Probing

    Memory Maze defines four probing benchmarks (Walls and Objects on 9x9 and 15x15) using a fixed 4-layer MLP probe on 2048-dimensional states to quantify long-term memory.
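A minimal NumPy sketch of that probing setup: a fixed 4-layer MLP maps a frozen 2048-dimensional state to 81 logits for a 9x9 wall grid. The input size, depth, and wall-grid target follow the benchmark description above; the hidden width of 1024 and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_forward(state, weights):
    """4-layer MLP probe: frozen 2048-dim state -> 81 wall logits."""
    h = state
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layers
    W, b = weights[-1]
    return h @ W + b                    # linear output layer

# 2048-dim input, 4 weight layers, 81 = 9x9 wall cells; hidden width
# of 1024 is an assumption.
dims = [2048, 1024, 1024, 1024, 81]
weights = [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]

state = rng.normal(size=(2048,))        # frozen sequence-model state (illustrative)
logits = probe_forward(state, weights)
walls = (logits > 0).astype(int).reshape(9, 9)  # predicted 9x9 wall grid
```

Because the probe is fixed and the states are frozen, probe accuracy measures what the sequence model remembers rather than what the probe can learn on its own.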

RESULTS

By the Numbers

Return, Memory 9x9: 33.2 (+9.8 over IMPALA)

Return, Memory 11x11: 31.4 (+3.2 over IMPALA)

Return, Memory 13x13: 21.4 (-6.1 vs human)

Return, Memory 15x15: 17.7 (-50.0 vs oracle)

These returns come from the online RL benchmark on the four Memory Maze tasks after 100M environment steps, comparing Dreamer variants, IMPALA, humans, and an oracle planner. The results show that Memory Maze exposes large gaps between RL agents and human-level long-term memory, especially on the 13x13 and 15x15 mazes.

BENCHMARK

Online RL benchmark results after 100M environment steps on Memory 9x9

Episode return on Memory 9x9 after 100M steps of training.

KEY INSIGHT

The Counterintuitive Finding

On Memory 9x9, Dreamer (TBTT) reaches 33.2 return, exceeding the human player at 26.4 and approaching the oracle at 34.8.

This is surprising because Memory Maze is designed as a human-challenging navigation task, yet truncated backpropagation through time lets Dreamer (TBTT) nearly match an oracle with full access to maze_layout.

WHY IT MATTERS

What this unlocks for the field

Memory Maze gives researchers a controlled way to separate long-term memory from exploration and credit assignment using the Walls and Objects probes.

With Memory Maze, builders can stress-test recurrent architectures, training regimes such as truncated backpropagation through time, and auxiliary probe losses before deploying agents in complex, partially observed 3D environments.
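Truncated backpropagation through time, the training regime named above, can be sketched as follows: the recurrent state is carried across truncation windows, so memory spans the whole episode, while gradients (in a real trainer) would flow only within each window. All sizes here are illustrative assumptions, and the copy() stands in for an autodiff detach/stop_gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(state, obs, W_s, W_o):
    """One recurrent update; a stand-in for a learned sequence model."""
    return np.tanh(state @ W_s + obs @ W_o)

state_dim, obs_dim, window = 32, 8, 50     # illustrative sizes, not paper values
W_s = rng.normal(0, 0.1, (state_dim, state_dim))
W_o = rng.normal(0, 0.1, (obs_dim, state_dim))

episode = rng.normal(size=(200, obs_dim))  # one long partially observed episode
state = np.zeros(state_dim)

boundary_states = []
for start in range(0, len(episode), window):
    # Within a window, backprop would flow through every step...
    for obs in episode[start:start + window]:
        state = rnn_step(state, obs, W_s, W_o)
    # ...but across windows the state is carried forward detached, so the
    # agent keeps its memory while gradient paths stay truncated.
    state = state.copy()  # stands in for detach() / stop_gradient()
    boundary_states.append(state)
```

Carrying the state forward is what lets a TBTT-trained agent remember events far beyond the truncation window, even though credit assignment never spans it.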


