Working Memory Graphs

Authors: Ricky Loynd, Roland Fernandez, Asli Celikyilmaz et al.

arXiv 2019

TL;DR

Working Memory Graphs uses persistent Memo vectors with Transformer self-attention to reach near-Depth-3 reasoning on the Pathfinding task while generalizing to 24-step graphs with 93.9% quiz accuracy.



THE PROBLEM

Sequential agents struggle with long-horizon reasoning in POMDPs

In the Pathfinding task, a GRU-based agent reached only 84.4% of the possible quiz score on 24-step episodes, versus 93.9% for Working Memory Graphs.

This gap shows that GRU hidden state chains lose critical past information, limiting reasoning over partially observed graphs and degrading downstream decision quality.

HOW IT WORKS

Working Memory Graphs implements shortcut recurrence with Memos and Factors

Working Memory Graphs introduces Memos, Factors, a Core vector, and a multi-layer Transformer to build a dynamic working memory graph over time.

You can think of Memos as RAM slots that persist key facts, while Factors are like incoming packets and the Core is the controller routing attention.

This shortcut recurrence lets Working Memory Graphs move information along many short attention paths instead of a single recurrent chain, enabling reasoning over long histories that a fixed context window cannot capture.
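As a rough sketch of this idea (the names, dimensions, and single-head attention below are illustrative simplifications, not the paper's implementation), each step stacks the Core, Factor, and Memo vectors into one input matrix and lets self-attention route information among them:

```python
import numpy as np

def self_attention(X, d_k, seed=0):
    # Single-head scaled dot-product self-attention over all input
    # vectors; the real WMG uses a multi-layer, multi-head
    # Transformer encoder. Weights are random for illustration.
    rng = np.random.default_rng(seed)
    d_in = X.shape[1]
    Wq = rng.standard_normal((d_in, d_k)) * 0.1
    Wk = rng.standard_normal((d_in, d_k)) * 0.1
    Wv = rng.standard_normal((d_in, d_k)) * 0.1
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax attention weights
    return w @ V

d = 16
core = np.ones((1, d))        # Core vector: current-step summary
factors = np.zeros((3, d))    # Factor vectors: entities in this observation
memos = np.full((6, d), 0.5)  # Memo vectors: persistent working memory

X = np.vstack([core, factors, memos])  # 1 + 3 + 6 = 10 input vectors
out = self_attention(X, d_k=8)         # shape (10, 8)
```

Because every vector attends to every other, a fact stored in a Memo several steps ago is one attention hop away from the current Core, rather than many recurrent transitions away.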

DIAGRAM

Information flow over time in Working Memory Graphs

This diagram shows how Working Memory Graphs propagates information from early observations to later actions via persistent Memos in the unrolled Pathfinding task.

DIAGRAM

Evaluation pipeline across Pathfinding, BabyAI, and Sokoban

This diagram shows how Working Memory Graphs is trained and evaluated on Pathfinding, BabyAI levels, and Sokoban with hyperparameter tuning via Distributed Grid Descent.

PROCESS

How Working Memory Graphs handles a Pathfinding episode

  1. 01

    Observation processing

    Working Memory Graphs receives the current Pathfinding observation and encodes it into the Core vector, optionally creating Factor vectors when observations are factored.

  2. 02

    Transformer self-attention

    Working Memory Graphs stacks the Core, Factors, and Memos into the Transformer input and applies multi-head self-attention across all vectors.

  3. 03

    Memo update

    Working Memory Graphs takes the Core output h, applies a nonlinear layer tanh(h·W_M + b_M), and inserts the new Memo while shifting older Memos.

  4. 04

    Actor critic decision

    Working Memory Graphs feeds h through a shared actor-critic layer and then computes the policy π and value V to choose the quiz-answer action.
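The four steps above can be sketched end to end. This is a minimal toy version: mean pooling stands in for the Transformer pass, and all weight names and shapes are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_memos, n_actions = 16, 6, 4

# Hypothetical parameters (shapes illustrative only).
W_M, b_M = rng.standard_normal((d, d)) * 0.1, np.zeros(d)  # Memo head
W_pi = rng.standard_normal((d, n_actions)) * 0.1           # policy head
W_V = rng.standard_normal((d, 1)) * 0.1                    # value head

memos = np.zeros((n_memos, d))  # empty working memory at episode start

def wmg_step(core, factors, memos):
    # 01 Observation processing: stack Core, Factors, and Memos.
    X = np.vstack([core[None, :], factors, memos])
    # 02 Transformer self-attention, reduced here to a mean over the
    # inputs for brevity; the real agent applies multi-head attention
    # and reads h from the Core's output position.
    h = np.tanh(X.mean(axis=0))
    # 03 Memo update: new Memo from h; older Memos shift, oldest drops.
    new_memo = np.tanh(h @ W_M + b_M)
    memos = np.vstack([new_memo[None, :], memos[:-1]])
    # 04 Actor-critic decision: policy pi and value V from h.
    logits = h @ W_pi
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    V = float(h @ W_V)
    return pi, V, memos

core = rng.standard_normal(d)
factors = rng.standard_normal((3, d))
pi, V, memos = wmg_step(core, factors, memos)
```

Running one step inserts a fresh Memo at the front of the buffer while the five older (here all-zero) Memos shift back one slot, mirroring the fixed-size Memo queue described above.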

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Working Memory Graph architecture

    Working Memory Graphs introduces persistent Memos, Factor vectors, and a Core vector processed by a Transformer encoder to implement shortcut recurrence over long histories.

  • 02

    Synergy with factored observations

    Working Memory Graphs shows that factored observations in BabyAI yield over 10x better sample efficiency on Level 3 compared to CNN+GRU with the native 7x7 image.

  • 03

    Strong results on diverse tasks

    Working Memory Graphs nearly reaches Depth 3 performance on Pathfinding, solves BabyAI PickupLoc, and surpasses DRC on Sokoban using factored observation spaces.
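To illustrate what a factored observation space might look like, each visible object can become one Factor vector of attribute encodings instead of a pixel patch. The attribute sets, layout, and normalization below are assumptions for illustration, not BabyAI's actual encoding.

```python
import numpy as np

# Hypothetical attribute vocabularies for a BabyAI-style grid world.
COLORS = ["red", "green", "blue"]
TYPES = ["ball", "key", "door"]

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def factor_vector(obj_type, color, x, y, grid=7):
    # One Factor per visible object: type and color one-hots plus
    # normalized coordinates, rather than a flat 7x7x3 image for a CNN.
    return np.concatenate([
        one_hot(TYPES.index(obj_type), len(TYPES)),
        one_hot(COLORS.index(color), len(COLORS)),
        [x / grid, y / grid],
    ])

factors = np.stack([
    factor_vector("ball", "red", 2, 3),
    factor_vector("door", "blue", 6, 1),
])  # one row per object, shape (2, 8)
```

Each such row can be fed to the Transformer as a Factor, which is the representation the sample-efficiency comparison above credits for the speedup over CNN+GRU.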

RESULTS

By the Numbers

Quiz reward percentage

93.9%

+9.5 percentage points over GRU on 24-step Pathfinding episodes

BabyAI Level 3 steps

16.0k

188.9k fewer interactions than the CNN+GRU baseline on GoToRedBall

BabyAI Level 3 CNN steps

204.9k

CNN+GRU native 7x7x3 requires 204.9k interactions to reach 99% success

BabyAI Level 5 steps

222.3k

Working Memory Graphs factored solves PickupLoc in 222.3k interactions; GRU variants do not reach 99% within 6M steps

These numbers come from the Pathfinding generalization test and BabyAI sample efficiency table, showing how Working Memory Graphs leverages Memos and factored observations to reduce environment interactions while maintaining high accuracy.

BENCHMARK

BabyAI GoToRedBall sample efficiency comparison

Thousands of environment interactions needed to reach 99% success on BabyAI Level 3 GoToRedBall.

KEY INSIGHT

The Counterintuitive Finding

Working Memory Graphs with only 6 Memos still beats a non-recurrent Working Memory Graphs baseline that sees the last 6 observations in Pathfinding.

This is surprising because stacking raw observations, as DQN does, seems richer than a tiny recurrent buffer, yet Working Memory Graphs uses shortcut recurrence to reason beyond that window.

WHY IT MATTERS

What this unlocks for the field

Working Memory Graphs shows that Transformer-style self-attention over a small set of persistent Memos can replace long recurrent chains for RL agents in POMDPs.

Builders can now design agents that exploit factored observation spaces and compact working memories to achieve long-horizon reasoning and high sample efficiency in complex environments like BabyAI and Sokoban.

