MEMO: A Deep Network for Flexible Combination of Episodic Memories

Authors: Andrea Banino, Adrià Puigdomènech Badia, Raphael Köster et al.

2020

TL;DR

MEMO uses separated episodic facts plus an adaptive halting policy to reach 0.21% error on joint bAbI 10k, beating Memory Networks by 3.99 percentage points.

THE PROBLEM

Memory architectures struggle with distant associations in reasoning tasks

MEMO is motivated by the finding that current architectures struggle to reason over long-distance associations and fail on complex inference tasks.

When Paired Associative Inference and shortest-path tasks require chaining multiple facts, End-to-End Memory Networks (EMN), the Differentiable Neural Computer (DNC), and the Universal Transformer often mishandle indirect queries, limiting robust reasoning.

HOW IT WORKS

MEMO — flexible episodic memory with adaptive hops

MEMO introduces a common embedding for each fact, separate keys and values per head, recurrent attention, and a halting policy that learns how many memory hops to take.

You can think of MEMO like a hippocampus-inspired system: facts are stored as separate episodes, and a recurrent retrieval loop selectively chains them, similar to pattern completion in biological memory.

This combination of separated facts and adaptive multi-hop retrieval lets MEMO discover multi-step relationships that a fixed-depth, single-pass context window cannot capture.
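As a toy illustration of this mechanism (a minimal sketch with made-up random embeddings, not the paper's actual model), two separately stored facts A-B and B-C can be chained by two attention hops to answer an indirect A-to-C query:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical item embeddings (random, high-dimensional so matches are clean).
rng = np.random.default_rng(0)
emb = {name: rng.normal(size=64) for name in "ABC"}

# Facts stored as SEPARATE episodes rather than one concatenated context:
# episode 1 links A -> B, episode 2 links B -> C.
keys = np.stack([emb["A"], emb["B"]])
values = np.stack([emb["B"], emb["C"]])

def two_hop_query(start="A"):
    # Hop 1: query with the start item, retrieve its direct associate (~B).
    attn1 = softmax(keys @ emb[start])
    hop1 = attn1 @ values
    # Hop 2: re-query with the retrieved item to reach the indirect associate (~C).
    attn2 = softmax(keys @ hop1)
    hop2 = attn2 @ values
    # Decode: nearest stored item to the two-hop readout.
    return max(emb, key=lambda n: hop2 @ emb[n])

print(two_hop_query())
```

Storing each fact as its own memory row is what allows the second hop to re-query with the intermediate item; a single concatenated, fixed-depth context would have to resolve the whole chain in one pass.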

DIAGRAM

MEMO inference flow across memory hops

This diagram shows how MEMO repeatedly queries external memory with recurrent attention and uses the halting policy to decide when to stop hopping.

DIAGRAM

Evaluation pipeline across PAI, shortest path, and bAbI

This diagram shows how MEMO is trained and evaluated on Paired Associative Inference, shortest path, and bAbI tasks.

PROCESS

How MEMO Handles a Paired Associative Inference Query

  1. Common embedding of inputs

    MEMO first applies the common embedding matrix Wc to each input xi, producing ci that preserves all items in each episodic fact.

  2. Multi-head key and value projection

    MEMO flattens ci and uses Wk(h) and Wv(h) to create multi-head keys k(h)_i and values v(h)_i, while Wq(h) embeds the query q.

  3. Recurrent attention over memory

    MEMO runs recurrent attention using Wh, Wq, dropout, and layer normalization so the query state Qt iteratively focuses on linked facts across hops.

  4. Halting policy and answer prediction

    MEMO feeds the Bhattacharyya distance d(Wt, Wt-1) between successive attention weights, along with the step index t, into the GRU-based halting policy, then stops and predicts the answer at through Wa and Wqa.
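The four steps above can be sketched end to end. This is a single-head, untrained toy version with hypothetical shapes and a simplified halting rule (a sigmoid over the distance and step index, instead of the paper's learned GRU policy), meant only to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

D, H = 16, 16   # input width and head width (hypothetical)
N = 5           # number of stored episodic facts

# Step 1: common embedding Wc maps each raw fact xi to ci.
W_c = rng.normal(scale=0.1, size=(D, D))
facts = rng.normal(size=(N, D))
c = facts @ W_c.T

# Step 2: separate key and value projections (one head for brevity).
W_k = rng.normal(scale=0.1, size=(H, D))
W_v = rng.normal(scale=0.1, size=(H, D))
W_q = rng.normal(scale=0.1, size=(H, D))
K, V = c @ W_k.T, c @ W_v.T

def bhattacharyya(p, q):
    # Distance between successive attention distributions Wt-1 and Wt.
    return -np.log(np.sum(np.sqrt(p * q)) + 1e-12)

W_h = rng.normal(scale=0.1, size=(H, H))  # recurrent query update
w_halt = rng.normal(size=2)               # toy halting net over [distance, t]

def memo_forward(x_query, max_hops=8):
    # Steps 3-4: recurrent attention with an adaptive number of hops.
    q_t = W_q @ x_query
    prev_attn = np.full(N, 1.0 / N)
    for t in range(1, max_hops + 1):
        attn = softmax(K @ q_t / np.sqrt(H))
        read = attn @ V                      # weighted read from memory
        q_t = np.tanh(W_h @ (q_t + read))    # update the query state
        d = bhattacharyya(prev_attn, attn)
        halt_prob = 1.0 / (1.0 + np.exp(-(w_halt @ np.array([d, t]))))
        if halt_prob > 0.5:  # the paper samples from a GRU-based policy instead
            break
        prev_attn = attn
    return read, t

answer_state, hops = memo_forward(rng.normal(size=D))
print(answer_state.shape, hops)
```

In the real model the readout would pass through the answer projections (Wa, Wqa) and multiple heads; here the point is only that the number of hops is decided at run time rather than fixed in advance.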

KEY CONTRIBUTIONS

Key Contributions

  • Paired Associative Inference task

    MEMO introduces Paired Associative Inference, including A-B-C, A-B-C-D, and A-B-C-D-E chains, to stress distant relationships across multiple facts.

  • Flexible episodic memory representation

    MEMO keeps facts separated in external memory and uses multi-head recurrent attention over keys and values to support inferential reasoning.

  • REINFORCE-based halting policy

    MEMO adds a REINFORCE-trained halting policy with an LHop term that directly minimizes the expected number of computation steps.
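The halting objective in the last contribution can be sketched as a standard REINFORCE estimator plus a hop penalty. The reward, baseline, and lambda values below are hypothetical placeholders, not the paper's settings, and the sketch penalizes the realized hop count rather than its expectation:

```python
import numpy as np

def halting_loss(log_probs, halted_at, reward, baseline, lam=1e-3):
    """REINFORCE-style objective for a halting policy (sketch).

    log_probs: log-prob of each halt/continue decision taken, shape (T,)
    halted_at: number of hops actually used (T)
    reward:    task reward (e.g. 1.0 if the answer was correct)
    lam:       weight of the LHop term penalizing computation steps
    """
    advantage = reward - baseline
    # Policy-gradient term: reinforce the taken decisions by the advantage.
    pg = -np.sum(log_probs) * advantage
    # LHop term: penalize the number of hops (realized, as a stand-in for
    # the expected number of steps minimized in the paper).
    l_hop = lam * halted_at
    return pg + l_hop

# Hypothetical episode: 3 hops, correct answer, running baseline of 0.6.
logp = np.log(np.array([0.7, 0.8, 0.9]))
loss = halting_loss(logp, halted_at=3, reward=1.0, baseline=0.6)
print(loss)
```

The advantage term pushes the policy toward decision sequences that led to correct answers, while the LHop term trades a small amount of reward for fewer memory hops.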

RESULTS

By the Numbers

  • bAbI 10k joint error: 0.21% (-3.99 percentage points vs Memory Networks at 4.2%)

  • PAI A-C accuracy: 98.26% (+37.25 percentage points vs EMN at 61.01%)

  • PAI A-D accuracy: 97.22% (+48.56 percentage points vs EMN at 48.66%)

  • Graph 20 5, first node accuracy: 69.20% (+45.21 percentage points vs DNC at 23.99%)

These metrics come from Paired Associative Inference, shortest path on random graphs, and joint bAbI 10k benchmarks. The main result is that MEMO reliably handles long-distance reasoning while adaptively allocating computation to harder queries.


BENCHMARK

Paired Associative Inference — hardest query accuracy

Accuracy on the hardest inference query for each PAI length (A-C, A-D, A-E).

KEY INSIGHT

The Counterintuitive Finding

On the A-B-C-D-E PAI task, MEMO reaches 84.54% accuracy on A-E, while EMN stays at 45.13% and DNC at 62.61%.

This is surprising because DNC and EMN already use external memory, yet MEMO’s separated facts plus recurrent attention nearly double EMN’s performance on the longest chain.

WHY IT MATTERS

What this unlocks for the field

MEMO shows that episodic memories stored as separate facts, combined with adaptive multi-hop retrieval, can support robust long-distance inferential reasoning.

Builders can now design memory systems that automatically allocate more computation to harder queries, chaining multiple experiences without hand-tuned hop counts or quadratic self-attention.


Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG routes low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes via evidence-based retirement inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines while a compute-matched Retry baseline stays flat.
