Self-Attentive Associative Memory

Authors: Hung Le, Truyen Tran, Svetha Venkatesh

2020

TL;DR

The paper's SAM-based Two-memory Model (STM) uses the Self-attentive Associative Memory (SAM) operator with outer-product self-attention to reach 0.39% mean error on bAbI, beating MNM-p by 0.16 points.



THE PROBLEM

Neural memories lack explicit relational storage and reuse

Existing memory-augmented networks store items but not rich relationships, leading to lossy memory interactions and weak relational reasoning.

STM targets tasks like Nth-farthest and Relational Associative Recall, where the absence of relational memory prevents retrieval of items conditioned on complex relationships between them.
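To make the relational demand concrete, here is a minimal NumPy sketch of the kind of query Nth-farthest poses. The task comes from the relational-reasoning benchmark STM is evaluated on; the encoding below is illustrative, not the paper's input format:

```python
import numpy as np

# Illustrative Nth-farthest query: "which vector is the n-th farthest
# from vector m?" Answering requires pairwise distances, i.e. relations
# between stored items, not any single item on its own.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 16))        # 8 labeled items, 16-dim each
n_th, m = 3, 5                            # query: 3rd farthest from item 5

dists = np.linalg.norm(vectors - vectors[m], axis=1)
answer = int(np.argsort(-dists)[n_th - 1])   # label of the n-th farthest
print(answer)
```

A memory that stores only the items must recompute all pairwise distances at query time; a relational memory can store those distances as they form.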

HOW IT WORKS

Self-Attentive Associative Memory and the SAM-based Two-memory Model

STM, the SAM-based Two-memory Model, centers on Outer Product Attention (OPA) and the Self-attentive Associative Memory (SAM) operator, wired through Mi-Write, Mr-Read, Mi-Read Mr-Write, and Mr-Transfer between the item and relational memories.

You can view STM as a brain-inspired system in which the associative item memory plays the role of the perirhinal cortex and the higher-order relational memory that of the hippocampus, linked by outer-product attention.

By using SAM’s outer-product bindings instead of plain dot-product attention, STM preserves bit-level relationships that a fixed context window or scalar attention scores cannot represent or reuse.
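The contrast is easiest to see in code. Below is a minimal sketch assuming the attention weights come from a standard scaled dot product and each value is bound to its key by an outer product; the paper's exact OPA definition may differ in normalization details. The point is the output shape: dot-product attention collapses everything to a d-vector, while the outer-product binding keeps a full d × d association per query.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 10, 16
q = rng.normal(size=d)           # one query
K = rng.normal(size=(n, d))      # keys
V = rng.normal(size=(n, d))      # values

w = softmax(q @ K.T / np.sqrt(d))             # shared attention weights, (n,)

dot_out = w @ V                               # dot-product attention: (d,)
opa_out = np.einsum('i,ij,ik->jk', w, V, K)   # outer-product binding: (d, d)

# dot_out collapses each value to a scalar-weighted vector; opa_out keeps
# a full d x d value-key association, so element-level pairings survive.
print(dot_out.shape, opa_out.shape)           # (16,) (16, 16)
```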

DIAGRAM

Sequential interaction between item and relational memory in STM

This diagram shows how STM processes a timestep, updating Mi and Mr and producing ot.

DIAGRAM

Evaluation pipeline across tasks for STM

This diagram shows how STM is trained and evaluated on synthetic, geometric, RL, and bAbI tasks.

PROCESS

How Self-Attentive Associative Memory Handles a Sequential Task

  1. Mi-Write

    STM uses Mi-Write to encode xt into the item memory Mi via the outer product Xt = f1(xt) ⊗ f2(xt), followed by the gated update of Eq. 10.

  2. Mr-Read

    STM applies Mr-Read to contract the relational memory Mr with f3(xt) and f2(xt), producing a read-out vr_t that summarizes distant relational information.

  3. Mi-Read Mr-Write

    STM feeds Mi_t and vr_t ⊗ f2(xt) into the SAM operator, using Outer Product Attention (OPA) to write new hetero-associative memories into Mr_t.

  4. Mr-Transfer and Output Distillation

    STM uses Mr-Transfer with G1 to enrich Mi with higher-order relations, and applies G2 and G3 to distill Mr into the output vector ot. A hedged sketch of the full timestep follows this list.
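Below is a compact NumPy sketch of one STM timestep under loud simplifying assumptions: f1–f3 and the SAM projections are plain linear maps, a fixed decay stands in for the learned gating of Eq. 10, the vr_t ⊗ f2(xt) feed into SAM is omitted, G1–G3 collapse into a single read-out, and the shapes (Mi as n × d, Mr as nq × d × d) are chosen for illustration rather than taken from the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, n, d, nq = 32, 8, 16, 8   # illustrative sizes (assumed, not from the paper)

# Assumed linear maps standing in for the paper's f1-f3.
f1 = rng.normal(size=(d_in, n)) * 0.1
f2 = rng.normal(size=(d_in, d)) * 0.1
f3 = rng.normal(size=(d_in, d)) * 0.1
# Assumed projections for the SAM operator's queries/keys/values.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sam(M):
    """SAM-like lift: matrix memory (n, d) -> relational tensor (nq, d, d)."""
    Q, K, V = M[:nq] @ Wq, M @ Wk, M @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))              # (nq, n) attention weights
    return np.einsum('qi,ij,ik->qjk', A, V, K)     # outer-product bindings

Mi = np.zeros((n, d))
Mr = np.zeros((nq, d, d))
x = rng.normal(size=d_in)                          # input x_t at one timestep

# 1) Mi-Write: bind x_t into the item memory. A fixed 0.9 decay stands in
#    for the learned gated update of the paper's Eq. 10.
Mi = 0.9 * Mi + np.outer(x @ f1, x @ f2)

# 2) Mr-Read: contract the relational tensor with two views of x_t,
#    yielding a relational read-out (the paper's vr_t, simplified here).
vr = np.einsum('qjk,j,k->q', Mr, x @ f3, x @ f2)

# 3) Mi-Read / Mr-Write: lift the item memory into relational storage via
#    the SAM-like operator. (The paper also writes vr_t ⊗ f2(xt) into Mr;
#    that hetero-associative term is omitted in this sketch.)
Mr = 0.9 * Mr + sam(Mi)

# 4) Output distillation: G1-G3 collapsed into one assumed linear read-out.
G = rng.normal(size=(nq * d * d, d)) * 0.01
o_t = Mr.reshape(-1) @ G
print(o_t.shape)   # (16,)
```

The sketch only shows the data flow: items are bound into Mi by outer products, SAM lifts Mi into third-order relational storage, and the output step distills Mr back into a vector.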

KEY CONTRIBUTIONS

Key Contributions

  • SAM-based Two-memory Model (STM)

    STM introduces a dual system with an item memory Mi and a relational memory Mr, linked by Mi-Write, Mr-Read, Mi-Read Mr-Write, and Mr-Transfer, to jointly support memorization and relational reasoning.

  • Self-attentive Associative Memory (SAM) operator

    The paper defines SAM, which uses Outer Product Attention (OPA) to transform a second-order item memory into a third-order relational memory storing d² scalars per query.

  • State-of-the-art bAbI performance

    STM achieves 0.39 ± 0.18 mean error and 0.15 best error on bAbI, improving over MNM-p's 0.55 ± 0.74 and 0.18.

RESULTS

By the Numbers

Mean error: 0.39% (−0.16 vs MNM-p's 0.55%)

Best error: 0.15% (−0.03 vs MNM-p's 0.18%)

Associative retrieval (length 30): 10 epochs to converge (−25 vs WeiNet's 35; −40 vs Fast weights' 50)

Nth-farthest accuracy: 98% (+7 points vs RMC's 91%)

On the bAbI question-answering benchmark, which tests 20 reasoning tasks, STM achieves 0.39 ± 0.18 mean error and 0.15 best error. These results show that STM's dual item–relational memory with SAM improves both accuracy and stability over prior memory networks like MNM-p and DNC.


BENCHMARK

bAbI task: mean error over 20 tasks

Mean error (%) on the joint bAbI 20-task benchmark.

BENCHMARK

Nth-farthest task: test accuracy comparison

Test accuracy (%) on the Nth-farthest relational reasoning task.

KEY INSIGHT

The Counterintuitive Finding

With nq = 8, STM reaches 98% accuracy on Nth-farthest, while TPR achieves only 13% despite being designed for reasoning.

This is surprising because high-order fast-weight models like TPR are expected to excel at relational tasks, yet STM’s SAM-based dual memory yields an 85-point advantage.

WHY IT MATTERS

What this unlocks for the field

STM unlocks reusable, high-fidelity relational memory that can be read, updated, and distilled across long sequences.

Builders can now design agents that jointly memorize rich items and their higher-order relationships, enabling tasks like Relational Associative Recall and feature-based graph reasoning that were previously brittle or impractical.


Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG routes low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes entries via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines while a compute-matched Retry baseline stays flat.
