Understanding Factual Recall in Transformers via Associative Memories

Authors: Eshaan Nichani, Jason D. Lee, Alberto Bietti

2024

TL;DR

Understanding Factual Recall in Transformers via Associative Memories shows how treating attention and MLP weights as associative memories yields near-linear-in-parameters factual storage capacity, matching information-theoretic limits up to logarithmic factors.



THE PROBLEM

Factual recall capacity without clear storage limits

Transformers trained on factual recall tasks empirically store facts at a rate proportional to their parameter count, but the mechanism that achieves this memorization capacity, and whether it is optimal, has not been characterized.

Without understanding how self-attention and MLP weights store facts, builders cannot reason about scaling laws or about where factual information actually lives in transformer parameters.

HOW IT WORKS

Associative memories inside shallow transformers

Understanding Factual Recall in Transformers via Associative Memories analyzes linear and MLP associative memories as building blocks, then treats the value matrices of multi-head self-attention and the weights of a one-layer MLP as outer-product stores for subject-relation-answer triples.

You can think of this construction as a superposed RAM table: each weight matrix is a dense card catalog of key-value pairs packed together via outer products.

This associative view lets the paper trade off attention value matrices against MLP parameters to store Θ(N) facts, with guarantees that can be analyzed information-theoretically, something a plain context window does not provide.
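
To make the outer-product picture concrete, here is a minimal NumPy sketch of a linear associative memory: N key-value associations are superposed in a single d × d matrix and read back by decoding against the nearest value embedding. The dimensions, vocabulary size, and decoding rule are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 256        # embedding dimension
N = 1000       # number of (key, value) associations to store
vocab = 512    # number of possible answer tokens

# Random, roughly orthogonal embeddings for keys and answer tokens.
keys = rng.standard_normal((N, d)) / np.sqrt(d)
value_embed = rng.standard_normal((vocab, d)) / np.sqrt(d)
targets = rng.integers(0, vocab, size=N)   # which answer each key maps to

# Linear associative memory: superpose one outer product per stored fact.
W = np.zeros((d, d))
for k, t in zip(keys, targets):
    W += np.outer(value_embed[t], k)

# Retrieval: apply W to each key, then decode by the nearest answer embedding.
logits = keys @ W.T @ value_embed.T        # (N, vocab) scores
accuracy = np.mean(logits.argmax(axis=1) == targets)
print(f"recall accuracy with N={N} facts in a {d}x{d} matrix: {accuracy:.3f}")
```

As N grows toward d² (up to constants and log factors), the cross-term noise in the superposition overwhelms the signal and recall degrades, which is exactly the regime the capacity results characterize.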

DIAGRAM

Synthetic factual recall sequence flow

This diagram shows how Understanding Factual Recall in Transformers via Associative Memories generates and processes sequences in the synthetic factual recall task.

DIAGRAM

Evaluation pipeline for associative memory capacity

This diagram shows how Understanding Factual Recall in Transformers via Associative Memories evaluates storage capacity of linear and MLP associative memories across dataset sizes.

PROCESS

How Understanding Factual Recall in Transformers via Associative Memories handles the synthetic factual recall task

  1. 01

    Data distribution over sequences

    Understanding Factual Recall in Transformers via Associative Memories defines a distribution over length T + 1 sequences built from subject tokens, relation tokens, noise tokens, an EOS token, and the answer token (see the data-generation sketch after this list).

  2. 02

    One layer transformer architecture

    Understanding Factual Recall in Transformers via Associative Memories instantiates a single multi-head self-attention layer followed by an MLP of width m, with random token embeddings.

  3. 03

    Associative memory constructions

    Understanding Factual Recall in Transformers via Associative Memories configures value matrices and MLP weights as linear and MLP associative memories that map subject-relation pairs to answers.

  4. 04

    Gradient flow dynamics analysis

    Understanding Factual Recall in Transformers via Associative Memories studies gradient flow on a linear attention variant, revealing sequential learning dynamics with a hallucination stage before convergence to zero loss.
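
As a concrete reference for step 01, the sketch below samples sequences from one plausible instantiation of the synthetic distribution: a subject token and a relation token placed among noise tokens, followed by EOS, with the answer for that subject-relation pair as the final token. The token-ID layout, vocabulary sizes, and shuffling rule are assumptions for illustration, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

S, R, T = 50, 10, 8        # subjects, relations, context length
num_noise = 100            # size of the noise-token vocabulary
A = 40                     # answer vocabulary size

# Token-id layout (illustrative): subjects, then relations, then noise tokens, then EOS.
subj_ids = np.arange(S)
rel_ids = S + np.arange(R)
noise_ids = S + R + np.arange(num_noise)
EOS = S + R + num_noise

# Ground-truth fact table: every (subject, relation) pair has one fixed answer token.
fact_table = rng.integers(0, A, size=(S, R))

def sample_sequence():
    """Sample one length T + 1 sequence: a T-token context followed by the answer."""
    s = rng.integers(0, S)
    r = rng.integers(0, R)
    noise = rng.choice(noise_ids, size=T - 3, replace=True)
    context = np.concatenate(([subj_ids[s], rel_ids[r]], noise))
    rng.shuffle(context)                              # subject and relation can appear anywhere
    return np.concatenate((context, [EOS], [fact_table[s, r]]))

print(sample_sequence())   # the last token is the answer to predict from the first T tokens
```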

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Storage capacity of associative memories

    Understanding Factual Recall in Transformers via Associative Memories proves that linear associative memories store N injective associations when d² ≳ N polylog N, and MLP associative memories store N associations when md ≳ N polylog N.

  • 02

    Synthetic factual recall transformer

    Understanding Factual Recall in Transformers via Associative Memories introduces a synthetic next-token factual recall task and shows a one-layer transformer achieves 100 percent accuracy when either the attention parameter count 4H·d·d_h or the MLP parameter count m·d scales linearly with the number of subject-relation pairs S·R (see the architecture sketch after this list).

  • 03

    Sequential hallucination dynamics

    Understanding Factual Recall in Transformers via Associative Memories analyzes gradient flow of a linear attention model and proves a hallucination stage where predictions match the relation-conditional distribution before full factual recall.
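
The architecture referenced in contribution 02 (and in step 02 of the process) could look like the following PyTorch sketch: frozen random token embeddings, one multi-head self-attention layer, a width-m MLP, and a readout of answer logits at the final EOS position. The module layout, residual wiring, and hyperparameters are illustrative assumptions; the vocabulary sizes match the data sketch above.

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    """One multi-head self-attention layer plus a width-m MLP (illustrative sketch)."""
    def __init__(self, vocab_size=161, d=128, num_heads=4, m=512, num_answers=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.embed.weight.requires_grad_(False)   # random, frozen token embeddings
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, d))
        self.unembed = nn.Linear(d, num_answers, bias=False)

    def forward(self, tokens):                    # tokens: (batch, T), context ending in EOS
        x = self.embed(tokens)                    # (batch, T, d)
        attn_out, _ = self.attn(x, x, x)          # single self-attention layer
        x = x + attn_out                          # residual around attention
        x = x + self.mlp(x)                       # tokenwise MLP with residual
        return self.unembed(x[:, -1, :])          # answer logits at the final (EOS) position

model = OneLayerTransformer()
logits = model(torch.randint(0, 161, (4, 8)))     # batch of 4 contexts of length T = 8
print(logits.shape)                               # torch.Size([4, 40])
```

Either the attention value/output projections or the MLP weights can host the associative memory in this setup; the capacity results say that sizing one of the two blocks linearly in S·R suffices.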

RESULTS

By the Numbers

Stored associations (linear memory)

Θ(d²) capacity

scales with parameter count for linear associative memories

Stored associations (MLP memory)

Θ(md) capacity

scales with parameter count for MLP associative memories

Factual recall accuracy

100% on task

requires 4H·d·d_h or m·d ≳ S·R up to log factors

Bit complexity lower bound

N log M bits

matches associative memory constructions up to logarithmic factors

Understanding Factual Recall in Transformers via Associative Memories evaluates synthetic associative memory tasks and a factual recall distribution over S × R subject-relation pairs, showing near-optimal Θ(parameters) storage and exact 100 percent recall under provable scaling conditions.

BENCHMARK

Associative memory capacity versus parameter count

Number of associations N that Understanding Factual Recall in Transformers via Associative Memories can store as a function of parameter count for different memory constructions.
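
One simple way to produce such a curve is to sweep the number of stored associations N for a fixed dimension d and record the retrieval accuracy of the outer-product memory from the earlier sketch; the point where accuracy collapses approximates the capacity. This is an illustrative protocol with placeholder sizes, not the paper's exact evaluation pipeline, and the precise threshold depends on constants and logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_memory_accuracy(N, d, vocab=256):
    """Store N random associations in a d x d outer-product memory and measure recall."""
    keys = rng.standard_normal((N, d)) / np.sqrt(d)
    value_embed = rng.standard_normal((vocab, d)) / np.sqrt(d)
    targets = rng.integers(0, vocab, size=N)
    W = value_embed[targets].T @ keys             # sum of outer products, vectorized
    logits = keys @ W.T @ value_embed.T
    return float(np.mean(logits.argmax(axis=1) == targets))

d = 128
for N in (500, 2000, 8000, 32000):                # sweep N toward and past d**2 = 16384
    print(f"N={N:6d}  N/d^2={N / d**2:5.2f}  accuracy={linear_memory_accuracy(N, d):.3f}")
```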

KEY INSIGHT

The Counterintuitive Finding

Understanding Factual Recall in Transformers via Associative Memories shows that shallow transformers can achieve 100 percent factual recall using either attention or MLP parameters alone, as long as they scale linearly with S·R.

This is surprising because many expect factual knowledge to reside mainly in attention value matrices, but the results show MLP layers can fully substitute as associative memories.

WHY IT MATTERS

What this unlocks for the field

Understanding Factual Recall in Transformers via Associative Memories gives builders a principled way to size self-attention and MLP blocks to hit desired factual storage targets with near-optimal parameter efficiency.

Armed with these associative memory constructions and lower bounds, practitioners can design shallow or specialized transformers that reliably memorize large fact tables without overbuilding depth or relying on opaque scaling heuristics.
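
As a rough illustration of that sizing exercise, the helper below checks whether an attention budget of roughly 4·H·d·d_h parameters or an MLP budget of m·d parameters clears a linear-in-S·R threshold. The slack constant and the log factor are placeholders standing in for the unspecified constants and polylog terms, not values from the paper.

```python
import math

def can_store_facts(S, R, d, H=0, d_head=0, m=0, slack=4.0):
    """Rough check of the scaling condition: 4*H*d*d_head or m*d >= slack * S*R * log(S*R).

    `slack` stands in for the unspecified constant and polylog factor; it is a
    placeholder, not a value from the paper.
    """
    facts = S * R
    budget_needed = slack * facts * max(1.0, math.log(facts))
    attn_params = 4 * H * d * d_head
    mlp_params = m * d
    return attn_params >= budget_needed or mlp_params >= budget_needed

# Example: 10,000 subjects x 50 relations = 500k facts.
print(can_store_facts(S=10_000, R=50, d=1024, m=65_536))          # MLP-only budget
print(can_store_facts(S=10_000, R=50, d=1024, H=16, d_head=64))   # attention-only budget
```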


