Understanding Transformer from the Perspective of Associative Memory

Authors: Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi

2025

TL;DR

Understanding Transformer from the Perspective of Associative Memory shows that viewing attention and FFNs as kernelized associative memories explains Softmax’s high-capacity retrieval and motivates the more expressive DeltaFormer update rule.



THE PROBLEM

Transformers Forget Under Long Contexts and Suffer From Unstable Updates

Understanding Transformer from the Perspective of Associative Memory shows that a linear associative memory has inverse retrieval SNR SNR^{-1}_Linear ≈ N·dk^{-1}, so retrieval degrades quickly once the number of stored pairs N far exceeds dk.

This means long-context Linear Attention underperforms on precise retrieval, while unregularized updates like St = St−1 + vt kt⊤ let the spectral norm of St grow without bound, leading to numerical instability.
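As a rough illustration of both failure modes, here is a minimal NumPy sketch (not from the paper; the dimensions and key/value statistics are arbitrary choices for the demo) that accumulates the linear memory St = St−1 + vt kt⊤ and measures how retrieval interference and the spectral norm of S grow with N:

import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 64, 64

def linear_memory_stats(N):
    # Random roughly-unit keys and values; S accumulates sum_i v_i k_i^T.
    K = rng.standard_normal((N, d_k)); K /= np.linalg.norm(K, axis=1, keepdims=True)
    V = rng.standard_normal((N, d_v)); V /= np.linalg.norm(V, axis=1, keepdims=True)
    S = V.T @ K                                # same result as iterating S += outer(v_t, k_t)
    r = S @ K[0]                               # retrieve the value stored under key 0
    noise = r - V[0] * (K[0] @ K[0])           # interference from the other N-1 pairs
    return np.sum(noise**2) / np.sum(V[0]**2), np.linalg.norm(S, 2)

for N in (16, 64, 256, 1024):
    inv_snr, spec_norm = linear_memory_stats(N)
    print(f"N={N:5d}  inverse SNR ~ {inv_snr:7.2f}  (N/d_k = {N/d_k:6.2f})  ||S||_2 ~ {spec_norm:6.1f}")

The measured inverse SNR tracks N/dk, and the spectral norm keeps growing with every write, which is the instability the paper highlights.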

HOW IT WORKS

Associative Memory View of Attention and FFN

Understanding Transformer from the Perspective of Associative Memory treats Softmax Attention, Linear Attention, FFN, and DeltaNet as associative memories defined by a memory matrix S and an associative map fS.

In an analogy to the brain’s hippocampus and cortex, Understanding Transformer from the Perspective of Associative Memory casts Softmax Attention as short-term contextual memory and the FFN as compressed long-term memory built on a ReLU kernel.

This associative-memory lens leads Understanding Transformer from the Perspective of Associative Memory to DeltaFormer, which combines Softmax’s exponential kernel with DeltaNet’s delta-rule updates and achieves higher expressivity than a standard context-window Transformer.
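A small, hedged sketch of that reading operation (the dimensions, the 1/√dk scaling, and the toy query are this example’s assumptions, not the paper’s setup): Softmax Attention retrieves by scoring the query against every stored key with an exponential kernel, while Linear Attention reads from the single compressed matrix S.

import numpy as np

rng = np.random.default_rng(1)
N, d_k, d_v = 256, 64, 64
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_v))
q = K[7] + 0.05 * rng.standard_normal(d_k)     # query pointing at the 8th stored pair

# Softmax Attention as exp-kernel associative retrieval over the full context.
scores = np.exp(K @ q / np.sqrt(d_k))
softmax_read = (scores / scores.sum()) @ V

# Linear Attention as a read f_S(q) = S q from the compressed memory S = sum_i v_i k_i^T.
S = V.T @ K
linear_read = S @ q

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("cosine to the stored value  softmax:", round(cos(softmax_read, V[7]), 3),
      "  linear:", round(cos(linear_read, V[7]), 3))

With N well above dk, the exp-kernel read stays close to the stored value while the linear read is dominated by interference, matching the capacity story above.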

DIAGRAM

Memory Capacity and Kernel Effects in Understanding Transformer from the Perspective of Associative Memory

This diagram shows how Understanding Transformer from the Perspective of Associative Memory analyzes retrieval SNR for different kernels to explain memory capacity.

DIAGRAM

Memory Update Framework and DeltaFormer Design

This diagram shows how Understanding Transformer from the Perspective of Associative Memory unifies memory updates and derives DeltaFormer from different At, Bt, Ct choices.

PROCESS

How Understanding Transformer from the Perspective of Associative Memory Handles Associative Memory Analysis

  1. 01

    Associative Memory Formalization

    Understanding Transformer from the Perspective of Associative Memory defines St = Σ vi ki⊤ and an associative map fS(q) = Sϕ(q) to model storage and retrieval.

  2. 02

    Memory Capacity via SNR

    Understanding Transformer from the Perspective of Associative Memory derives inverse SNR for the Linear, Exp, and ReLU kernels to quantify how many key-value pairs can be reliably stored.

  3. 03

    Unified Memory Update St = AtSt−1Bt + Ct

    Understanding Transformer from the Perspective of Associative Memory expresses Linear Attention, Gated Linear Attention, DeltaNet, and Softmax Attention as specific At, Bt, Ct choices with explicit objectives Lt(St−1); a code sketch of these choices follows this list.

  4. 04

    DeltaFormer Construction and Expressivity

    Understanding Transformer from the Perspective of Associative Memory introduces DeltaFormer by combining Softmax’s kernel with DeltaNet’s delta rule and proves DeltaFormer can track swaps among n elements, a problem beyond TC0 that sits at the NC1 level.
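As referenced in step 3 above, here is a hedged NumPy sketch of the unified update St = At St−1 Bt + Ct; the β and gate values are illustrative assumptions, and Gated Linear Attention is shown with a simplified scalar gate.

import numpy as np

d_k, d_v = 8, 8
I = np.eye(d_k)

def step_linear(S, k, v):
    # Linear Attention: A_t = I, B_t = I, C_t = v_t k_t^T (pure accumulation).
    return S + np.outer(v, k)

def step_gated(S, k, v, gate=0.9):
    # Gated Linear Attention (scalar-gate simplification): decay old memory, then write.
    return gate * S + np.outer(v, k)

def step_deltanet(S, k, v, beta=1.0):
    # DeltaNet: B_t = I - beta k_t k_t^T erases what key k_t currently retrieves,
    # C_t = beta v_t k_t^T writes the new value (the delta rule).
    return S @ (I - beta * np.outer(k, k)) + beta * np.outer(v, k)

rng = np.random.default_rng(2)
S = np.zeros((d_v, d_k))
for _ in range(5):
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    S = step_deltanet(S, k, v)
print("delta-rule read-back error for the latest pair:", np.linalg.norm(S @ k - v))

With unit-norm keys and β = 1, the delta rule returns exactly the most recently written value for its key, which is the erase-then-write behavior the framework isolates in At, Bt, Ct.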

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Retrieval SNR Analysis for Attention and FFN

    Understanding Transformer from the Perspective of Associative Memory derives SNR^{-1}_Linear ≈ N·dk^{-1}, SNR^{-1}_ReLU ≈ N·(2dk)^{-1}, and SNR^{-1}_exp ≈ N / exp(2(τ−1)dk/τ^2), explaining why Softmax Attention supports larger effective context.

  • 02

    Unified Memory Update Framework

    Understanding Transformer from the Perspective of Associative Memory introduces St = AtSt−1Bt + Ct and maps Linear Attention, Gated Linear Attention, DeltaNet, and Softmax Attention to concrete At, Bt, Ct and Lt(St−1).

  • 03

    DeltaFormer and Expressivity Beyond TC0

    Understanding Transformer from the Perspective of Associative Memory defines DeltaFormer with ut = vt − Σi<t exp(ki⊤ kt)ui and proves DeltaFormer can track n elements for n ≥ 5, giving NC1-level expressivity.
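A hedged sketch of that construction (the 1/√dk scaling and the softmax readout over the transformed values ui are assumptions of this example, not quoted from the paper):

import numpy as np

def deltaformer_layer(Q, K, V):
    # Write phase: u_t = v_t - sum_{i<t} exp(k_i^T k_t) u_i (delta-style value transform).
    T, d_k = K.shape
    U = np.zeros_like(V)
    for t in range(T):
        w = np.exp(K[:t] @ K[t] / np.sqrt(d_k))   # exp-kernel match against earlier keys (empty at t=0)
        U[t] = V[t] - w @ U[:t]                   # subtract what those keys already store
    # Read phase: ordinary softmax (exp-kernel) attention over the transformed values.
    scores = np.exp(Q @ K.T / np.sqrt(d_k))
    scores /= scores.sum(axis=1, keepdims=True)
    return scores @ U

rng = np.random.default_rng(3)
T, d_k, d_v = 6, 8, 8
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
print(deltaformer_layer(Q, K, V).shape)   # (6, 8)

The sequential dependence of ut on all earlier ui is what gives DeltaFormer its extra expressive power relative to a purely parallel attention layer.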

RESULTS

By the Numbers

Inverse SNR Linear

SNR^{-1}_Linear ≈ N·dk^{-1}

Baseline capacity scaling for Linear Attention

Inverse SNR ReLU

SNR^{-1}_ReLU ≈ N·(2dk)^{-1}

Halves noise versus Linear Attention for same dk

Inverse SNR Exp

SNR^{-1}_exp ≈ N / exp(2(τ−1)dk/τ^2)

Reduces required dk from O(N) to O(log^2 N)

SoLU Inverse SNR

SNR^{-1}_SoLU ≈ 5N·(dk)^{-1}·exp(−2√dk)

Explains SoLU’s higher monosemanticity than ReLU

On synthetic analyses and toy tasks, Understanding Transformer from the Perspective of Associative Memory shows how different kernels change retrieval-noise scaling, and demonstrates that DeltaFormer tracks element swaps and solves DAG reachability with 100% accuracy in settings where a comparable Transformer struggles.
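A rough empirical check of the noise-scaling story above (illustrative only; unit-variance key coordinates and the temperature τ = 2 are assumptions of this sketch): compare how much interference the linear kernel and the exp kernel leave when retrieving one of N stored items.

import numpy as np

rng = np.random.default_rng(5)
d_k, N, tau = 64, 4096, 2.0
K = rng.standard_normal((N, d_k))      # ||k||^2 ~ d_k under this sketch's assumptions
q = K[0]                               # query the first stored key exactly

sims = K @ q                           # raw similarities k_i . q

def inv_snr(kernel_scores):
    signal, noise = kernel_scores[0], kernel_scores[1:]
    return np.sum(noise**2) / signal**2

print("linear kernel inverse SNR ~", round(inv_snr(sims), 1), "  (theory: N/d_k =", N // d_k, ")")
print("exp kernel    inverse SNR ~", inv_snr(np.exp(sims / tau)), " (many orders of magnitude smaller)")

The linear kernel's interference sits near N/dk, while exponentiating the same similarities suppresses it by many orders of magnitude, which is the qualitative gap the table above describes.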

BENCHMARK

Qualitative Comparison of Kernel Retrieval Noise in Understanding Transformer from the Perspective of Associative Memory

Relative inverse SNR scaling SNR^{-1} as a function of N and dk for different kernels.

KEY INSIGHT

The Counterintuitive Finding

Understanding Transformer from the Perspective of Associative Memory argues that FFNs deliberately use a lower-precision ReLU kernel to encourage superposition, even though Exp kernels give far better SNR.

This is surprising because practitioners typically assume more precise kernels are always better, but Understanding Transformer from the Perspective of Associative Memory shows that higher-precision kernels can reduce compressed knowledge density and suppress the polysemantic, superposed representations that FFNs rely on.
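A toy illustration of this tradeoff (in the spirit of superposition toy models; the setup is this example’s assumption, not the paper’s experiment): with an FFN-style kernel relu(ki⊤x), the negative half of the interference terms is clipped, which is where the factor of 2 in SNR^{-1}_ReLU ≈ N·(2dk)^{-1} comes from, while far more facts than dimensions still share the same space.

import numpy as np

rng = np.random.default_rng(6)
d_k, N = 64, 512                                   # many more stored "facts" than dimensions
K = rng.standard_normal((N, d_k)) / np.sqrt(d_k)   # random, roughly unit-norm key directions

lin_noise, relu_noise = [], []
for j in range(N):
    scores = K @ K[j]                              # similarity of fact j's key to every stored key
    noise = np.delete(scores, j)                   # interference from the other N-1 facts
    lin_noise.append(np.sum(noise**2))             # linear kernel keeps all of it
    relu_noise.append(np.sum(np.maximum(noise, 0.0)**2))   # ReLU clips the negative half

print("mean interference energy  linear:", round(float(np.mean(lin_noise)), 2),
      "  relu:", round(float(np.mean(relu_noise)), 2), "  (roughly half)")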

WHY IT MATTERS

What this unlocks for the field

By unifying attention and FFNs as kernelized associative memories, Understanding Transformer from the Perspective of Associative Memory lets designers reason explicitly about memory capacity, update stability, and expressivity.

Builders can now systematically mix kernels, gating, and delta-rule updates, as in DeltaFormer, to design Transformers with targeted retrieval properties and expressivity that provably exceeds the TC0 limit of standard architectures.


