Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Authors: Bao Pham, Gabriel Raya, Matteo Negri et al.

arXiv 2025

TL;DR

Memorization to Generalization uses an associative-memory energy view of diffusion models to show spurious attractors emerge as the first step toward generalization.



THE PROBLEM

Diffusion models memorize training data instead of generalizing

Memorization to Generalization notes that diffusion models can replicate training samples, raising privacy and security concerns and leaving their generalization behavior unclear.

Memorization to Generalization targets high-dimensional generative modeling, where uncontrolled memorization risks data leakage and obscures how creative generalization actually emerges.

HOW IT WORKS

Diffusion as Dense Associative Memory energy dynamics

Memorization to Generalization connects the diffusion-model energy E_DM(x, t) to the DenseAM energy E_AM(x), treating training samples as memories and generated samples as attractor states in a shared energy landscape.

You can think of Memorization to Generalization as turning a diffusion model into a content-addressable memory, where noisy queries relax into stored or emergent patterns rather than simply being denoised by an opaque network.

This DenseAM view lets Memorization to Generalization explain how spurious attractors with sizable basins appear before a smooth low-energy manifold forms, something a black-box view of the score network alone cannot reveal.
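To make the associative-memory picture concrete, here is a minimal sketch (not the paper's code) of the log-sum-exp energy restated under Key Contributions below, together with content-addressable retrieval by gradient descent on it; the memory patterns, step size, and noise level are illustrative choices.

```python
# Minimal sketch of the DenseAM reading of a diffusion model: training samples
# act as stored memories xi^mu, and a noisy query relaxes into an attractor by
# descending E(x) = -2*sigma^2 * log sum_mu exp(-||x - xi^mu||^2 / (2*sigma^2)).
import numpy as np

def energy(x, memories, sigma):
    """Diffusion/DenseAM-style energy of a query x given memories of shape (K, d)."""
    logits = -np.sum((memories - x) ** 2, axis=1) / (2.0 * sigma**2)
    m = logits.max()                                    # stabilized log-sum-exp
    return -2.0 * sigma**2 * (m + np.log(np.exp(logits - m).sum()))

def retrieve(query, memories, sigma, steps=300, lr=0.05):
    """Content-addressable retrieval: gradient descent on the energy."""
    x = np.array(query, dtype=float)
    for _ in range(steps):
        logits = -np.sum((memories - x) ** 2, axis=1) / (2.0 * sigma**2)
        w = np.exp(logits - logits.max())
        w /= w.sum()                                    # softmax attention over memories
        x -= lr * 2.0 * (x - w @ memories)              # grad E = 2 * (x - weighted mean)
    return x

# Toy usage: three stored 2-D "memories"; a corrupted query falls into a basin.
rng = np.random.default_rng(0)
memories = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query = memories[0] + 0.4 * rng.standard_normal(2)
print(retrieve(query, memories, sigma=0.3), energy(query, memories, sigma=0.3))
```

With the memories far apart relative to σ, each one is its own attractor; as they crowd together, minima can blend, which is where spurious states enter the story.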

DIAGRAM

Memorization–spurious–generalization transition in the energy landscape

This diagram shows how Memorization to Generalization conceptualizes the evolution of diffusion energy landscapes from memorized minima to spurious attractors to a continuous low energy manifold as training size increases.

DIAGRAM

Evaluation pipeline for detecting memorized, spurious, and generalized samples

This diagram shows how Memorization to Generalization constructs training and synthetic sets, computes nearest neighbor histograms, and classifies samples into memorized, spurious, and generalized types.
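A hypothetical reading of that pipeline in code follows; the thresholds δ_m and δ_s and the exact roles of the training set S and held-out set S′ are assumptions for illustration, not the paper's precise definitions of M, S, and G.

```python
# Hypothetical sketch of nearest-neighbor classification into memorized,
# generalized, and spurious samples (thresholds and label rules are assumed).
import numpy as np

def nn_dist(samples, reference):
    """Distance from each sample to its nearest neighbor in a reference set."""
    diffs = samples[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def classify(samples, S, S_prime, delta_m, delta_s):
    d_train = nn_dist(samples, S)           # distance to the training set S
    d_holdout = nn_dist(samples, S_prime)   # distance to the held-out set S'
    labels = np.full(len(samples), "spurious", dtype=object)
    labels[d_train < delta_m] = "memorized"                              # near-copies of training data
    labels[(d_train >= delta_m) & (d_holdout < delta_s)] = "generalized" # novel but on-distribution
    return labels

# Usage: histogram the labels for a batch of synthetic samples.
rng = np.random.default_rng(0)
S, S_prime = rng.standard_normal((50, 2)), rng.standard_normal((50, 2))
synthetic = rng.standard_normal((200, 2))
labels = classify(synthetic, S, S_prime, delta_m=0.1, delta_s=0.1)
print(dict(zip(*np.unique(labels, return_counts=True))))
```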

PROCESS

How Memorization to Generalization handles the memorization-to-generalization transition

  1. 01

    Forward process SDE

Memorization to Generalization applies the stochastic differential equation dx_t = f(x_t, t) dt + g(t) dw_t with f(x_t, t) = 0 and g(t) = σ to corrupt training samples.

  2. 02

    Score matching training

Memorization to Generalization trains the score network s_θ(x_t, t) to approximate ∇_{x_t} log p_t(x_t), implicitly learning the DenseAM-style energy E_DM(x_t, t).

  3. 03

    Reverse process sampling

Memorization to Generalization runs the reverse SDE dx_t = [f(x_t, t) − g(t)² ∇_{x_t} log p_t(x_t)] dt + g(t) dw_t to perform associative-memory-style retrieval or generation (a toy numerical sketch of steps 01–03 follows this list).

  4. 04

    Sample classification metrics

Memorization to Generalization classifies outputs using M(x̂, S), S(x̂, S, S′), and G(x̂, S, S′) based on nearest-neighbor distances and thresholds δ_m and δ_s.
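The sketch below walks through steps 01–03 on toy data, using the closed-form score of the Gaussian-smoothed empirical distribution in place of a trained network s_θ; the constant noise scale and Euler–Maruyama discretization are simplifying assumptions, not the paper's setup.

```python
# Minimal sketch of forward noising and reverse-SDE sampling. The trained score
# network s_theta is replaced by the closed-form score of the smoothed empirical
# distribution; f(x, t) = 0 and g(t) = sigma, so the noise std at time t is sigma*sqrt(t).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                    # g(t) = sigma (variance-exploding SDE)

def noise_std(t):
    return sigma * np.sqrt(t)                  # std of x_t given x_0 under dx_t = sigma dw_t

def score(x, memories, t):
    """Closed-form grad_x log p_t(x) when p_t smooths the training set with noise_std(t)."""
    s_t = noise_std(t)
    logits = -np.sum((memories - x) ** 2, axis=1) / (2.0 * s_t**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (w @ memories - x) / s_t**2

def forward(x0, t):
    """Step 01: corrupt a clean sample x0 up to time t."""
    return x0 + noise_std(t) * rng.standard_normal(x0.shape)

def reverse_sample(memories, dim=2, T=1.0, n_steps=500):
    """Step 03: Euler-Maruyama integration of the reverse SDE from noise back to data."""
    dt = T / n_steps
    x = noise_std(T) * rng.standard_normal(dim)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = sigma**2 * score(x, memories, t)   # the -g(t)^2 * score term, sign flipped by reverse-time integration
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(dim)
    return x

# Usage: eight toy 2-D "training samples"; sampling lands near the stored patterns.
memories = rng.standard_normal((8, 2))
print(forward(memories[0], t=0.5), reverse_sample(memories))
```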

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Energy link between diffusion models and DenseAMs

Memorization to Generalization derives E_DM(x_t, t) = −2σ_t² log ∑_μ exp(−∥x_t − ξ^μ∥² / (2σ_t²)) and shows its direct correspondence to the DenseAM energy E_AM(x) with β = 1 / (2σ_t²).

  • 02

    Discovery of spurious states in diffusion models

    Memorization to Generalization empirically identifies spurious attractors with sizable basins that appear at the memorization–generalization boundary across MNIST, FASHION-MNIST, CIFAR10, and LSUN-CHURCH.

  • 03

    Geometric characterization via basins and curvature

Memorization to Generalization measures basin volumes V(x̂, x_tc) and curvature spectra from the score Jacobian, showing memorized > spurious > generalized in both log basin volume and curvature, and extends this analysis to Stable Diffusion; a small numerical sketch of the curvature probe follows this list.
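The curvature probe can be illustrated numerically as follows, again under the closed-form-score assumption used earlier; the finite-difference Jacobian and the toy memories are stand-ins for the paper's measurement protocol.

```python
# Sketch: curvature spectrum of an attractor via the score Jacobian, estimated
# by central finite differences around a converged sample (illustrative only).
import numpy as np

def score(x, memories, sigma_t):
    logits = -np.sum((memories - x) ** 2, axis=1) / (2.0 * sigma_t**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (w @ memories - x) / sigma_t**2

def score_jacobian(x, memories, sigma_t, eps=1e-4):
    """Finite-difference Jacobian of the score; its spectrum reflects basin curvature."""
    d = x.shape[0]
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (score(x + e, memories, sigma_t) - score(x - e, memories, sigma_t)) / (2.0 * eps)
    return J

# Two well-separated memories: evaluated at a memory, both singular values are
# large (a sharp basin). Near-zero singular values would instead signal flat
# directions along a generalized low-energy manifold.
memories = np.array([[1.0, 0.0], [-1.0, 0.0]])
sv = np.linalg.svd(score_jacobian(memories[0], memories, sigma_t=0.2), compute_uv=False)
print(sv)
```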

RESULTS

By the Numbers

Basin log volume

memorized > spurious > generalized

memorized samples show the largest basin log volumes, with a clear gap over generalized samples

Curvature singular values

memorized > spurious > generalized

memorized states show fewer near-zero singular values than generalized states

Spurious fraction peak

single peak at critical K

spurious state fraction rises then falls as K crosses memory capacity

Toy circle K values

K = 2, 9, 1000

K = 2 memorizes, K = 9 shows spurious states, K = 1000 fully generalizes

Memorization to Generalization evaluates across MNIST, FASHION-MNIST, CIFAR10, LSUN-CHURCH, a 2D unit circle toy model, and Stable Diffusion, showing consistent three-phase behavior and distinct geometric signatures for memorized, spurious, and generalized states.
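A hypothetical recreation of the unit-circle toy is sketched below; the fixed-σ energy relaxation, the thresholds, and the noise level are illustrative assumptions, so the exact counts will not match the paper, but with these settings small K tends to reproduce stored points while large K tends to land on the circle away from any single training sample.

```python
# Hypothetical unit-circle toy (illustrative parameters, not the paper's setup):
# K memories on the circle; samples are random queries relaxed on the
# log-sum-exp energy at a fixed noise level sigma, then labeled by distance
# to the nearest memory (delta_m) and to the true circle (delta_s).
import numpy as np

rng = np.random.default_rng(0)

def relax(x, memories, sigma, steps=300, lr=0.05):
    """Gradient descent on E(x) = -2*sigma^2 * logsumexp(-||x - xi||^2 / (2*sigma^2))."""
    for _ in range(steps):
        logits = -np.sum((memories - x) ** 2, axis=1) / (2.0 * sigma**2)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        x = x - lr * 2.0 * (x - w @ memories)
    return x

def label_counts(K, sigma=0.2, n_samples=200, delta_m=0.01, delta_s=0.05):
    angles = rng.uniform(0.0, 2.0 * np.pi, K)
    memories = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # K points on the circle
    counts = {"memorized": 0, "generalized": 0, "spurious": 0}
    for _ in range(n_samples):
        x = relax(rng.standard_normal(2), memories, sigma)
        d_train = np.sqrt(((memories - x) ** 2).sum(axis=1)).min()  # nearest stored pattern
        d_manifold = abs(np.linalg.norm(x) - 1.0)                   # distance to the true circle
        if d_train < delta_m:
            counts["memorized"] += 1
        elif d_manifold < delta_s:
            counts["generalized"] += 1
        else:
            counts["spurious"] += 1
    return counts

# Whether spurious blends appear at intermediate K depends on sigma and the
# thresholds; the paper uses the full reverse diffusion process rather than
# this fixed-sigma relaxation.
for K in (2, 9, 1000):
    print(K, label_counts(K))
```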

BENCHMARK

Fractions of memorized, spurious, and generalized samples across training sizes

Qualitative distribution of sample types in Memorization to Generalization as training size K increases.

KEY INSIGHT

The Counterintuitive Finding

Memorization to Generalization finds that spurious states, long seen as harmful artifacts in associative memory, are actually the first signs of generative capability.

This is surprising because classical Hopfield network theory treats spurious attractors as retrieval failures, but Memorization to Generalization shows they are precisely where diffusion creativity begins.

WHY IT MATTERS

What this unlocks for the field

Memorization to Generalization gives practitioners an energy based lens and concrete diagnostics to locate memorized, spurious, and generalized regimes in diffusion models.

With this, builders can tune dataset size, architecture, and training to avoid privacy-harming memorization while intentionally targeting the spurious-to-generalized transition where useful novelty emerges.


