Hierarchical Associative Memory

Author: Dmitry Krotov

arXiv 2021

TL;DR

Hierarchical Associative Memory uses layer‑wise Lagrangian‑defined activations and symmetric feedback weights to build deep Modern Hopfield Networks with a global energy that guarantees convergence to fixed‑point attractors.

THE PROBLEM

Associative memories with only one hidden layer and dense connectivity

Classical Hopfield Networks have a small memory storage capacity that scales only linearly with the number of input features, which limits their usefulness.

Dense Associative Memories before Hierarchical Associative Memory had only one hidden layer and dense connectivity, restricting their representational richness and the inductive biases they can encode for machine learning tasks.

HOW IT WORKS

Hierarchical Associative Memory — layered Lagrangians and symmetric feedback

Hierarchical Associative Memory combines Lagrangian functions, hierarchical layered networks, symmetric feedforward and feedback weights, and hierarchical time scales to build deep recurrent associative memories with a global energy.

You can think of Hierarchical Associative Memory like a multi‑level Hopfield pyramid, where lower layers store reusable primitives and higher layers act as assembly rules, similar to RAM blocks orchestrated by a controller.

This Lagrangian‑based construction lets Hierarchical Associative Memory implement collective activations such as softmax attention, as well as convolutional layers, while still guaranteeing energy descent and retrieval of memories as fixed points.
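
To make the Lagrangian idea concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of two common Lagrangian choices and a numerical check that each activation is the gradient of its Lagrangian, the relationship g_i = ∂L/∂x_i that the whole construction relies on. The specific Lagrangians and the small test vector are assumptions made for the example.

```python
import numpy as np

# Two example Lagrangians and the activations they induce via g_i = dL/dx_i.
# These particular choices are illustrative; the formalism admits any
# Lagrangian whose Hessian is positive semi-definite.

def L_logcosh(x):
    # Per-neuron Lagrangian: L(x) = sum_i log cosh(x_i)  ->  g(x) = tanh(x)
    return np.sum(np.log(np.cosh(x)))

def L_logsumexp(x):
    # Collective Lagrangian: L(x) = log sum_i exp(x_i)  ->  g(x) = softmax(x)
    return np.log(np.sum(np.exp(x)))

def numerical_grad(L, x, eps=1e-6):
    # Central-difference gradient, used only to verify g = dL/dx numerically.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (L(x + d) - L(x - d)) / (2 * eps)
    return g

x = np.random.default_rng(0).normal(size=5)   # small test vector (assumed size)
softmax = np.exp(x - x.max()) / np.exp(x - x.max()).sum()

print(np.allclose(np.tanh(x), numerical_grad(L_logcosh, x)))   # True
print(np.allclose(softmax, numerical_grad(L_logsumexp, x)))    # True
```

The log-sum-exp choice is the one that produces softmax as a collective activation; any Lagrangian with a positive semi-definite Hessian can be dropped into the same recipe.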

DIAGRAM

Dynamical evolution and energy descent in Hierarchical Associative Memory

This diagram shows how neuron states in Hierarchical Associative Memory evolve over time under the layered dynamical equations while the global energy monotonically decreases.
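
In equations, the behavior the diagram depicts can be summarized as follows (a compact restatement of the layered model in the spirit of the paper's notation, with layer index A, per-layer Lagrangians L^A and activations g^A, inter-layer weights ξ^(A), and time constants τ_A; the missing-layer terms at the first and last layers are simply dropped):

```latex
% Layer A: states x^A, Lagrangian L^A, activations g^A, time constant tau_A.
\tau_A \frac{d x_i^{A}}{dt}
  = \sum_j \xi^{(A-1)}_{ij}\, g_j^{A-1}
  + \sum_j \xi^{(A)}_{ji}\, g_j^{A+1}
  - x_i^{A},
\qquad
g_i^{A} = \frac{\partial L^{A}}{\partial x_i^{A}}

% Global energy whose descent the diagram shows:
E = \sum_{A}\Big[\sum_i x_i^{A}\, g_i^{A} - L^{A}\Big]
  - \sum_{A}\sum_{i,j} \xi^{(A)}_{ij}\, g_i^{A+1}\, g_j^{A}
```

Along these equations the energy is non-increasing, which is the monotone descent the diagram illustrates.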

DIAGRAM

Architectural variants of Hierarchical Associative Memory

This diagram shows how Hierarchical Associative Memory instantiates one‑layer, two‑layer dense, and convolutional architectures using the same Lagrangian formalism.
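
One way to see why a convolutional layer fits the same formalism is that the energy only needs a bottom-up coupling together with its exact adjoint for the top-down signal; for convolutions that adjoint pair exists and shares a single kernel, which is the "symmetric weights" requirement in disguise. The snippet below is a minimal single-channel check of this adjointness (an illustration with assumed sizes and a random kernel, not the paper's architecture; channels and pooling are omitted).

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

# The layered energy only needs a coupling term <g_y, W g_x> plus its adjoint
# W^T for the top-down signal.  For a convolutional layer that pair is
# cross-correlation (bottom-up) and full convolution (top-down) with one
# shared kernel.
rng = np.random.default_rng(0)
g_x = rng.normal(size=(10, 10))   # activations of the lower (image) layer
K = rng.normal(size=(3, 3))       # shared kernel (assumed single channel)

bottom_up = correlate2d(g_x, K, mode="valid")   # shape (8, 8)
g_y = rng.normal(size=bottom_up.shape)          # higher-layer activations
top_down = convolve2d(g_y, K, mode="full")      # shape (10, 10)

# Adjointness check: <g_y, K * g_x> == <K^T * g_y, g_x>, i.e. the same kernel
# carries signals both ways, as the symmetric energy requires.
print(np.allclose(np.sum(g_y * bottom_up), np.sum(top_down * g_x)))  # True
```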

PROCESS

How Hierarchical Associative Memory handles a self-supervised denoising task

  1. Present noisy input

    Hierarchical Associative Memory initializes the input layer x with a noisy pattern, while the higher layers y and z start from zero activities.

  2. Bottom-up propagation

    Using the dynamical equations with weight matrices Ξ and Ψ, Hierarchical Associative Memory propagates signals from x to y and z through Lagrangian-defined activations (see the sketch after this list).

  3. Top-down feedback

    Symmetric feedback weights send signals from higher layers back to lower layers, letting primitives in x be refined by assembled patterns in y and z.

  4. Converge to a fixed point

    Under hierarchical time scales τ1 ≫ τ2 ≫ τ3, Hierarchical Associative Memory descends the global energy until it reaches a fixed point that denoises the input.
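
The four steps above can be simulated directly. Below is a minimal NumPy sketch of the three-layer dynamics (an illustrative toy rather than the paper's code: the layer sizes, the log-cosh and log-sum-exp Lagrangian choices, the random matrices Ξ and Ψ, the time constants, and the Euler step size are all assumptions). With random weights it will not perform meaningful denoising, but it shows the bottom-up and top-down updates and lets you verify that the global energy does not increase along the trajectory.

```python
import numpy as np

rng = np.random.default_rng(1)

# Layer sizes and weights (random here purely to illustrate the dynamics;
# in a trained network Xi and Psi would encode the stored patterns).
n_x, n_y, n_z = 32, 16, 8
Xi = rng.normal(size=(n_y, n_x)) / np.sqrt(n_x)    # x <-> y weights
Psi = rng.normal(size=(n_z, n_y)) / np.sqrt(n_y)   # y <-> z weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def activations(x, y, z):
    # Assumed Lagrangians: log cosh per neuron for x and y (-> tanh),
    # log-sum-exp for z (-> softmax, a collective activation).
    return np.tanh(x), np.tanh(y), softmax(z)

def energy(x, y, z):
    g_x, g_y, g_z = activations(x, y, z)
    L_x = np.sum(np.log(np.cosh(x)))
    L_y = np.sum(np.log(np.cosh(y)))
    L_z = np.log(np.sum(np.exp(z)))
    return ((x @ g_x - L_x) + (y @ g_y - L_y) + (z @ g_z - L_z)
            - g_y @ Xi @ g_x - g_z @ Psi @ g_y)

# Step 1: noisy input in x, higher layers start at zero.
x = rng.normal(size=n_x)
y, z = np.zeros(n_y), np.zeros(n_z)

# Steps 2-4: integrate the layered dynamics (Euler) under tau1 >> tau2 >> tau3.
tau1, tau2, tau3, dt = 8.0, 2.0, 0.5, 0.05
energies = []
for _ in range(2000):
    g_x, g_y, g_z = activations(x, y, z)
    x = x + (dt / tau1) * (Xi.T @ g_y - x)               # top-down feedback into x
    y = y + (dt / tau2) * (Xi @ g_x + Psi.T @ g_z - y)   # bottom-up + top-down
    z = z + (dt / tau3) * (Psi @ g_y - z)                # bottom-up into z
    energies.append(energy(x, y, z))

print("energy start/end:", energies[0], energies[-1])
print("monotone non-increasing:",
      all(b <= a + 1e-9 for a, b in zip(energies, energies[1:])))
```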

KEY CONTRIBUTIONS

Key Contributions

  • 01

General formulation of Modern Hopfield Networks

    Hierarchical Associative Memory provides a fully connected Modern Hopfield Network where every neuron has its own activation function and kinetic time constant, defined via a global Lagrangian.

  • 02

    Hierarchical layered model of associative memory

    Hierarchical Associative Memory stacks multiple layers with dense or local connectivity, symmetric feedforward and feedback weights, and collective activations like softmax attention.

  • 03

    Energy functions and convergence conditions

Hierarchical Associative Memory derives explicit non-linear energy functions for layered architectures and states convergence conditions based on positive semi-definite Hessians.
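
For reference, the general fully connected formulation behind these contributions can be stated compactly (a paraphrase with simplified notation, not a verbatim quote; see the paper for the exact statement and its layered specialization):

```latex
\tau_i \frac{dx_i}{dt} = \sum_j \xi_{ij}\, g_j - x_i,
\qquad g_i = \frac{\partial L}{\partial x_i},
\qquad \xi_{ij} = \xi_{ji}

E = \sum_i x_i\, g_i - L(x) - \tfrac{1}{2}\sum_{i,j} \xi_{ij}\, g_i\, g_j,
\qquad
\frac{dE}{dt} = -\sum_{i,j} \frac{dx_i}{dt}\,
  \frac{\partial^2 L}{\partial x_i \partial x_j}\, \tau_j\, \frac{dx_j}{dt}
```

This time derivative is non-positive, and the energy therefore non-increasing, when the Hessian of the Lagrangian is positive semi-definite; in the layered case the Lagrangian is a sum of per-layer terms, so the Hessian is block-diagonal and each layer can carry its own time constant without spoiling the argument.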

RESULTS

By the Numbers

Memory storage capacity: super-linear scaling, decoupled from input dimensionality compared with classical Hopfield Networks.

Hidden layers: arbitrary depth, extending one-hidden-layer Dense Associative Memories.

Connectivity type: dense and local, adding convolutional and pooling variants to Modern Hopfield Networks.

Temporal dynamics: fixed points only; the global energy excludes limit cycles and chaotic behavior.

Hierarchical Associative Memory is a theoretical architecture paper without benchmark tables, so the main quantitative claims concern storage capacity scaling, allowable depth, and convergence guarantees rather than dataset scores.

BENCHMARK

Conceptual comparison: classical Hopfield, Dense Associative Memory, Hierarchical Associative Memory

Relative memory storage capacity scaling with number of feature neurons, as discussed in the introduction.
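
For orientation, the scaling results usually cited in this comparison are roughly the following (standard results from the Hopfield and Dense Associative Memory literature; the constants and exponents are quoted from that literature, not from a table on this page):

```latex
K_{\text{classical Hopfield}} \approx 0.14\, N_f, \qquad
K_{\text{DAM},\, F(x)=x^{n}} \propto N_f^{\,n-1}, \qquad
K_{\text{DAM},\, F(x)=e^{x}} \propto 2^{N_f/2}
```

Here K is the number of storable patterns and N_f the number of feature neurons.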

KEY INSIGHT

The Counterintuitive Finding

Hierarchical Associative Memory shows that even with collective activations like softmax attention, a global Lyapunov energy still exists without inverting activation functions.

This is surprising because previous energy-based formulations typically required neuron-wise nonlinearities with invertible activations, excluding many practical components such as softmax attention layers.
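
One way to make the contrast concrete (a schematic comparison, not an equation quoted from the paper): classical continuous-Hopfield energies contain an integral of the inverse activation, which only makes sense for invertible neuron-wise nonlinearities, while the Lagrangian form never needs an inverse.

```latex
E_{\text{classical}} \;\supset\; \sum_i \int_0^{g_i} f^{-1}(s)\, ds
\qquad\text{vs.}\qquad
E_{\text{Lagrangian}} \;\supset\; \sum_i x_i\, g_i - L(x)

\text{e.g. } L(x) = \log\!\sum_j e^{x_j}
\;\Rightarrow\; g = \mathrm{softmax}(x),
\text{ which has no neuron-wise inverse but still defines a valid energy.}
```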

WHY IT MATTERS

What this unlocks for the field

Hierarchical Associative Memory unlocks deep, convolutional, and attention-like associative memories that provably converge to fixed points under rich feedback.

Builders can now design biologically inspired recurrent architectures with reusable primitives and top down context, while retaining analytical energy functions and tractable training schemes.


