Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

AuthorsYibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam

2024

TL;DR

Do LLMs dream of elephants uses a latent concept association task to show that transformers store associative memory primarily in the value matrix, which can recall the correct token with arbitrarily low error under idealized conditions.


THE PROBLEM

Fact retrieval can be hijacked by harmless context changes

Do LLMs dream of elephants reports that GPT-2 changes its answer from Paris to Chicago when the prompt is extended with “The Eiffel Tower is not in Chicago.”

On the CounterFact fact-retrieval task, Do LLMs dream of elephants shows that repeatedly prepending “Do not think of {target_false}” sharply increases the Efficacy Score, meaning LLMs systematically favor false answers.
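
A minimal sketch of this measurement, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the prompt, target_true, and target_false strings stand in for one CounterFact record, and the hijack template mirrors the paper's “Do not think of {target_false}” prepend:

```python
# Sketch of context hijacking on a single CounterFact-style record.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_prob(prompt: str, target: str) -> float:
    """Probability of the first token of `target` right after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    first_id = tok(" " + target).input_ids[0]  # leading space for GPT-2 BPE
    return probs[first_id].item()

def hijacked(prompt: str, target_false: str, k: int) -> str:
    """Prepend k copies of the hijack sentence, as in the paper's setup."""
    return f"Do not think of {target_false}. " * k + prompt

prompt, target_true, target_false = "The Eiffel Tower is in", "Paris", "Chicago"
for k in range(6):  # k = 0..5 hijack prepends
    p = hijacked(prompt, target_false, k)
    flipped = next_token_prob(p, target_false) > next_token_prob(p, target_true)
    print(k, flipped)
```

Averaging the flipped flag over all CounterFact records gives the Efficacy Score for each prepend count k.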

HOW IT WORKS

Latent concept association with transformers

Do LLMs dream of elephants introduces latent concept association, then analyzes how a one-layer transformer’s self-attention layer, value matrix, and embedding matrix jointly solve this memory task.

Do LLMs dream of elephants treats the value matrix as a content-addressable memory table and the embeddings as addresses, in the spirit of Hopfield-style networks that store patterns as outer products in their weights.

Do LLMs dream of elephants shows that this design lets transformers aggregate noisy context statistics and then retrieve the correct token using associative memory, beyond what a plain context window or identity value matrix can do.
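
A toy NumPy illustration of that outer-product storage idea, with random near-orthogonal vectors playing the roles of addresses and stored patterns (an illustrative sketch, not the paper's construction):

```python
# Toy outer-product associative memory: `addresses` act like embeddings,
# `values` like the patterns the memory should recall.
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 10                        # embedding dim, number of stored pairs
addresses = rng.standard_normal((n, d)) / np.sqrt(d)
values = rng.standard_normal((n, d))

# Store every (address, value) pair as a rank-1 outer product, summed.
W = sum(np.outer(values[i], addresses[i]) for i in range(n))

# Content-addressable retrieval: a noisy address still recalls its value.
query = addresses[3] + 0.1 * rng.standard_normal(d)
recalled = W @ query
best = int(np.argmax(values @ recalled))  # nearest stored value
print(best)  # -> 3 with high probability when d >> n
```

Near-orthogonal addresses keep the cross-terms small, which is why retrieval tolerates noise in the query.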

DIAGRAM

Latent concept association data flow

This diagram shows how Do LLMs dream of elephants generates synthetic contexts from latent binary concepts and trains a one-layer transformer to predict the associated token.
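
A hedged sketch of such a generator; the exponential decay in π and the exclusion of z* from the π component are assumed instantiations of the paper's low-Hamming-distance bias, not its exact choices:

```python
# Illustrative generator for the latent concept association task: tokens
# are m-bit latent vectors; context tokens are drawn from the mixture
# p(z | z*) = w * pi(z | z*) + (1 - w) * Unif(Z).
import numpy as np

rng = np.random.default_rng(0)
m, L, w = 4, 64, 0.8                       # latent bits, context length, mixture weight
Z = np.array([[(t >> b) & 1 for b in range(m)] for t in range(2**m)])

def sample_context(z_star: int) -> np.ndarray:
    """Draw L context token ids conditioned on the target latent z*."""
    ham = np.abs(Z - Z[z_star]).sum(axis=1)  # Hamming distance to z*
    pi = np.exp(-2.0 * ham)                  # assumed low-distance bias
    pi[ham == 0] = 0.0                       # assume context never leaks z*
    pi /= pi.sum()
    p = w * pi + (1 - w) / len(Z)            # mixture with uniform noise
    return rng.choice(len(Z), size=L, p=p)   # token ids double as latents

z_star = 5                                   # the token the model must predict
context = sample_context(z_star)
print(context[:10], "-> label:", z_star)
```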

DIAGRAM

Context hijacking evaluation pipeline

This diagram shows how Do LLMs dream of elephants evaluates context hijacking on CounterFact by prepending hijack prompts and measuring Efficacy Score.

PROCESS

How Do LLMs dream of elephants handles latent concept association

  1. Latent concept association

    Do LLMs dream of elephants defines a space Z of binary latent vectors and maps each vector z to a token via the tokenizer, creating a latent concept space.

  2. Latent conditional distribution

    Do LLMs dream of elephants samples context tokens from the mixture p(z|z*) = ω·π(z|z*) + (1−ω)·Unif(Z), where π favors low-Hamming-distance neighbors of z*.

  3. Transformer network architecture

    Do LLMs dream of elephants applies a one-layer transformer f^L(x) with self-attention attn(W_E χ(x)) followed by the value matrix W_V and a tied-embedding readout W_E.

  4. Hypothetical associative memory model

    Do LLMs dream of elephants constructs W_V = Σ_t W_E(t) (Σ_{t'∈N_1(t)} W_E(t')^T) over Hamming-distance-1 neighborhoods N_1(t) and shows this achieves arbitrarily small error as the context length L grows; a toy version appears in the sketch after this list.
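
A self-contained toy version of that construction, with random near-orthogonal embeddings standing in for a trained W_E and self-attention idealized as a uniform average over the context (a sketch under these assumptions, not the paper's code):

```python
# Toy check of the hypothetical associative-memory value matrix.
import numpy as np

rng = np.random.default_rng(1)
m, d, L = 4, 512, 128                          # latent bits, embed dim, context len
V = 2**m
Z = np.array([[(t >> b) & 1 for b in range(m)] for t in range(V)])
E = rng.standard_normal((V, d)) / np.sqrt(d)   # rows stand in for W_E(t)

ham = np.abs(Z[:, None, :] - Z[None, :, :]).sum(-1)
N1 = ham == 1                                  # Hamming-distance-1 neighborhoods

# W_V associates each token t with the summed embeddings of its neighbors,
# matching the outer-product construction in step 4.
W_V = sum(np.outer(E[t], E[N1[t]].sum(axis=0)) for t in range(V))

# Context in the regime the construction targets: neighbors of the target,
# with attention idealized as a uniform average over the context.
z_star = 5
context = rng.choice(np.flatnonzero(N1[z_star]), size=L)
h = E[context].mean(axis=0)                    # idealized attention output
logits = E @ (W_V @ h)                         # tied-embedding readout
print(int(np.argmax(logits)) == z_star)        # -> True with high probability
```

With d large relative to the vocabulary, the target's readout accumulates roughly unit mass from its neighbors while every other token receives strictly less, which is the mechanism behind the vanishing-error guarantee.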

KEY CONTRIBUTIONS

Key Contributions

  • Context hijacking in LLMs

    Do LLMs dream of elephants systematically demonstrates context hijacking on GPT-2, LLaMA-2-7B, and Gemma using CounterFact, where prepending hijack text sharply increases the Efficacy Score of false answers.

  • Latent concept association task

    Do LLMs dream of elephants proposes latent concept association, where similarity between tokens is defined in a latent binary concept space and contexts are sampled from the conditional p(z|z*).

  • Associative memory in transformers

    Do LLMs dream of elephants proves that self-attention aggregates context statistics while the value matrix implements associative memory, and shows that trained embeddings develop a low-rank, Hamming-distance-aware geometry (probed in the sketch below).
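
A small diagnostic one could run on trained embeddings to probe that claimed geometry; embedding_geometry is a hypothetical helper, and the random matrix below is only a placeholder for a trained W_E from the synthetic task:

```python
# Group pairwise inner products of embeddings by Hamming distance between
# their latent codes, and inspect the singular values for low rank.
import numpy as np

def embedding_geometry(E: np.ndarray, Z: np.ndarray) -> None:
    ham = np.abs(Z[:, None, :] - Z[None, :, :]).sum(-1)
    gram = E @ E.T
    for h_dist in range(1, Z.shape[1] + 1):
        mask = ham == h_dist
        print(f"Hamming {h_dist}: mean <e_t, e_t'> = {gram[mask].mean():+.3f}")
    s = np.linalg.svd(E, compute_uv=False)
    print("top-5 singular values:", np.round(s[:5], 2))

# Usage with placeholder inputs; a trained W_E would replace the random matrix.
m = 4
Z = np.array([[(t >> b) & 1 for b in range(m)] for t in range(2**m)])
embedding_geometry(np.random.default_rng(2).standard_normal((2**m, 32)), Z)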

RESULTS

By the Numbers

  • Efficacy Score: proportion of samples with Pr[target_false] > Pr[target_true]; it increases as the number of hijack prepends k grows from 0 to 5 on CounterFact.

  • Context length L: 64 and 128 tokens; longer L drives the recall error R_{D_L}(f) toward 0 in Theorem 1 and Theorem 4.

  • Vocabulary size V: 2^m tokens; each token corresponds to an m-bit latent vector z.

  • Latent dimension m: m ≥ 3 bits; this guarantees existence of the low-error associative memory construction.

Do LLMs dream of elephants evaluates context hijacking on the CounterFact dataset and latent concept association on synthetic data, showing that hijack prepends raise Efficacy Score while longer contexts and associative value matrices drive recall error arbitrarily low.

BENCHMARK

Context hijacking on CounterFact with hijack prepends

Efficacy Score ES: proportion of CounterFact prompts where Pr[target_false] > Pr[target_true] after hijacking.

KEY INSIGHT

The Counterintuitive Finding

Do LLMs dream of elephants shows that adding factually correct text like “The Eiffel Tower is not located in Guam” can still push LLMs toward the false answer Guam.

This is surprising because a human reader finds both the original and the hijacked prompt semantically unambiguous, yet LLMs behave like associative memories that over-weight surface token co-occurrence instead of underlying meaning.

WHY IT MATTERS

What this unlocks for the field

Do LLMs dream of elephants gives a concrete associative memory model for transformers, clarifying how value matrices and embeddings store concept associations.

Armed with this view, builders can design more robust prompting, editing, and fine-tuning schemes that manipulate low-rank memory structure instead of blindly relying on context semantics.


