Human-like Episodic Memory for Infinite Context LLMs

Authors: Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee et al.

2024

TL;DR

EM-LLM uses surprise-based episodic segmentation plus graph-theoretic boundary refinement to reach a 51.58% LongBench average, versus 39.3% for full-context processing (+12.28 points).



THE PROBLEM

LLMs Break Down Beyond Training Context Windows

Transformers struggle to extrapolate beyond their training context window, and full-context attention over long sequences is both noisy and computationally infeasible, since its cost grows quadratically with sequence length.

This hurts long-context tasks like multi-document QA and retrieval, where maintaining coherence and precise recall across millions of tokens is required.
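For a rough sense of the scale involved (an illustrative back-of-the-envelope count, ignoring layers, heads, and KV-cache memory), full attention over the 10.2M-token sequences discussed below would require on the order of 10^14 pairwise score computations:

```python
n = 10_200_000          # tokens, matching the longest Retrieve.PassKey input
scores = n * n          # pairwise attention scores for one head, one layer
print(f"{scores:.1e}")  # -> 1.0e+14
```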

HOW IT WORKS

EM-LLM — Surprise-Driven Episodic Memory for Infinite Context

EM-LLM combines surprise-based segmentation, boundary refinement, and memory retrieval to turn long token streams into coherent episodic events stored as key-value caches.

Think of EM-LLM as a brain-inspired system: surprise marks event boundaries, much as it does in the hippocampus, while graph metrics cluster related moments into compact, retrievable episodes.

This design lets EM-LLM retrieve only a few high-value events per layer instead of re-attending to every token, enabling practically infinite context without quadratic attention costs.

DIAGRAM

Token to Episodic Event Pipeline in EM-LLM

This diagram shows how EM-LLM segments incoming tokens into episodic events using surprise and graph-theoretic boundary refinement before storing them.

DIAGRAM

Evaluation Setup for EM-LLM on LongBench and ∞-Bench

This diagram shows how EM-LLM is evaluated against InfLLM, RAG, and full-context baselines on LongBench and ∞-Bench.

PROCESS

How EM-LLM Handles a Long-Context Query

  1. Memory formation via surprise

    EM-LLM computes token-wise surprise and uses surprise-based segmentation to propose event boundaries in the key-value cache as tokens stream in (a segmentation sketch follows this list).

  2. Boundary refinement

    EM-LLM applies boundary refinement, optimising modularity or conductance on attention-key similarity graphs to maximise intra-event cohesion and inter-event separation.

  3. Memory retrieval

    During inference, EM-LLM performs similarity-based retrieval with k-NN over representative tokens, building a similarity buffer of k_s episodic events (a retrieval sketch also follows this list).

  4. Contiguity buffer and context window

    EM-LLM enqueues neighbouring events into a contiguity buffer of size k_c and combines initial tokens, both buffers, and local context into the final context window.
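A minimal Python sketch of step 1, under stated assumptions: logits are taken to be aligned so that logits[t] predicts token_ids[t], and the window size and threshold multiplier gamma are illustrative choices rather than the paper's exact settings.

```python
import torch

def segment_by_surprise(logits: torch.Tensor,
                        token_ids: torch.Tensor,
                        window: int = 128,
                        gamma: float = 1.0) -> list[int]:
    """Propose event boundaries where a token's surprise exceeds an
    adaptive threshold: mean + gamma * std over a recent window.

    logits:    (seq_len, vocab_size), assumed aligned so logits[t] is
               the model's prediction for token_ids[t]
    token_ids: (seq_len,) observed tokens
    """
    # Token-wise surprise = negative log-likelihood of the observed token.
    log_probs = torch.log_softmax(logits, dim=-1)
    surprise = -log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)

    boundaries = [0]
    for t in range(window, len(surprise)):
        recent = surprise[t - window:t]
        if surprise[t] > recent.mean() + gamma * recent.std():
            boundaries.append(t)
    return boundaries
```

In a streaming deployment, the window mean and variance would be updated incrementally rather than recomputed at every token.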
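And a companion sketch of steps 3 and 4. Representing each event by a single key vector and scoring with a plain dot product are simplifying assumptions; retrieve_events is a hypothetical helper, with k_s and k_c mirroring the buffer sizes above.

```python
import torch

def retrieve_events(query: torch.Tensor,
                    event_reprs: torch.Tensor,
                    k_s: int = 8,
                    k_c: int = 4) -> list[int]:
    """Return event indices: k_s nearest events by similarity (the
    similarity buffer) plus up to k_c temporal neighbours of those
    events (the contiguity buffer).

    query:       (d,) a summary of the current query vectors
    event_reprs: (num_events, d) one representative key per stored event
    """
    # Similarity buffer: top-k_s events by dot-product similarity (k-NN).
    scores = event_reprs @ query
    sim_ids = torch.topk(scores, k=min(k_s, scores.numel())).indices.tolist()

    # Contiguity buffer: events temporally adjacent to the retrieved ones.
    contig_ids: list[int] = []
    for i in sim_ids:
        for j in (i - 1, i + 1):
            if 0 <= j < len(event_reprs) and j not in sim_ids + contig_ids:
                contig_ids.append(j)
            if len(contig_ids) == k_c:
                return sim_ids + contig_ids
    return sim_ids + contig_ids
```

The returned indices would then be mapped back to their cached key-value spans and concatenated with the initial tokens and the local context to assemble the final window.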

KEY CONTRIBUTIONS

Key Contributions

  • Memory formation via surprise

    EM-LLM introduces surprise-based segmentation of token streams into episodic events, thresholding negative log-likelihood against an adaptive mean and variance computed over a sliding window.

  • Boundary refinement

    EM-LLM refines event boundaries by optimising graph-theoretic metrics such as modularity and conductance over attention-key similarity, improving the ratio of intra-event to inter-event similarity (a refinement sketch follows this list).

  • Memory retrieval

    EM-LLM combines similarity-based k-NN retrieval with a temporal contiguity buffer, achieving 100% accuracy on Retrieve.PassKey at up to 10.2M tokens.
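A rough sketch of the boundary refinement, using a simplified local conductance objective; the paper also considers modularity, and the search radius and exact objective here are assumptions for illustration.

```python
import torch

def refine_boundary(sim: torch.Tensor, left: int, b: int, right: int,
                    radius: int = 8) -> int:
    """Shift one proposed boundary b within (left, right) to the split
    whose cut through the attention-key similarity graph is cheapest.

    sim: (n, n) non-negative pairwise similarities between attention keys.
    """
    def conductance(split: int) -> float:
        # Similarity mass crossing the candidate boundary (inter-event edges).
        cut = sim[left:split, split:right].sum()
        # Simplified volume: total similarity mass on the smaller side.
        vol = min(sim[left:split, left:right].sum(),
                  sim[split:right, left:right].sum())
        return float(cut / vol) if vol > 0 else float("inf")

    lo = max(left + 1, b - radius)
    hi = min(right - 1, b + radius)
    return min(range(lo, hi + 1), key=conductance)
```

Lower conductance means less similarity mass crossing the boundary relative to the mass kept inside each event, which is exactly the intra-/inter-event separation being optimised.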

RESULTS

By the Numbers

LongBench Avg.

51.58%

+12.28 over full context

LongBench Avg.

51.58%

+15.14 over RAG

∞-Bench Retrieve.PassKey

100%

10.2M-token sequences, where full-context processing is infeasible

LongBench Retrieval group

98.5%

+1.5 over InfLLM on LLaMA 3.1

On LongBench, EM-LLM with LLaMA-3.1-8B reaches a 51.58% average versus 39.3% with full context and 36.44% with RAG, while on ∞-Bench it achieves 100% on Retrieve.PassKey at up to 10.2M tokens. These results show EM-LLM scales long-context reasoning beyond both full-context and InfLLM baselines.


BENCHMARK

EM-LLM vs RAG vs Full Context on LongBench

Average LongBench score (%) for EM-LLM, RAG, and full-context processing with LLaMA-3.1-8B.

KEY INSIGHT

The Counterintuitive Finding

EM-LLM surpasses full-context LLaMA-3.1-8B on LongBench, scoring 51.58% versus 39.3%, despite never seeing the entire sequence at once.

This is surprising because full-context attention sees strictly more information, yet EM-LLM's episodic retrieval plus refinement yields a +12.28-point advantage.

WHY IT MATTERS

What this unlocks for the field

EM-LLM makes practically infinite context lengths usable by turning raw token streams into structured episodic events with efficient retrieval.

Builders can now handle 10M-token histories for retrieval, QA, and coding tasks without retraining LLMs or paying quadratic attention costs.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimised Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances, it achieves a 22.7% token reduction (14.9M → 11.5M) while matching the baseline ReAct agent's 3/5 (60%) task success.
