Larimar: Large Language Models with Episodic Memory Control

Authors: Payel Das, Subhajit Chaudhury, Elliot Nelson, et al.

2024

TL;DR

Larimar uses one-shot hierarchical episodic memory updates to condition frozen LLM decoders, achieving up to 10× faster fact editing with 99.6–99.8% single-edit success on CounterFact.



THE PROBLEM

LLM knowledge updates are slow and fragile — even 10 edits can take 13.9s

Existing editing methods like ROME and GRACE require retraining or fact tracing, taking up to 13.9s per 10 edits on GPT-2 and 19.3s on GPT-J.

These slow, parameter-level updates make it hard to keep deployed LLMs factually current, safe, and privacy-preserving, especially under sequential or batch editing workloads.

HOW IT WORKS

Larimar — episodic memory conditioned LLMs

Larimar combines a BERT-large encoder, a deterministic associative memory M, a scope detector, and a GPT-2 large or GPT-J 6B decoder linked to the memory by the projection W_M.

Think of the encoder and decoder as neocortex, while the hierarchical memory behaves like a hippocampus that can rapidly store and replay factual episodes.

By performing one-shot least-squares writes, reads, and sequential writing/forgetting operations in memory space, Larimar enforces edits at decoding time without updating LLM parameters or expanding the input context.
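
To make the memory operations concrete, here is a minimal sketch of the one-shot least-squares write and read in memory space. It assumes a pseudo-inverse based update over a K×d memory matrix with illustrative shapes (512 slots, 768-dimensional encodings); the function names and details are ours, not the paper's exact implementation.

```python
import numpy as np

def write_episode(M0, Z):
    """One-shot least-squares write of an episode's encodings Z into memory.

    M0: (K, d) prior memory matrix; Z: (n, d) encoder outputs for the edited facts.
    Returns a memory M chosen so that the episode is reconstructable as W0 @ M.
    """
    W0 = Z @ np.linalg.pinv(M0)      # (n, K) addressing weights against the prior memory
    return np.linalg.pinv(W0) @ Z    # (K, d) least-squares solution, computed in one shot

def read(M, z_query):
    """Address the memory with a query encoding and return the read-out latent."""
    w = z_query @ np.linalg.pinv(M)  # (K,) addressing weights
    return w @ M                     # (d,) read vector, later projected into the decoder by W_M

# Toy usage: a 512x768 memory and a four-fact episode from the encoder.
rng = np.random.default_rng(0)
M0 = rng.normal(size=(512, 768))
Z = rng.normal(size=(4, 768))
M = write_episode(M0, Z)
z_read = read(M, Z[0])               # recovers (approximately) the first edited fact's latent
```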

DIAGRAM

Larimar inference flow for a fact edit

This diagram shows how Larimar processes an edit example at test time using one-shot memory write, scope detection, and memory-conditioned decoding.

DIAGRAM

Larimar training and evaluation pipeline

This diagram shows how Larimar is trained on WikiText episodes and then evaluated on CounterFact and ZsRE editing benchmarks.

PROCESS

How Larimar Handles a Fact Editing Session

  1. Memory operations: write, read, generate

    Larimar encodes an edit episode with the BERT-large encoder, then uses the write operation with W_0 to update the associative memory M in one shot.

  2. Sequential writing and forgetting

    Larimar applies the sequential update equations for C_i and M_i with α_i = 1 to add facts or α_i = −1 to selectively forget previously written encodings (a simplified sketch of this update appears after the list).

  3. Scope detector

    Larimar runs the external or internal encoding-based scope detector to decide whether a query should trigger memory-conditioned decoding.

  4. Memory-conditioned decoding

    Larimar reads Z_r from memory, projects it via W_M into a key-value cache for all GPT-2 large or GPT-J 6B decoder layers, and generates the edited output (a scope-gating sketch covering steps 3 and 4 appears after the list).
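
Step 2's α_i = ±1 rule can be illustrated with a simplified running least-squares form: keep sufficient statistics of everything written so far, and subtract a fact's contribution to forget it. This is a hedged sketch under our own simplification (accumulating C and U and re-solving); it is not the paper's exact recursive equations.

```python
import numpy as np

def sequential_update(C, U, w, z, alpha=1.0):
    """Write (alpha=+1) or forget (alpha=-1) one fact encoding z with addressing weights w.

    C: (K, K) accumulated addressing statistics; U: (K, d) accumulated targets.
    The memory is re-solved as the least-squares fit to the facts still in the store.
    """
    C = C + alpha * np.outer(w, w)
    U = U + alpha * np.outer(w, z)
    M = np.linalg.pinv(C) @ U        # least-squares memory given the current fact set
    return C, U, M

# Toy usage: write two facts, then selectively forget the first one.
rng = np.random.default_rng(1)
K, d = 512, 768
C, U = np.zeros((K, K)), np.zeros((K, d))
w1, z1 = rng.normal(size=K), rng.normal(size=d)
w2, z2 = rng.normal(size=K), rng.normal(size=d)
C, U, M = sequential_update(C, U, w1, z1, alpha=+1.0)   # add fact 1
C, U, M = sequential_update(C, U, w2, z2, alpha=+1.0)   # add fact 2
C, U, M = sequential_update(C, U, w1, z1, alpha=-1.0)   # erase fact 1; fact 2 stays readable
```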
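
The scope gating of steps 3 and 4 can be sketched as a similarity check over query and edit encodings followed by an optional memory read. The cosine rule, the 0.7 threshold, and the toy encodings are our assumptions; the paper's detectors are separate external or internal encoding-based classifiers.

```python
import numpy as np

def in_scope(z_query, edit_encodings, threshold=0.7):
    """Return True if the query encoding is close to any stored edit (assumed cosine gate)."""
    sims = edit_encodings @ z_query / (
        np.linalg.norm(edit_encodings, axis=1) * np.linalg.norm(z_query) + 1e-8
    )
    return bool(sims.max() > threshold)

def conditioning_vector(z_query, M, edit_encodings):
    """Read vector for memory-conditioned decoding, or None to decode without memory."""
    if not in_scope(z_query, edit_encodings):
        return None                        # out of scope: the decoder behaves as if unedited
    w = z_query @ np.linalg.pinv(M)        # addressing weights for the query
    return w @ M                           # z_r, to be projected by W_M into the decoder's KV cache

# Toy usage: a paraphrase-like query near a stored edit triggers memory-conditioned decoding.
rng = np.random.default_rng(2)
M = rng.normal(size=(512, 768))
edits = rng.normal(size=(3, 768))
query = edits[1] + 0.05 * rng.normal(size=768)
print(in_scope(query, edits))              # True
```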

KEY CONTRIBUTIONS

Key Contributions

  • Episodic and adaptable memory-conditioned LLM architectures

    Larimar couples a BERT-large encoder, a deterministic memory M, and a GPT-2 large or GPT-J 6B decoder to enable test-time adaptation without any gradient-based learning on edits, achieving 8–10× speedups over ROME and GRACE.

  • Knowledge editing and input context length generalization

    Larimar uses one-shot memory updating and recursive memory search to handle single, sequential, and batch fact editing, and to generalize from 384- or 1024-token training contexts to FastFacts sequences of up to 5607 tokens (a chunked-write sketch appears after this list).

  • Selective fact forgetting and information leakage prevention

    Larimar exploits the same sequential writing and forgetting equations with α_i = −1 to erase specific facts, retaining up to 0.997 recall on the remaining facts and reducing rephrasing-attack success to 17.6%.
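
The context-length generalization claim above can be pictured as writing a long input into memory chunk by chunk and answering from memory rather than from an over-length prompt. This reuses the same one-shot write; the chunking and retrieval loop are illustrative assumptions, not the paper's exact recursive search procedure.

```python
import numpy as np

def write_long_document(M0, chunk_encodings):
    """Write encodings of context-sized chunks into one memory with the one-shot write."""
    Z = np.stack(chunk_encodings)          # (n_chunks, d), each chunk fits the training context
    W0 = Z @ np.linalg.pinv(M0)
    return np.linalg.pinv(W0) @ Z          # memory now covers the whole long input

def recall(M, z_query):
    """Answer a query from memory, independent of the original document length."""
    w = z_query @ np.linalg.pinv(M)
    return w @ M

# Toy usage: a long input split into six chunks that each fit the training window.
rng = np.random.default_rng(3)
M0 = rng.normal(size=(512, 768))
chunks = [rng.normal(size=768) for _ in range(6)]
M = write_long_document(M0, chunks)
z = recall(M, chunks[3])                   # query about content from the fourth chunk
```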

RESULTS

By the Numbers

Edit Success (S), CounterFact, GPT-2

100.0%

Ties ROME on GPT-2 XL (both 100.0% S)

Edit Success (S), CounterFact, GPT-J

99.8%

+0.1 over ROME on GPT-J (99.8% vs 99.7%)

Edit Retention Rate, ZsRE

0.97

+0.04 over MEND and +0.04 over GRACE on sequential edits

Wall-clock time, 10 edits, GPT-2

1.1s

−3.7s vs ROME (4.8s) and −12.8s vs GRACE (13.9s)

On CounterFact and ZsRE, which test factual editing, paraphrase generalization, and neighborhood specificity, Larimar matches or exceeds ROME and GRACE accuracy while delivering up to 10× faster editing and strong sequential retention.


BENCHMARK

Single fact editing on CounterFact dataset (GPT-J)

Edit Success (S) on the first 2000 CounterFact samples using GPT-J-based editors.

KEY INSIGHT

The Counterintuitive Finding

Larimar retains near perfect batch rewrite accuracy, staying around 100% for up to 512 edits and only dropping to 82% at 1024 edits.

This is surprising because Larimar uses a fixed 512×768 memory, yet it still compresses batches larger than its K = 512 slots with higher recall than parameter-editing baselines like MEND and ROME in the same batch regime.

WHY IT MATTERS

What this unlocks for the field

Larimar makes it practical to treat factual knowledge as an editable episodic store, enabling one-shot updates, selective forgetting, and long-context recall without retraining.

Builders can now bolt Larimar onto existing GPT-2 large or GPT-J 6B style decoders to get controllable, fast, and reversible knowledge edits that scale to sequential and batch settings.


