Larimar: Large Language Models with Episodic Memory Control

Authors: Payel Das, Subhajit Chaudhury, Elliot Nelson, et al.

2024

TL;DR

Larimar uses one-shot hierarchical episodic memory updates to condition frozen LLM decoders, achieving up to 10× faster fact editing with 99.6–99.8% single-edit success on CounterFact.



THE PROBLEM

LLM knowledge updates are slow and fragile — even 10 edits can take 13.9s

Existing editing methods like ROME and GRACE require retraining or fact tracing, taking up to 13.9s per 10 edits on GPT-2 and 19.3s on GPT-J.

These slow, parameter-level updates make it hard to keep deployed LLMs factually current, safe, and privacy-preserving, especially under sequential or batch editing workloads.

HOW IT WORKS

Larimar — episodic memory conditioned LLMs

Larimar combines a BERT-large encoder, a deterministic associative memory M, a scope detector, and a GPT-2 large or GPT-J 6B decoder linked to the memory by the projection W_M.

Think of the encoder and decoder as neocortex, while the hierarchical memory behaves like a hippocampus that can rapidly store and replay factual episodes.

By performing one-shot least-squares writes, reads, and sequential writing/forgetting operations in memory space, Larimar enforces edits at decoding time without updating LLM parameters or expanding the input context.
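
To make the memory operations concrete, here is a minimal sketch of the one-shot least-squares write and read in memory space. It assumes a pseudo-inverse based update over a K×d memory matrix with illustrative shapes (512 slots, 768-dimensional encodings); the function names and details are ours, not the paper's exact implementation.

```python
import numpy as np

def write_episode(M0, Z):
    """One-shot least-squares write of an episode's encodings Z into memory.

    M0: (K, d) prior memory matrix; Z: (n, d) encoder outputs for the edited facts.
    Returns a memory M chosen so that the episode is reconstructable as W0 @ M.
    """
    W0 = Z @ np.linalg.pinv(M0)      # (n, K) addressing weights against the prior memory
    return np.linalg.pinv(W0) @ Z    # (K, d) least-squares solution, computed in one shot

def read(M, z_query):
    """Address the memory with a query encoding and return the read-out latent."""
    w = z_query @ np.linalg.pinv(M)  # (K,) addressing weights
    return w @ M                     # (d,) read vector, later projected into the decoder by W_M

# Toy usage: a 512x768 memory and a four-fact episode from the encoder.
rng = np.random.default_rng(0)
M0 = rng.normal(size=(512, 768))
Z = rng.normal(size=(4, 768))
M = write_episode(M0, Z)
z_read = read(M, Z[0])               # recovers (approximately) the first edited fact's latent
```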

DIAGRAM

Larimar inference flow for a fact edit

This diagram shows how Larimar processes an edit example at test time using one-shot memory write, scope detection, and memory-conditioned decoding.

DIAGRAM

Larimar training and evaluation pipeline

This diagram shows how Larimar is trained on WikiText episodes and then evaluated on CounterFact and ZsRE editing benchmarks.

PROCESS

How Larimar Handles a Fact Editing Session

  1. Memory operations: write, read, generate

    Larimar encodes an edit episode with the BERT-large encoder, then uses the write operation with W_0 to update the associative memory M in one shot.

  2. Sequential writing and forgetting

    Larimar applies the sequential update equations for C_i and M_i with α_i = 1 to add facts or α_i = −1 to selectively forget previously written encodings (a simplified sketch of this update appears after the list).

  3. Scope detector

    Larimar runs the external or internal encoding-based scope detector to decide whether a query should trigger memory-conditioned decoding.

  4. Memory-conditioned decoding

    Larimar reads Z_r from memory, projects it via W_M into a key-value cache for all GPT-2 large or GPT-J 6B decoder layers, and generates the edited output (a scope-gating sketch covering steps 3 and 4 appears after the list).
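
Step 2's α_i = ±1 rule can be illustrated with a simplified running least-squares form: keep sufficient statistics of everything written so far, and subtract a fact's contribution to forget it. This is a hedged sketch under our own simplification (accumulating C and U and re-solving); it is not the paper's exact recursive equations.

```python
import numpy as np

def sequential_update(C, U, w, z, alpha=1.0):
    """Write (alpha=+1) or forget (alpha=-1) one fact encoding z with addressing weights w.

    C: (K, K) accumulated addressing statistics; U: (K, d) accumulated targets.
    The memory is re-solved as the least-squares fit to the facts still in the store.
    """
    C = C + alpha * np.outer(w, w)
    U = U + alpha * np.outer(w, z)
    M = np.linalg.pinv(C) @ U        # least-squares memory given the current fact set
    return C, U, M

# Toy usage: write two facts, then selectively forget the first one.
rng = np.random.default_rng(1)
K, d = 512, 768
C, U = np.zeros((K, K)), np.zeros((K, d))
w1, z1 = rng.normal(size=K), rng.normal(size=d)
w2, z2 = rng.normal(size=K), rng.normal(size=d)
C, U, M = sequential_update(C, U, w1, z1, alpha=+1.0)   # add fact 1
C, U, M = sequential_update(C, U, w2, z2, alpha=+1.0)   # add fact 2
C, U, M = sequential_update(C, U, w1, z1, alpha=-1.0)   # erase fact 1; fact 2 stays readable
```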
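
The scope gating of steps 3 and 4 can be sketched as a similarity check over query and edit encodings followed by an optional memory read. The cosine rule, the 0.7 threshold, and the toy encodings are our assumptions; the paper's detectors are separate external or internal encoding-based classifiers.

```python
import numpy as np

def in_scope(z_query, edit_encodings, threshold=0.7):
    """Return True if the query encoding is close to any stored edit (assumed cosine gate)."""
    sims = edit_encodings @ z_query / (
        np.linalg.norm(edit_encodings, axis=1) * np.linalg.norm(z_query) + 1e-8
    )
    return bool(sims.max() > threshold)

def conditioning_vector(z_query, M, edit_encodings):
    """Read vector for memory-conditioned decoding, or None to decode without memory."""
    if not in_scope(z_query, edit_encodings):
        return None                        # out of scope: the decoder behaves as if unedited
    w = z_query @ np.linalg.pinv(M)        # addressing weights for the query
    return w @ M                           # z_r, to be projected by W_M into the decoder's KV cache

# Toy usage: a paraphrase-like query near a stored edit triggers memory-conditioned decoding.
rng = np.random.default_rng(2)
M = rng.normal(size=(512, 768))
edits = rng.normal(size=(3, 768))
query = edits[1] + 0.05 * rng.normal(size=768)
print(in_scope(query, edits))              # True
```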

KEY CONTRIBUTIONS

Key Contributions

  • Episodic and adaptable memory-conditioned LLM architectures

    Larimar couples a BERT-large encoder, a deterministic memory M, and a GPT-2 large or GPT-J 6B decoder to enable test-time adaptation without any gradient-based learning on edits, achieving 8–10× speedups over ROME and GRACE.

  • Knowledge editing and input context length generalization

    Larimar uses one-shot memory updating and recursive memory search to handle single, sequential, and batch fact editing, and to generalize from 384- or 1024-token training contexts to FastFacts sequences of up to 5607 tokens (a chunked-write sketch appears after this list).

  • Selective fact forgetting and information leakage prevention

    Larimar exploits the same sequential writing and forgetting equations with α_i = −1 to erase specific facts, retaining up to 0.997 recall on the remaining facts and reducing rephrasing-attack success to 17.6%.
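
The context-length generalization claim above can be pictured as writing a long input into memory chunk by chunk and answering from memory rather than from an over-length prompt. This reuses the same one-shot write; the chunking and retrieval loop are illustrative assumptions, not the paper's exact recursive search procedure.

```python
import numpy as np

def write_long_document(M0, chunk_encodings):
    """Write encodings of context-sized chunks into one memory with the one-shot write."""
    Z = np.stack(chunk_encodings)          # (n_chunks, d), each chunk fits the training context
    W0 = Z @ np.linalg.pinv(M0)
    return np.linalg.pinv(W0) @ Z          # memory now covers the whole long input

def recall(M, z_query):
    """Answer a query from memory, independent of the original document length."""
    w = z_query @ np.linalg.pinv(M)
    return w @ M

# Toy usage: a long input split into six chunks that each fit the training window.
rng = np.random.default_rng(3)
M0 = rng.normal(size=(512, 768))
chunks = [rng.normal(size=768) for _ in range(6)]
M = write_long_document(M0, chunks)
z = recall(M, chunks[3])                   # query about content from the fourth chunk
```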

RESULTS

By the Numbers

Edit Success (S), CounterFact, GPT-2

100.0%

Ties ROME on GPT-2 XL (both 100.0% S)

Edit Success (S), CounterFact, GPT-J

99.8%

+0.1 over ROME on GPT-J (99.8% vs 99.7%)

Edit Retention Rate, ZsRE

0.97

+0.04 over MEND and +0.04 over GRACE on sequential edits

Wall-clock time, 10 edits, GPT-2

1.1s

−3.7s vs ROME (4.8s) and −12.8s vs GRACE (13.9s)

On CounterFact and ZsRE, which test factual editing, paraphrase generalization, and neighborhood specificity, Larimar matches or exceeds ROME and GRACE accuracy while delivering up to 10× faster editing and strong sequential retention.


BENCHMARK

Single fact editing on CounterFact dataset (GPT-J)

Edit Success (S) on the first 2000 CounterFact samples using GPT-J-based editors.

KEY INSIGHT

The Counterintuitive Finding

Larimar retains near perfect batch rewrite accuracy, staying around 100% for up to 512 edits and only dropping to 82% at 1024 edits.

This is surprising because Larimar uses a fixed 512×768 memory, yet it still compresses batches larger than its K = 512 slots with higher recall than parameter-editing baselines like MEND and ROME in the same batch regime.

WHY IT MATTERS

What this unlocks for the field

Larimar makes it practical to treat factual knowledge as an editable episodic store, enabling one-shot updates, selective forgetting, and long-context recall without retraining.

Builders can now bolt Larimar onto existing GPT-2 large or GPT-J 6B style decoders to get controllable, fast, and reversible knowledge edits that scale to sequential and batch settings.


