Episodic Memory in Lifelong Language Learning

Authors: Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama

arXiv 2019

TL;DR

Episodic Memory in Lifelong Language Learning adds sparse experience replay and local adaptation on top of a key-value episodic memory, reaching 70.6 averaged classification accuracy versus 66.9 for A-GEM.


THE PROBLEM

Lifelong language models catastrophically forget past datasets

Episodic Memory in Lifelong Language Learning targets catastrophic forgetting: standard encoder-decoder training on a sequential stream yields only 18.4 averaged classification accuracy across datasets.

When text classification and question answering datasets arrive sequentially without identifiers, catastrophic forgetting makes previously learned classes and answer spans unusable for later test examples.
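
To make the setup concrete, here is a minimal sketch of the single-pass stream the paper assumes, written in plain Python with hypothetical dataset loaders: datasets arrive one after another, and no example carries a dataset identifier.

```python
def lifelong_stream(datasets):
    """Yield (x, y) pairs from each dataset in turn, exactly once.

    `datasets` is an ordered list of iterables of (input, label) pairs;
    the model never sees which dataset an example came from.
    """
    for dataset in datasets:   # datasets arrive sequentially
        for x, y in dataset:   # single pass, no task/dataset descriptor
            yield x, y

# Hypothetical usage with placeholder loaders:
# stream = lifelong_stream([agnews, yelp, dbpedia, amazon, yahoo])
```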

HOW IT WORKS

Episodic memory with sparse replay and local adaptation

Episodic Memory in Lifelong Language Learning combines a BERT-based example encoder, a task decoder, and a key-value episodic memory whose keys come from a separate frozen key network, enabling sparse experience replay and local adaptation.

Think of the episodic memory as a hippocampus-backed card catalog, where frozen BERT keys index past examples and MBPA++ temporarily rewires the parameters for each query.

This key-based MBPA++ adaptation lets Episodic Memory in Lifelong Language Learning specialize its predictions per test example, beyond what a fixed context window or static multitask training can express.
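
As a rough illustration, a key-value episodic memory of this kind could be sketched as below. The `key_encoder` stands in for the paper's frozen BERT key network, and the Euclidean nearest-neighbor lookup is one plausible retrieval choice; names and details are illustrative, not the authors' code.

```python
import numpy as np

class EpisodicMemory:
    """Key-value episodic memory sketch: keys come from a frozen encoder
    (the paper uses a pretrained BERT), values are raw (input, label) pairs."""

    def __init__(self, key_encoder):
        self.key_encoder = key_encoder  # frozen: never updated during training
        self.keys, self.examples = [], []

    def write(self, x, y):
        self.keys.append(self.key_encoder(x))  # key u_t for example x_t
        self.examples.append((x, y))

    def sample(self, n):
        idx = np.random.choice(len(self.examples),
                               size=min(n, len(self.examples)), replace=False)
        return [self.examples[i] for i in idx]  # uniform draw for sparse replay

    def nearest(self, x, k):
        q = self.key_encoder(x)
        dists = np.linalg.norm(np.stack(self.keys) - q, axis=1)  # distance in key space
        return [self.examples[i] for i in np.argsort(dists)[:k]]  # K nearest neighbors
```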

DIAGRAM

Local adaptation and retrieval during inference

This diagram shows how Episodic Memory in Lifelong Language Learning performs MBPA++ local adaptation using K nearest neighbors from episodic memory at inference time.
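
In code, that local adaptation step has the shape below: starting from the trained weights W, take a handful of gradient steps on the K retrieved neighbors while an L2 term keeps the adapted weights near W, matching the structure of the paper's Eq. 1 (a neighbor likelihood term plus a proximity penalty). This is a PyTorch sketch; `loss_fn`, the step count, and the hyperparameter values are placeholders rather than the paper's exact settings.

```python
import copy
import torch

def locally_adapt(model, neighbors, loss_fn, steps=30, lr=1e-3, lam=1e-3):
    """Adapt a copy of `model` to one test example's K retrieved neighbors."""
    adapted = copy.deepcopy(model)                 # \tilde{W}, initialized at W
    base = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nll = sum(loss_fn(adapted(x), y) for x, y in neighbors)  # -log p(y|x; W~)
        prox = sum(((p - b) ** 2).sum()                          # ||W~ - W||^2
                   for p, b in zip(adapted.parameters(), base))
        (nll + lam * prox).backward()
        opt.step()
    return adapted  # used to predict on this one test example, then discarded
```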

DIAGRAM

Training loop with sparse experience replay

This diagram shows how Episodic Memory in Lifelong Language Learning trains on a single-pass stream with 1 percent sparse experience replay from episodic memory.
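
A compact sketch of that loop, reusing the memory object from the earlier sketch; `train_step` is a hypothetical helper that performs one gradient update on a single example.

```python
def train_lifelong(model, memory, stream, train_step,
                   replay_every=10_000, replay_count=100):
    """Single-pass training with sparse experience replay."""
    for t, (x, y) in enumerate(stream, start=1):
        train_step(model, x, y)        # learn from the new example
        memory.write(x, y)             # store it in episodic memory
        if t % replay_every == 0:      # sparse: 100 per 10,000 = 1 percent
            for x_r, y_r in memory.sample(replay_count):
                train_step(model, x_r, y_r)
```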

PROCESS

How Episodic Memory in Lifelong Language Learning Handles a Lifelong Training Stream

  1. Example encoder

    Episodic Memory in Lifelong Language Learning feeds each x_t into the BERT-based example encoder to obtain token representations and [CLS] vectors for classification or span prediction.

  2. Task decoder

    Using the encoder outputs, Episodic Memory in Lifelong Language Learning applies a text classification softmax over 33 classes or a span predictor with w_start and w_end for question answering; a minimal decoder sketch follows this list.

  3. Episodic memory

    Episodic Memory in Lifelong Language Learning computes a key u_t for each example with the frozen BERT key network and writes the pair ⟨x_t, y_t⟩ into the key-value episodic memory (all examples in the full model, a random subset in the space-efficient variant).

  4. Sparse experience replay and local adaptation

    Every 10,000 new examples, Episodic Memory in Lifelong Language Learning replays 100 samples drawn uniformly from memory, and at inference it retrieves K neighbors to run MBPA++ local adaptation using Eq. 1.
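
As referenced in step 2, a minimal sketch of the two decoder heads might look like this, assuming BERT-base's 768-dimensional hidden states; the layer names are illustrative, and the 33-way classifier covers the union of classes across the classification datasets.

```python
import torch.nn as nn

class TaskDecoder(nn.Module):
    """Classification head plus QA span head over encoder outputs."""

    def __init__(self, hidden=768, n_classes=33):
        super().__init__()
        self.classifier = nn.Linear(hidden, n_classes)  # softmax over 33 classes
        self.span = nn.Linear(hidden, 2)                # columns play w_start, w_end

    def forward(self, cls_vec, token_vecs):
        class_logits = self.classifier(cls_vec)         # from the [CLS] vector
        start_logits, end_logits = self.span(token_vecs).unbind(-1)  # per-token scores
        return class_logits, start_logits, end_logits
```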

KEY CONTRIBUTIONS

Key Contributions

  • Lifelong language learning setup without dataset identifiers

    Episodic Memory in Lifelong Language Learning defines a single-pass stream over multiple datasets of the same task, using a shared example encoder and task decoder without any dataset descriptors.

  • Episodic memory with sparse experience replay and local adaptation

    Episodic Memory in Lifelong Language Learning augments BERT with a key-value episodic memory, using random writes, uniform random sampling for replay at a 1 percent rate, and MBPA++ local adaptation over K neighbors.

  • Space-efficient random write strategy

    Episodic Memory in Lifelong Language Learning shows that storing only 10 percent of examples via random writes still yields 67.6 classification accuracy versus 70.6 with full memory, reducing memory by 90 percent (see the sketch after this list).
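
The random-write policy in that third contribution amounts to a coin flip per incoming example, as in this small sketch; the `memory.write` interface is carried over from the earlier memory sketch.

```python
import random

def maybe_write(memory, x, y, write_prob=0.1):
    """Store each example with probability `write_prob`; 0.1 keeps
    roughly 10 percent of the stream instead of all of it."""
    if random.random() < write_prob:
        memory.write(x, y)
```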

RESULTS

By the Numbers

  • Class.-avg. accuracy: 70.6 (+3.7 over A-GEM)

  • QA-avg. F1 score: 62.4 (+4.5 over REPLAY)

  • Class.-avg. accuracy with 10 percent memory: 67.6 (-3.0 vs. full MBPA++ memory)

  • Replay rate: 1 percent (100 examples every 10,000 new examples)

On concatenated text classification datasets from Zhang et al. and QA datasets SQuAD, TriviaQA, and QuAC, Episodic Memory in Lifelong Language Learning improves averaged classification accuracy to 70.6 and QA F1 to 62.4, demonstrating that combining sparse experience replay with MBPA++ mitigates catastrophic forgetting compared to A-GEM and REPLAY baselines.

BENCHMARK

Summary of results on text classification using averaged accuracy

Macro-averaged accuracy across AGNews, Yelp, DBpedia, Amazon, and Yahoo in the lifelong setup.
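
To make the metric concrete, macro-averaged accuracy is the unweighted mean of per-dataset accuracies, so each dataset counts equally regardless of size:

```python
def macro_averaged_accuracy(per_dataset_correct, per_dataset_total):
    """Average the per-dataset accuracies, not the pooled example-level accuracy."""
    accs = [c / t for c, t in zip(per_dataset_correct, per_dataset_total)]
    return sum(accs) / len(accs)
```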

KEY INSIGHT

The Counterintuitive Finding

Episodic Memory in Lifelong Language Learning maintains 67.6 averaged classification accuracy when storing only 10 percent of examples, just 3.0 points below full memory.

This is surprising because naive intuition suggests aggressive memory subsampling would severely harm retrieval quality and MBPA++ adaptation, yet random write preserves most performance.

WHY IT MATTERS

What this unlocks for the field

Episodic Memory in Lifelong Language Learning shows that a frozen key network plus MBPA++ can support continual BERT-based language learning without dataset identifiers.

Builders can now design lifelong NLP systems that replay sparsely, adapt locally per query, and scale episodic memory with simple random write policies instead of expensive full rehearsal.
