On Tiny Episodic Memories in Continual Learning

Authors: Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny et al.

arXiv 2019

TL;DR

On Tiny Episodic Memories in Continual Learning uses direct Experience Replay on tiny episodic buffers to gain up to +16.7 percentage points in average accuracy over FINETUNE with just 1 sample per class.



THE PROBLEM

Continual learners forget past tasks despite tiny episodic memories

On Tiny Episodic Memories in Continual Learning targets catastrophic forgetting, where average forgetting reaches 0.29 for FINETUNE on MNIST even in the tiny-memory regime studied here.

When continual learners like FINETUNE or EWC see each example only once, they quickly forget previous tasks; average accuracy drops sharply and forgetting rises, even for methods that keep an episodic memory.

HOW IT WORKS

Experience Replay with tiny episodic memories

On Tiny Episodic Memories in Continual Learning uses Experience Replay, Reservoir Sampling, Ring Buffer, k-Means, and Mean of Features to jointly train on current data and a tiny episodic buffer.

You can think of the episodic memory as a tiny RAM cache, where strategies like Ring Buffer and k-Means decide which "pages" of past experience stay resident.

This direct replay mechanism lets On Tiny Episodic Memories in Continual Learning generalize beyond the stored examples, something a plain context window or pure regularization cannot achieve.
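
As a rough sketch of that replay step (not the authors' code; the model, optimizer, and batch sizes here are illustrative assumptions), each update draws one minibatch from the current task and one from the episodic memory, stacks them, and applies a single SGD step on the combined batch:

```python
import random
import torch
import torch.nn.functional as F

# Illustrative stand-ins; the paper uses small MLPs / reduced ResNet-18 backbones.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

memory = []  # tiny episodic buffer holding (x, y) pairs from past tasks


def er_step(x_cur, y_cur, mem_batch_size=10):
    """One Experience Replay update on current-task data plus replayed memory."""
    if memory:
        replay = random.sample(memory, min(mem_batch_size, len(memory)))
        x_mem = torch.stack([x for x, _ in replay])
        y_mem = torch.stack([y for _, y in replay])
        x = torch.cat([x_cur, x_mem])   # stack current and memory minibatches
        y = torch.cat([y_cur, y_mem])
    else:
        x, y = x_cur, y_cur             # first task: nothing to replay yet
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The memory itself is filled by one of the writing strategies described in the PROCESS section below.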

DIAGRAM

Single-pass continual learning and memory writing pipeline

This diagram shows how On Tiny Episodic Memories in Continual Learning processes a single-pass task stream and updates the episodic memory using different writing strategies.

DIAGRAM

Evaluation protocol across D^EV and D^CV streams

This diagram shows how On Tiny Episodic Memories in Continual Learning uses D^CV for hyperparameter selection and D^EV for single-pass evaluation with average accuracy and forgetting.

PROCESS

How On Tiny Episodic Memories in Continual Learning Handles a Continual Task Stream

  1. 01

    Protocol for Single Pass Through the Data

    On Tiny Episodic Memories in Continual Learning receives the D^CV and D^EV task streams and enforces that each example in D^EV is seen only once while tracking task IDs.

  2. 02

    Metrics

    On Tiny Episodic Memories in Continual Learning computes average accuracy A_T and forgetting F_T over D^EV from the per-task accuracies a_{i,j} recorded after each task (the formulas are spelled out after this list).

  3. 03

    Experience Replay

    On Tiny Episodic Memories in Continual Learning samples a minibatch from the current task and another from the episodic memory, stacks them, and performs a single SGD step.

  4. 04

    Reservoir Sampling and Ring Buffer

    On Tiny Episodic Memories in Continual Learning updates the tiny memory using Reservoir Sampling, Ring Buffer, k-Means, or Mean of Features (the two simplest strategies are sketched in code after this list), and evaluates performance across Permuted MNIST, Split CIFAR, Split miniImageNet, and Split CUB.
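
For reference, the metrics from step 02 follow the standard definitions used in this line of work (written here from that standard usage rather than copied verbatim from the paper), where a_{i,j} is the accuracy on task j after training on task i:

```latex
% Average accuracy after training on the final task T
A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}

% Forgetting: average drop from the best accuracy previously achieved on each earlier task
F_T = \frac{1}{T-1} \sum_{j=1}^{T-1} \left( \max_{l \in \{1, \dots, T-1\}} a_{l,j} - a_{T,j} \right)
```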

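Below is a minimal sketch of the two simplest writing strategies from step 04, applied to a single-pass stream; the buffer size, class count, and helper names are illustrative assumptions rather than the paper's implementation. Reservoir sampling keeps a uniform random subset of everything seen so far, while a ring buffer keeps a FIFO of the most recent examples per class.

```python
import random
from collections import defaultdict, deque


def reservoir_update(memory, example, n_seen, mem_size):
    """Algorithm R: after n_seen examples, each is kept with probability mem_size / n_seen."""
    if len(memory) < mem_size:
        memory.append(example)
    else:
        idx = random.randint(0, n_seen - 1)  # uniform over all examples seen so far
        if idx < mem_size:
            memory[idx] = example


class RingBuffer:
    """Fixed number of FIFO slots per class; the oldest example per class is evicted."""

    def __init__(self, slots_per_class):
        self.slots = defaultdict(lambda: deque(maxlen=slots_per_class))

    def write(self, x, y):
        self.slots[y].append(x)

    def sample(self, k):
        pool = [(x, y) for y, d in self.slots.items() for x in d]
        return random.sample(pool, min(k, len(pool)))


# Toy single-pass stream: each (x, y) pair is written to memory exactly once.
stream = [([float(i)], i % 5) for i in range(100)]
reservoir, n_seen = [], 0
ring = RingBuffer(slots_per_class=1)  # the "1 sample per class" regime

for x, y in stream:
    n_seen += 1
    reservoir_update(reservoir, (x, y), n_seen, mem_size=5)
    ring.write(x, y)
```

In the paper's comparison, k-Means and Mean of Features instead pick examples that are representative in feature space, but they follow the same write-as-you-stream pattern.
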
KEY CONTRIBUTIONS

Key Contributions

  • 01

    Experience Replay with tiny episodic memories

    On Tiny Episodic Memories in Continual Learning shows that direct Experience Replay with very small buffers yields gains between 7% and 17% in average accuracy when using a single example per class.

  • 02

    Analysis of memory writing strategies

    On Tiny Episodic Memories in Continual Learning compares Reservoir Sampling, Ring Buffer, k-Means, and Mean of Features, and proposes a hybrid strategy that switches from reservoir to ring buffer as classes become underrepresented.

  • 03

    Generalization analysis under repetitive replay

    On Tiny Episodic Memories in Continual Learning analyzes why repetitive training on tiny memories does not harm generalization, showing that subsequent task data acts as a data-dependent regularizer even when the memory is perfectly memorized.

RESULTS

By the Numbers

Average Accuracy MNIST

0.80

+16.7 percentage points over FINETUNE with 1 sample per class using ER Ring Buffer

Average Accuracy CIFAR

0.56

+15.6 percentage points over FINETUNE and +15 percentage points over EWC with 1 sample per class

Average Accuracy CUB

0.64

+9.3 percentage points over FINETUNE and +10 percentage points over EWC with 1 sample per class

Average Accuracy miniImageNet

0.49

+14.3 percentage points over FINETUNE and +11.3 percentage points over EWC with 1 sample per class

These metrics come from Permuted MNIST, Split CIFAR, Split CUB, and Split miniImageNet under a single-pass continual learning protocol, demonstrating that On Tiny Episodic Memories in Continual Learning leverages tiny episodic memories to substantially reduce forgetting compared to FINETUNE and EWC.

BENCHMARK

Average forgetting with a tiny episodic memory of one example per class

Forgetting F_T across benchmarks when using 1 sample per class in the episodic memory.

KEY INSIGHT

The Counterintuitive Finding

On Tiny Episodic Memories in Continual Learning shows that with only 1 sample per class, Experience Replay can improve average accuracy by between 7% and 17% instead of overfitting.

This is counterintuitive because repetitive training on a tiny memory, which is perfectly memorized, was expected to harm generalization rather than improve it.

WHY IT MATTERS

What this unlocks for the field

On Tiny Episodic Memories in Continual Learning shows that tiny episodic memories plus Experience Replay can be both computationally cheap and highly effective for continual learning.

This enables builders to design continual learners that operate under strict memory and compute budgets while still maintaining strong performance across long task sequences.


