Gradient Episodic Memory for Continual Learning

Authors: David Lopez-Paz, Marc'Aurelio Ranzato

arXiv 2017

TL;DR

Gradient Episodic Memory uses constrained gradient projection with an episodic memory to reach 0.654 ACC on Incremental CIFAR100 with 5,120 memories, +0.146 over iCaRL.


THE PROBLEM

Continual learners suffer catastrophic forgetting on non-i.i.d. task streams

Gradient Episodic Memory targets catastrophic forgetting: straightforward empirical risk minimization (ERM) over a continuum of tasks causes the learner to forget how to solve past tasks.

In continual learning with many tasks seen in a single pass, non-i.i.d. input data and the absence of replay cause neural networks to overwrite previously acquired knowledge, degrading performance on earlier tasks.

HOW IT WORKS

Gradient Episodic Memory — constrained gradients with episodic memory

Gradient Episodic Memory introduces an episodic memory Mt of examples from each task, loss constraints ℓ(fθ, Mk) on the memories of past tasks, and a small quadratic program (the GEM QP) that projects the gradient so that no past-task loss increases.

You can think of Gradient Episodic Memory like RAM plus a safety controller: a small memory buffer stores key examples, and a QP solver checks every update against these stored experiences.

This constrained projection lets Gradient Episodic Memory achieve positive backward transfer that a plain context window or naive replay cannot, while tightly controlling forgetting across tasks.
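
Concretely, writing g for the gradient on the current example from task t and gk = ∇θ ℓ(fθ, Mk) for the gradient on the episodic memory of a past task k < t, GEM keeps the raw update whenever every ⟨g, gk⟩ ≥ 0 and otherwise replaces it with the nearest gradient that satisfies all constraints. A compact restatement of the projection in the paper's notation:

    \tilde{g} \;=\; \arg\min_{z} \ \tfrac{1}{2}\,\lVert g - z \rVert_2^2
    \quad \text{subject to} \quad \langle z, g_k \rangle \ge 0 \ \ \text{for all } k < t.

The paper solves this quadratic program in its dual, whose dimension equals the number of past tasks rather than the number of parameters, and applies the recovered g̃ in the SGD step.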

DIAGRAM

Continual learning step for one task in Gradient Episodic Memory

This diagram shows how Gradient Episodic Memory processes a single example, enforces gradient constraints from episodic memories, and updates parameters during continual learning.

DIAGRAM

Training and evaluation pipeline for Gradient Episodic Memory

This diagram shows how Gradient Episodic Memory trains across T tasks and fills the evaluation matrix R for ACC, BWT, and FWT.

PROCESS

How Gradient Episodic Memory Handles a Continuum of Tasks

  1. TRAIN procedure

    Gradient Episodic Memory runs the TRAIN procedure, iterating over tasks t and streaming each example once, while maintaining the episodic memories Mt and the evaluation matrix R.

  2. PROJECT step

    For each example, Gradient Episodic Memory computes the current gradient g and the past-task gradients gk, then calls the PROJECT step, which solves the GEM QP to obtain the projected gradient (a minimal sketch follows this list).

  3. EVALUATE procedure

    After finishing each task, Gradient Episodic Memory calls EVALUATE to compute accuracies over all tasks, filling one row of R used for ACC, BWT, and FWT.

  4. Causal compression view

    Under a causal-compression view, the constraints and episodic memory encourage Gradient Episodic Memory to learn correlations that are common across tasks, supporting predictions even without task descriptors.
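
To make the PROJECT step concrete, here is a minimal NumPy/SciPy sketch of the projection under the constraints ⟨g̃, gk⟩ ≥ 0. It is illustrative rather than the paper's implementation: the function name project_gem is invented for this page, the dual QP is solved with scipy.optimize.minimize (L-BFGS-B with non-negativity bounds) instead of the dedicated quadprog solver used by the authors, and the small slack added to the dual solution mirrors a trick mentioned in the paper.

    import numpy as np
    from scipy.optimize import minimize

    def project_gem(g, mem_grads, slack=1e-3):
        """Project gradient g so that <g_tilde, g_k> >= 0 for every past-task gradient g_k.

        g         : (d,) flattened gradient on the current example.
        mem_grads : (k, d) matrix whose rows are gradients on the episodic memories M_k.
        slack     : small constant added to the dual solution (optional stabilizer).
        """
        dots = mem_grads @ g
        if np.all(dots >= 0):
            # No past-task loss would increase: keep the original gradient.
            return g

        # Dual of  min_z 0.5 * ||z - g||^2  s.t.  mem_grads @ z >= 0 :
        #     min_{v >= 0}  0.5 * v^T (G G^T) v + (G g)^T v,  with  z* = g + G^T v*.
        # Its dimension is the number of past tasks, not the number of parameters.
        GGt = mem_grads @ mem_grads.T
        Gg = dots

        def dual(v):
            return 0.5 * v @ GGt @ v + Gg @ v

        def dual_grad(v):
            return GGt @ v + Gg

        k = mem_grads.shape[0]
        res = minimize(dual, np.zeros(k), jac=dual_grad,
                       bounds=[(0.0, None)] * k, method="L-BFGS-B")
        v_star = res.x + slack
        return g + mem_grads.T @ v_star

During TRAIN, g is the flattened gradient of the current (x, t, y) example, each row of mem_grads is recomputed from the stored Mk at the current parameters, and the returned vector is copied back into the per-parameter gradients before the optimizer step.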

KEY CONTRIBUTIONS

Key Contributions

  • A framework for continual learning

    Gradient Episodic Memory formalizes a continuum of tasks with descriptors ti and introduces the metrics ACC, BWT, and FWT, computed from an evaluation matrix R ∈ R^{T×T} (formulas after this list).

  • Gradient Episodic Memory algorithm

    Gradient Episodic Memory uses the episodic memories Mt and a GEM QP that projects gradients so that ⟨g̃, gk⟩ ≥ 0 for every past task k, allowing positive backward transfer while avoiding forgetting.

  • Empirical results on MNIST and CIFAR100

    Gradient Episodic Memory reaches 0.654 ACC on Incremental CIFAR100 with 5,120 memories and matches i.i.d. training ACC on MNIST rotations with minimal negative BWT.
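
For reference, with Ri,j the test accuracy on task j after the model finishes learning task i, and b̄i the test accuracy on task i at random initialization, the three metrics from the paper are:

    \mathrm{ACC} = \frac{1}{T}\sum_{i=1}^{T} R_{T,i}, \qquad
    \mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i} - R_{i,i}\right), \qquad
    \mathrm{FWT} = \frac{1}{T-1}\sum_{i=2}^{T}\left(R_{i-1,i} - \bar{b}_i\right).

Larger is better for all three; positive BWT means later learning improved earlier tasks, while negative BWT measures forgetting.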

RESULTS

By the Numbers

GEM vs. iCaRL, final ACC on Incremental CIFAR100 (20 tasks):

  Episodic memory size    GEM ACC    Gain over iCaRL
  5,120                   0.654      +0.146
  2,560                   0.633      +0.133
  1,280                   0.579      +0.085
  200                     0.487      +0.051

On Incremental CIFAR100 with 20 tasks, Gradient Episodic Memory is evaluated under a single-pass continual learning protocol. These ACC numbers show that Gradient Episodic Memory consistently improves over iCaRL as the episodic memory grows, demonstrating effective use of stored examples and constrained updates.

BENCHMARK

ACC as a function of episodic memory size on CIFAR100

Final ACC on Incremental CIFAR100 after 20 tasks for Gradient Episodic Memory and iCaRL at different memory sizes.

KEY INSIGHT

The Counterintuitive Finding

On MNIST rotations with five epochs per task, Gradient Episodic Memory achieves 0.89 ACC with a BWT of −0.02, nearly matching the 0.89 ACC and −0.00 BWT of i.i.d. shuffled training.

This is surprising because repeated passes over each task usually exacerbate catastrophic forgetting, yet Gradient Episodic Memory maintains near-oracle performance while memory-less baselines drop to 0.43–0.67 ACC with far more negative BWT.

WHY IT MATTERS

What this unlocks for the field

Gradient Episodic Memory shows that a single network can learn long sequences of tasks with minimal forgetting by enforcing gradient constraints from a small episodic memory.

This enables builders to design continual learners that approach i.i.d. multitask performance in realistic streaming settings, without freezing parameters or spawning separate models per task.


