Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

Authors: Gaurav Bhatt, James Ross, Leonid Sigal

arXiv 2024

TL;DR

MD-DETR uses localized memory retrieval plus background thresholding to prevent catastrophic forgetting, improving MS-COCO continual detection mAP@A to 50.2 vs 39.2 for CL-DETR (+11.0).



THE PROBLEM

Continual detectors forget past classes via background relegation

MD-DETR targets continual object detection where existing systems lose 10–15 mAP points on past classes across tasks, even with replay buffers.

When categories like person reappear unlabeled in later tasks, DETR-style detectors treat them as background, causing catastrophic forgetting and an unstable stability–plasticity trade-off.

HOW IT WORKS

MD-DETR: Memory-augmented Deformable-DETR with localized retrieval

MD-DETR combines a frozen Deformable-DETR backbone with task-partitioned memory modules, a learnable query function, and background thresholding to stabilize continual detection.

You can think of MD-DETR as adding an external RAM bank to Deformable-DETR, where a learned controller picks a few relevant memory rows per image instead of rewriting weights.

This localized retrieval lets MD-DETR reuse past-task knowledge without replay buffers, something a plain transformer context window or naive fine-tuning cannot achieve.
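
As a rough illustration, here is a minimal PyTorch sketch of task-partitioned memory with localized top-k retrieval. The class name, the key-based dot-product scoring, and the fusion point are assumptions for clarity; the paper's actual query function and module shapes differ in detail.

```python
import torch
import torch.nn.functional as F

class LocalizedMemory(torch.nn.Module):
    """Task-partitioned memory bank with localized top-k retrieval (sketch).

    Each task owns a chunk of N_m learnable memory units of length L_m;
    a per-image query scores every unit, and only the top-k units are
    blended into the frozen detector's features.
    """

    def __init__(self, num_tasks, n_units, unit_len, dim, k=4):
        super().__init__()
        # num_tasks chunks of N_m units; each unit is an (L_m, dim) block.
        self.memory = torch.nn.Parameter(
            torch.randn(num_tasks * n_units, unit_len, dim) * 0.02)
        # One key per memory unit, used only for retrieval scoring.
        self.keys = torch.nn.Parameter(
            torch.randn(num_tasks * n_units, dim) * 0.02)
        self.k = k

    def forward(self, query):
        # query: (batch, dim) image-level query from the frozen backbone.
        scores = query @ self.keys.T                 # (batch, total_units)
        top = scores.topk(self.k, dim=-1)            # localize to k units
        weights = F.softmax(top.values, dim=-1)      # (batch, k)
        units = self.memory[top.indices]             # (batch, k, L_m, dim)
        # Weighted combination of the retrieved units, fused downstream.
        return (weights[..., None, None] * units).sum(dim=1)
```

Only the memory units, their keys, and the small query heads receive gradients; the Deformable-DETR weights stay frozen, which is what prevents drift on past tasks.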

DIAGRAM

Localized query and memory retrieval pipeline in MD-DETR

This diagram shows how MD-DETR ranks proposals, builds a localized query, and retrieves a weighted memory combination during inference.

DIAGRAM

Continual training loop and evaluation for MD-DETR

This diagram shows how MD-DETR is trained over tasks with background thresholding and then evaluated using mAP@P, mAP@C, and mAP@A.

PROCESS

How MD-DETR Handles a Continual Detection Task

  1. 01

    Memory modules for Deformable-DETR

    MD-DETR freezes the Deformable-DETR encoder and decoder, then allocates task-specific chunks of the memory module M, with N_m units each of length L_m.

  2. 02

    Query function for localized memory retrieval

    MD-DETR uses the ranking function g_ψ to score proposals, builds the query function Q(x, θ∇, α), and regularizes it with a loss L_Q computed from Hungarian assignments (see the first sketch after this list).

  3. 03

    Continual optimization of the task parameters θ*_Tt

    MD-DETR updates only θ*_Tt, masking gradients on past-class embeddings and using a reduced learning rate for the bounding-box embeddings (see the second sketch after this list).

  4. 04

    Continual optimization against background relegation

    MD-DETR applies background thresholding with a threshold δ_b^t, generating pseudo-labels for past classes and training with the combined loss L_detr + λ_Q L_Q (see the third sketch after this list).
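
The sketches below flesh out steps 02–04 in PyTorch. First, the ranking-and-pooling idea behind g_ψ and the localized query; the head name, top-r selection, and softmax weighting are assumptions, and the L_Q regularization via Hungarian assignments is omitted:

```python
import torch

def build_localized_query(proposals, rank_head, top_r=10):
    """Pool the top-ranked decoder proposals into one retrieval query (sketch).

    proposals: (batch, num_queries, dim) decoder output embeddings.
    rank_head: small learned scorer standing in for g_psi,
               e.g. torch.nn.Linear(dim, 1).
    Returns:   (batch, dim) localized query for the memory bank.
    """
    scores = rank_head(proposals).squeeze(-1)           # (batch, num_queries)
    top = scores.topk(top_r, dim=-1)                    # highest-ranked proposals
    weights = torch.softmax(top.values, dim=-1)         # (batch, top_r)
    idx = top.indices.unsqueeze(-1).expand(-1, -1, proposals.size(-1))
    picked = proposals.gather(1, idx)                   # (batch, top_r, dim)
    return (weights.unsqueeze(-1) * picked).sum(dim=1)  # weighted pooling
```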
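
Next, a sketch of step 03's selective updates, assuming hypothetical attribute names (model.memory, model.class_embed, model.bbox_embed) on a Deformable-DETR-style detector:

```python
import torch

def continual_optimizer(model, past_class_ids, base_lr=2e-4, box_lr_scale=0.1):
    """Train only the task-specific parameters; freeze old-class rows (sketch)."""
    past = torch.tensor(sorted(past_class_ids))

    def mask_past_rows(grad):
        # Zero the gradient rows belonging to past classes so their
        # class embeddings are never overwritten by the new task.
        grad = grad.clone()
        grad[past] = 0.0
        return grad

    model.class_embed.weight.register_hook(mask_past_rows)

    return torch.optim.AdamW([
        {"params": model.memory.parameters(), "lr": base_lr},
        {"params": model.class_embed.parameters(), "lr": base_lr},
        # A reduced learning rate keeps the box embeddings near their old optimum.
        {"params": model.bbox_embed.parameters(), "lr": base_lr * box_lr_scale},
    ])
```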
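
Finally, a sketch of step 04's background thresholding, assuming the common DETR output convention of pred_logits and pred_boxes; delta_b stands in for δ_b^t:

```python
import torch

@torch.no_grad()
def pseudo_label_past_classes(model, images, past_class_ids, delta_b=0.4):
    """Keep confident detections of past classes as pseudo ground truth (sketch).

    Without this step, unlabeled past-class objects in the new task's
    images are treated as background, which is what drives forgetting.
    """
    out = model(images)                             # DETR-style output dict
    probs = out["pred_logits"].softmax(-1)          # (batch, queries, classes)
    conf, labels = probs.max(-1)
    is_past = torch.isin(labels, torch.tensor(sorted(past_class_ids)))
    keep = (conf > delta_b) & is_past               # threshold delta_b^t
    pseudo = []
    for b in range(images.size(0)):
        pseudo.append({"labels": labels[b][keep[b]],
                       "boxes": out["pred_boxes"][b][keep[b]]})
    return pseudo
```

These pseudo-labels are merged with the new task's ground truth before computing L_detr + λ_Q L_Q, so past-class objects are supervised rather than relegated to background.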

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Memory augmented Deformable DETR

    MD-DETR integrates a frozen Deformable-DETR with task-partitioned memory modules M, a class embedding, and a bounding-box embedding for replay-free continual detection.

  • 02

    Localized query retrieval mechanism

    MD-DETR introduces a learnable query function Q(x, θ∇, α) with a ranking function g_ψ and L_Q regularization to retrieve a weighted combination of memory units.

  • 03

    Continual optimization for background relegation

    MD-DETR proposes background thresholding with δ_b^t and gradient masking, yielding improvements of about 5–7 mAP on MS-COCO and PASCAL-VOC.

RESULTS

By the Numbers

Task 4 mAP@A: 50.2 (+11.0 over CL-DETR)

Task 4 mAP@P: 51.5 (+13.3 over PROB)

VOC 10+10 mAP@A: 73.2 (+6.7 over PROB)

VOC 19+1 mAP@A: 76.1 (+3.5 over PROB)

These metrics come from the MS-COCO multi-step and PASCAL-VOC A+B continual detection benchmarks, showing that MD-DETR maintains past classes while improving overall mAP without any replay buffer.

BENCHMARK

MS-COCO multi-step continual detection on Task 4

mAP@A on MS-COCO Task 4 for MD-DETR and strong baselines.

BENCHMARK

PASCAL-VOC continual detection mAP@A across A+B settings

mAP@A on VOC 2007 for MD-DETR versus PROB across three A+B splits.

KEY INSIGHT

The Counterintuitive Finding

MD-DETR, a replay-free method, reaches 50.2 mAP@A on MS-COCO Task 4, beating the replay-based PROB at 39.9 by 10.3 points.

This is surprising because replay buffers are usually assumed necessary for stability, yet MD-DETR achieves a better stability–plasticity trade-off without storing any past images.

WHY IT MATTERS

What this unlocks for the field

MD-DETR shows that memory networks with localized retrieval and background thresholding can sustain high mAP across tasks without replay buffers.

Builders can now design privacy-preserving continual detectors that avoid sample storage while still handling background relegation and maintaining strong performance on both past and new classes.
