Category

Continual Learning

Continual and lifelong learning with memory — catastrophic forgetting, memory consolidation, and evolving agent memory.

10 papers

Benchmark

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Jie Zhou et al.

· 2026

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL) to jointly evolve memory and policy, treating retrieval as an explicit action. On SciWorld, PROACTAGENT reaches a 73.50% success rate (SR) versus 55.50% for GRPO+Reflexion, while cutting interaction rounds from 27.52 to 18.38.
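The core idea of retrieval as an explicit action can be sketched as a policy whose action space includes a RETRIEVE step that queries the experience base only when the agent is uncertain. This is a minimal illustration; all class and method names are assumptions, not the paper's code.

```python
# Hypothetical sketch of proactive retrieval: the agent chooses between
# acting directly and retrieving from its experience base, so retrieval
# is part of the action space rather than an always-on preprocessing step.

TASK_ACTIONS = ["move", "open", "use"]
RETRIEVE = "retrieve"

class ExperienceBase:
    """Structured store of past experience, keyed by task description."""
    def __init__(self):
        self.entries = {}

    def add(self, key, experience):
        self.entries.setdefault(key, []).append(experience)

    def query(self, key):
        return self.entries.get(key, [])

class ProactiveAgent:
    """Policy that may emit RETRIEVE instead of a task action,
    pulling experience into context only when needed."""
    def __init__(self, memory, retrieve_threshold=0.5):
        self.memory = memory
        self.retrieve_threshold = retrieve_threshold

    def uncertainty(self, observation):
        # Placeholder heuristic: a learned critic would estimate this.
        return 1.0 if observation not in self.memory.entries else 0.2

    def step(self, observation):
        if self.uncertainty(observation) > self.retrieve_threshold:
            return RETRIEVE, self.memory.query(observation)
        return TASK_ACTIONS[0], None

memory = ExperienceBase()
memory.add("boil water", "use stove, then wait")
agent = ProactiveAgent(memory)

print(agent.step("grow plant")[0])   # retrieve (unseen task)
print(agent.step("boil water")[0])   # move (known task, act directly)
```

In the paper's RL framing, the retrieve-versus-act decision is trained with rewards that penalize unnecessary retrieval, which is what drives down the interaction-round count.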

Benchmark · Agent Memory · Long-Term Memory

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Chingkwun Lam, Jiaxin Li et al.

· 2026

SSGM interposes a Governance Middleware, Read Filtering Gate, Write Validation Gate, and a dual substrate of Mutable Active Graph plus Immutable Episodic Log between agents and memory. SSGM unifies evolving-memory systems into a four-dimensional failure taxonomy and proves that periodic reconciliation can bound semantic drift over infinite horizons.
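The gate-and-log pattern can be sketched as middleware where every write is validated before it touches the mutable store but is always appended to an immutable log, so drift can later be reconciled against ground truth. Names here are illustrative assumptions, not SSGM's actual API.

```python
import time

# Minimal sketch of governance middleware between an agent and memory:
# a Write Validation Gate protects the mutable store, while an
# append-only episodic log records every attempt for reconciliation.

class GovernedMemory:
    def __init__(self):
        self.active = {}          # Mutable Active Graph (simplified to a dict)
        self.episodic_log = []    # Immutable Episodic Log (append-only)

    def _validate(self, key, value):
        # Write Validation Gate: reject malformed or redundant entries.
        return isinstance(value, str) and value != self.active.get(key)

    def write(self, key, value):
        self.episodic_log.append((time.time(), key, value))  # always logged
        if self._validate(key, value):
            self.active[key] = value
            return True
        return False

    def read(self, key):
        # Read Filtering Gate: only surface entries from the active store.
        return self.active.get(key)

mem = GovernedMemory()
print(mem.write("user_pref", "dark mode"))   # True: accepted
print(mem.write("user_pref", "dark mode"))   # False: redundant write rejected
print(mem.read("user_pref"))                 # dark mode
print(len(mem.episodic_log))                 # 2: every attempt is logged
```

Because the log is append-only, a periodic reconciliation job can replay it against the active store, which is the mechanism the paper uses to bound semantic drift.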

Benchmark · Cognitive Architecture

Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Ying Xie

· 2026

SleepGate augments transformers with a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger that periodically rewrite the KV cache during sleep micro-cycles. On the PI-LLM benchmark, SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while full KV cache, sliding window, H2O, StreamingLLM, and a decay-only ablation all stay below 18% across all depths.
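The sleep-cycle idea can be illustrated with a toy key-value memory: a step counter acts as the sleep trigger, and each consolidation pass resolves proactive interference by keeping only the most recent binding per key. This is a simplified stand-in, not the paper's KV-cache rewriting.

```python
# Toy sketch of sleep-style consolidation: conflicting updates accumulate
# between sleep cycles, and a Forgetting Gate then drops stale bindings
# in favour of the newest one (resolving proactive interference).

class SleepingKVMemory:
    def __init__(self, sleep_every=4):
        self.entries = []          # (step, key, value) tuples
        self.step = 0
        self.sleep_every = sleep_every

    def write(self, key, value):
        self.step += 1
        self.entries.append((self.step, key, value))
        if self.step % self.sleep_every == 0:   # Sleep Trigger
            self._consolidate()

    def _consolidate(self):
        # Keep only the latest value per key.
        latest = {}
        for step, key, value in self.entries:
            latest[key] = (step, key, value)
        self.entries = sorted(latest.values())

    def read(self, key):
        for step, k, v in reversed(self.entries):
            if k == key:
                return v
        return None

mem = SleepingKVMemory(sleep_every=4)
for v in ["red", "green", "blue", "yellow"]:
    mem.write("color", v)          # four conflicting updates, then sleep
print(mem.read("color"))           # yellow
print(len(mem.entries))            # 1 after consolidation
```

SleepGate's actual mechanism operates on the transformer KV cache with learned conflict tags, but the control flow (accumulate, trigger, rewrite) follows this shape.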

Benchmark · Agent Memory · Long-Term Memory

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Weiwei Xie, Shaoxiong Guo et al.

· 2026

MemEvoBench combines Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to stress-test long-term memory safety in LLM agents across 7 domains and 36 risk types. On the QA Style benchmark, MemEvoBench shows Gemini-2.5-Pro’s attack success rate (ASR) drops from 67.0% (Vanilla) to 19.0% with +ModTool in Round 1, while biased feedback can push GPT-5’s QA ASR from 59.0% to 78.0% by Round 3.

Pick · RAG · Benchmark

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025 · 2025

HippoRAG 2 combines Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.
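Dense-sparse integration can be sketched as a hybrid retrieval score that fuses embedding similarity with lexical overlap. This toy version uses a stand-in embedding and assumed weights purely to make the idea runnable; it is not HippoRAG 2's implementation.

```python
# Sketch of dense-sparse score fusion for retrieval: a weighted sum of
# a dense (cosine) score and a sparse (lexical-overlap) score.

def sparse_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def dense_score(query, doc, embed):
    qv, dv = embed(query), embed(doc)
    num = sum(a * b for a, b in zip(qv, dv))
    den = (sum(a * a for a in qv) ** 0.5) * (sum(b * b for b in dv) ** 0.5)
    return num / den if den else 0.0

def hybrid_score(query, doc, embed, w_dense=0.5):
    return (w_dense * dense_score(query, doc, embed)
            + (1 - w_dense) * sparse_score(query, doc))

# Toy character-frequency "embedding" standing in for a dense encoder.
def embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

docs = ["the hippocampus consolidates memory", "stock prices fell today"]
scores = [hybrid_score("where water boils memory", d, embed) for d in docs]
print(scores.index(max(scores)))   # 0: the memory-related passage wins
```

In HippoRAG 2 the sparse side comes from the schema-less knowledge graph rather than bag-of-words overlap, but the fusion of the two signals is the same structural move.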

Benchmark · Agent Memory · Memory Architecture

MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren et al.

· 2025

MemEvolve decomposes agent memory into Encode, Store, Retrieve, and Manage modules and meta-evolves these components via a dual-evolution process over candidate architectures. On xBench DeepSearch, MemEvolve with GPT-5 mini raises Flash Searcher pass@1 from 69.0 to 74.0 and WebWalkerQA accuracy from 58.82 to 61.18 while keeping API cost near 0.141 per query.
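The four-module decomposition can be sketched as a memory system built from swappable callables, which is what makes a meta-evolution loop over candidate architectures possible. All names and the bag-of-words encoder are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Sketch of the Encode / Store / Retrieve / Manage decomposition: each
# module is a replaceable component, so a meta-search could swap in
# different encoders, stores, or eviction policies.

@dataclass
class MemorySystem:
    encode: callable              # Encode module (swappable)
    retrieve_k: int = 2
    store: list = field(default_factory=list)   # Store module (a list here)

    def write(self, raw):
        self.store.append(self.encode(raw))

    def retrieve(self, query):
        # Retrieve module: rank stored entries by token overlap with query.
        q = set(self.encode(query))
        ranked = sorted(self.store, key=lambda e: len(q & set(e)), reverse=True)
        return ranked[: self.retrieve_k]

    def manage(self, max_size):
        # Manage module: evict oldest entries past a budget.
        self.store = self.store[-max_size:]

def bag_of_words(text):
    return text.lower().split()

mem = MemorySystem(encode=bag_of_words)
mem.write("the stove heats water")
mem.write("plants need sunlight")
mem.write("water boils at 100 c")
mem.manage(max_size=3)
hits = mem.retrieve("where water boils")
print(hits[0])   # ['water', 'boils', 'at', '100', 'c']
```

MemEvolve's dual-evolution process then scores whole module combinations on downstream tasks and mutates the best candidates, rather than hand-picking one configuration.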

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates Task Provider, User Simulator, and Performance Monitor to feed heterogeneous tasks, simulate explicit and implicit feedback, and score LLM systems across declarative and procedural memory. MemoryBench’s main finding is that state-of-the-art memory systems like A-Mem, Mem0, and MemoryOS often fail to beat naive BM25 or embedding-based RAG on partitions such as SiLo and LiLo.

RAG · Benchmark · Memory Architecture

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao, Jiarui Wang et al.

· 2025

Memory Decoder combines a Pre-training stage that aligns with kNN-LM distributions and an Inference interpolation mechanism that mixes Memory Decoder and base LLM outputs without changing base parameters. On Wikitext-103, Memory Decoder with 124M parameters reaches 13.36 perplexity on GPT2-small versus 14.76 for DAPT, and on specialized domains a single 0.5B Memory Decoder reduces average perplexity from 14.88 to 4.05 on Qwen2-0.5B.
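The inference-time interpolation described above can be written as a simple mix of two next-token distributions, p = α·p_mem + (1 − α)·p_base, leaving the base model's parameters untouched. The weight name `alpha` and the toy logits are assumptions for illustration.

```python
import math

# Sketch of kNN-LM-style output interpolation: the Memory Decoder's
# distribution is mixed with the base LM's distribution at inference,
# without modifying base parameters.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def interpolate(base_logits, memory_logits, alpha=0.3):
    """Mix distributions: p = alpha * p_mem + (1 - alpha) * p_base."""
    p_base = softmax(base_logits)
    p_mem = softmax(memory_logits)
    return [alpha * pm + (1 - alpha) * pb for pm, pb in zip(p_mem, p_base)]

# Toy 3-token vocab: the memory component sharpens domain token 2.
base = [1.0, 1.0, 1.0]          # base LM is uniform
mem = [0.0, 0.0, 5.0]           # memory strongly prefers token 2
p = interpolate(base, mem, alpha=0.5)
print(p.index(max(p)))          # 2: memory shifts the prediction
```

Because the mixing happens purely at the output layer, one pretrained Memory Decoder can be plugged into any base model sharing the same tokenizer, which is what allows the single 0.5B decoder to serve multiple domains.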