Category

Continual Learning

Continual and lifelong learning with memory — catastrophic forgetting, memory consolidation, and evolving agent memory.

10 papers

Benchmark

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Jie Zhou et al.

· 2026

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL) to jointly evolve memory and policy, treating retrieval as an explicit action. On SciWorld, PROACTAGENT reaches a 73.50% success rate (SR) versus 55.50% for GRPO+Reflexion, while cutting interaction rounds from 27.52 to 18.38.
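The core idea of retrieval as an explicit action can be sketched as a policy whose action space includes a RETRIEVE step that queries the experience base only when the agent is uncertain. This is a minimal illustration; all class and method names are assumptions, not the paper's code.

```python
# Hypothetical sketch of proactive retrieval: the agent chooses between
# acting directly and retrieving from its experience base, so retrieval
# is part of the action space rather than an always-on preprocessing step.

TASK_ACTIONS = ["move", "open", "use"]
RETRIEVE = "retrieve"

class ExperienceBase:
    """Structured store of past experience, keyed by task description."""
    def __init__(self):
        self.entries = {}

    def add(self, key, experience):
        self.entries.setdefault(key, []).append(experience)

    def query(self, key):
        return self.entries.get(key, [])

class ProactiveAgent:
    """Policy that may emit RETRIEVE instead of a task action,
    pulling experience into context only when needed."""
    def __init__(self, memory, retrieve_threshold=0.5):
        self.memory = memory
        self.retrieve_threshold = retrieve_threshold

    def uncertainty(self, observation):
        # Placeholder heuristic: a learned critic would estimate this.
        return 1.0 if observation not in self.memory.entries else 0.2

    def step(self, observation):
        if self.uncertainty(observation) > self.retrieve_threshold:
            return RETRIEVE, self.memory.query(observation)
        return TASK_ACTIONS[0], None

memory = ExperienceBase()
memory.add("boil water", "use stove, then wait")
agent = ProactiveAgent(memory)

print(agent.step("grow plant")[0])   # retrieve (unseen task)
print(agent.step("boil water")[0])   # move (known task, act directly)
```

In the paper's RL framing, the retrieve-versus-act decision is trained with rewards that penalize unnecessary retrieval, which is what drives down the interaction-round count.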

Benchmark · Agent Memory · Long-Term Memory

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Chingkwun Lam, Jiaxin Li et al.

· 2026

SSGM interposes a Governance Middleware, Read Filtering Gate, Write Validation Gate, and a dual substrate of Mutable Active Graph plus Immutable Episodic Log between agents and memory. SSGM unifies evolving-memory systems into a four-dimensional failure taxonomy and proves that periodic reconciliation can bound semantic drift over infinite horizons.
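The gate-and-log pattern can be sketched as middleware where every write is validated before it touches the mutable store but is always appended to an immutable log, so drift can later be reconciled against ground truth. Names here are illustrative assumptions, not SSGM's actual API.

```python
import time

# Minimal sketch of governance middleware between an agent and memory:
# a Write Validation Gate protects the mutable store, while an
# append-only episodic log records every attempt for reconciliation.

class GovernedMemory:
    def __init__(self):
        self.active = {}          # Mutable Active Graph (simplified to a dict)
        self.episodic_log = []    # Immutable Episodic Log (append-only)

    def _validate(self, key, value):
        # Write Validation Gate: reject malformed or redundant entries.
        return isinstance(value, str) and value != self.active.get(key)

    def write(self, key, value):
        self.episodic_log.append((time.time(), key, value))  # always logged
        if self._validate(key, value):
            self.active[key] = value
            return True
        return False

    def read(self, key):
        # Read Filtering Gate: only surface entries from the active store.
        return self.active.get(key)

mem = GovernedMemory()
print(mem.write("user_pref", "dark mode"))   # True: accepted
print(mem.write("user_pref", "dark mode"))   # False: redundant write rejected
print(mem.read("user_pref"))                 # dark mode
print(len(mem.episodic_log))                 # 2: every attempt is logged
```

Because the log is append-only, a periodic reconciliation job can replay it against the active store, which is the mechanism the paper uses to bound semantic drift.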

Benchmark · Cognitive Architecture

Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Ying Xie

· 2026

SleepGate augments transformers with a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger that periodically rewrite the KV cache during sleep micro-cycles. On the PI-LLM benchmark, SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while full KV cache, sliding window, H2O, StreamingLLM, and a decay-only ablation all stay below 18% across all depths.
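The sleep-cycle idea can be illustrated with a toy key-value memory: a step counter acts as the sleep trigger, and each consolidation pass resolves proactive interference by keeping only the most recent binding per key. This is a simplified stand-in, not the paper's KV-cache rewriting.

```python
# Toy sketch of sleep-style consolidation: conflicting updates accumulate
# between sleep cycles, and a Forgetting Gate then drops stale bindings
# in favour of the newest one (resolving proactive interference).

class SleepingKVMemory:
    def __init__(self, sleep_every=4):
        self.entries = []          # (step, key, value) tuples
        self.step = 0
        self.sleep_every = sleep_every

    def write(self, key, value):
        self.step += 1
        self.entries.append((self.step, key, value))
        if self.step % self.sleep_every == 0:   # Sleep Trigger
            self._consolidate()

    def _consolidate(self):
        # Keep only the latest value per key.
        latest = {}
        for step, key, value in self.entries:
            latest[key] = (step, key, value)
        self.entries = sorted(latest.values())

    def read(self, key):
        for step, k, v in reversed(self.entries):
            if k == key:
                return v
        return None

mem = SleepingKVMemory(sleep_every=4)
for v in ["red", "green", "blue", "yellow"]:
    mem.write("color", v)          # four conflicting updates, then sleep
print(mem.read("color"))           # yellow
print(len(mem.entries))            # 1 after consolidation
```

SleepGate's actual mechanism operates on the transformer KV cache with learned conflict tags, but the control flow (accumulate, trigger, rewrite) follows this shape.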

Benchmark · Agent Memory · Long-Term Memory

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Weiwei Xie, Shaoxiong Guo et al.

· 2026

MemEvoBench combines Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to stress-test long-term memory safety in LLM agents across 7 domains and 36 risk types. On the QA Style benchmark, MemEvoBench shows Gemini-2.5-Pro’s attack success rate (ASR) drops from 67.0% (Vanilla) to 19.0% with +ModTool in Round 1, while biased feedback can push GPT-5’s QA ASR from 59.0% to 78.0% by Round 3.

Pick · RAG · Benchmark

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025 · 2025

HippoRAG 2 combines Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.
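Dense-sparse integration can be sketched as a hybrid retrieval score that fuses embedding similarity with lexical overlap. This toy version uses a stand-in embedding and assumed weights purely to make the idea runnable; it is not HippoRAG 2's implementation.

```python
# Sketch of dense-sparse score fusion for retrieval: a weighted sum of
# a dense (cosine) score and a sparse (lexical-overlap) score.

def sparse_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def dense_score(query, doc, embed):
    qv, dv = embed(query), embed(doc)
    num = sum(a * b for a, b in zip(qv, dv))
    den = (sum(a * a for a in qv) ** 0.5) * (sum(b * b for b in dv) ** 0.5)
    return num / den if den else 0.0

def hybrid_score(query, doc, embed, w_dense=0.5):
    return (w_dense * dense_score(query, doc, embed)
            + (1 - w_dense) * sparse_score(query, doc))

# Toy character-frequency "embedding" standing in for a dense encoder.
def embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

docs = ["the hippocampus consolidates memory", "stock prices fell today"]
scores = [hybrid_score("where water boils memory", d, embed) for d in docs]
print(scores.index(max(scores)))   # 0: the memory-related passage wins
```

In HippoRAG 2 the sparse side comes from the schema-less knowledge graph rather than bag-of-words overlap, but the fusion of the two signals is the same structural move.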

Benchmark · Agent Memory · Memory Architecture

MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren et al.

· 2025

MemEvolve decomposes agent memory into Encode, Store, Retrieve, and Manage modules and meta-evolves these components via a dual-evolution process over candidate architectures. On xBench DeepSearch, MemEvolve with GPT-5 mini raises Flash Searcher pass@1 from 69.0 to 74.0 and WebWalkerQA accuracy from 58.82 to 61.18 while keeping API cost near 0.141 per query.
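The four-module decomposition can be sketched as a memory system built from swappable callables, which is what makes a meta-evolution loop over candidate architectures possible. All names and the bag-of-words encoder are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Sketch of the Encode / Store / Retrieve / Manage decomposition: each
# module is a replaceable component, so a meta-search could swap in
# different encoders, stores, or eviction policies.

@dataclass
class MemorySystem:
    encode: callable              # Encode module (swappable)
    retrieve_k: int = 2
    store: list = field(default_factory=list)   # Store module (a list here)

    def write(self, raw):
        self.store.append(self.encode(raw))

    def retrieve(self, query):
        # Retrieve module: rank stored entries by token overlap with query.
        q = set(self.encode(query))
        ranked = sorted(self.store, key=lambda e: len(q & set(e)), reverse=True)
        return ranked[: self.retrieve_k]

    def manage(self, max_size):
        # Manage module: evict oldest entries past a budget.
        self.store = self.store[-max_size:]

def bag_of_words(text):
    return text.lower().split()

mem = MemorySystem(encode=bag_of_words)
mem.write("the stove heats water")
mem.write("plants need sunlight")
mem.write("water boils at 100 c")
mem.manage(max_size=3)
hits = mem.retrieve("where water boils")
print(hits[0])   # ['water', 'boils', 'at', '100', 'c']
```

MemEvolve's dual-evolution process then scores whole module combinations on downstream tasks and mutates the best candidates, rather than hand-picking one configuration.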

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates Task Provider, User Simulator, and Performance Monitor to feed heterogeneous tasks, simulate explicit and implicit feedback, and score LLM systems across declarative and procedural memory. MemoryBench’s main finding is that state-of-the-art memory systems like A-Mem, Mem0, and MemoryOS often fail to beat naive BM25 or embedding-based RAG on partitions such as SiLo and LiLo.

RAG · Benchmark · Memory Architecture

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao, Jiarui Wang et al.

· 2025

Memory Decoder combines a Pre-training stage that aligns with kNN-LM distributions and an Inference interpolation mechanism that mixes Memory Decoder and base LLM outputs without changing base parameters. On Wikitext-103, Memory Decoder with 124M parameters reaches 13.36 perplexity on GPT2-small versus 14.76 for DAPT, and on specialized domains a single 0.5B Memory Decoder reduces average perplexity from 14.88 to 4.05 on Qwen2-0.5B.
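The inference-time interpolation described above can be written as a simple mix of two next-token distributions, p = α·p_mem + (1 − α)·p_base, leaving the base model's parameters untouched. The weight name `alpha` and the toy logits are assumptions for illustration.

```python
import math

# Sketch of kNN-LM-style output interpolation: the Memory Decoder's
# distribution is mixed with the base LM's distribution at inference,
# without modifying base parameters.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def interpolate(base_logits, memory_logits, alpha=0.3):
    """Mix distributions: p = alpha * p_mem + (1 - alpha) * p_base."""
    p_base = softmax(base_logits)
    p_mem = softmax(memory_logits)
    return [alpha * pm + (1 - alpha) * pb for pm, pb in zip(p_mem, p_base)]

# Toy 3-token vocab: the memory component sharpens domain token 2.
base = [1.0, 1.0, 1.0]          # base LM is uniform
mem = [0.0, 0.0, 5.0]           # memory strongly prefers token 2
p = interpolate(base, mem, alpha=0.5)
print(p.index(max(p)))          # 2: memory shifts the prediction
```

Because the mixing happens purely at the output layer, one pretrained Memory Decoder can be plugged into any base model sharing the same tokenizer, which is what allows the single 0.5B decoder to serve multiple domains.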