From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi et al.

ICML 2025

TL;DR

HippoRAG 2 integrates dense passage nodes, query-to-triple Personalized PageRank, and LLM-based recognition memory to reach 59.8 average F1 vs 57.0 for NV-Embed-v2 across seven RAG benchmarks.

THE PROBLEM

RAG lacks human-like memory robustness across tasks: the strong embedding baseline NV-Embed-v2 tops out at 57.0 average F1, versus 59.8 for HippoRAG 2.

Existing structure-augmented RAG systems drop below strong embedding-based RAG on all three benchmark types, despite being designed for sense-making or associativity.

This means systems like GraphRAG, RAPTOR, LightRAG, and HippoRAG fail to maintain factual memory, sense-making, and associativity simultaneously, limiting continual learning.

HOW IT WORKS

HippoRAG 2: dense-sparse integration plus recognition memory

HippoRAG 2 centers on Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory to build long-term memory.

You can think of HippoRAG 2 as a hippocampus plus neocortex: phrase nodes are sparse concept codes, while passage nodes are dense episodic traces, tied together by Personalized PageRank.

This design lets HippoRAG 2 retrieve multi-hop, context-rich evidence that a plain context window or pure vector retriever cannot assemble reliably.
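The offline side of this design can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it stubs out the LLM-based OpenIE step with hard-coded triples and stores the schema-less Knowledge Graph as a plain adjacency map, with sparse phrase nodes linked to their dense passage nodes by context edges.

```python
from collections import defaultdict

def index_corpus(triples_per_passage):
    """Sketch of offline indexing: phrase nodes come from OpenIE triples
    (sparse, concept-like); each passage gets its own node (dense, episodic)
    connected to its phrases by context edges."""
    adj = defaultdict(set)
    for pid, triples in enumerate(triples_per_passage):
        pnode = f"passage:{pid}"
        adj[pnode]  # ensure the passage node exists even with no triples
        for subj, rel, obj in triples:
            adj[subj].add(obj); adj[obj].add(subj)      # relation edge
            adj[pnode].add(subj); adj[subj].add(pnode)  # context edge
            adj[pnode].add(obj); adj[obj].add(pnode)    # context edge
    return {n: sorted(nbrs) for n, nbrs in adj.items()}

# toy corpus; in the real system an LLM extracts these triples
triples = [[("Stanford", "located in", "California")],
           [("Alan Turing", "studied at", "Cambridge")]]
graph = index_corpus(triples)
```

The key design point this captures is that passages and phrases live in one graph, so a later graph walk can move between episodic context and conceptual structure.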

DIAGRAM

Online retrieval pipeline with recognition memory

Sequence of how HippoRAG 2 processes a query through query-to-triple retrieval, LLM-based triple filtering, graph search, and QA reading.

DIAGRAM

Evaluation pipeline across factual, associative, and sense-making tasks

Flowchart of how HippoRAG 2 is evaluated on simple QA, multi-hop QA, and discourse understanding benchmarks using shared retrieval and QA readers.

PROCESS

How HippoRAG 2 Handles a Query in Online Retrieval

  1. Dense-Sparse Integration

    HippoRAG 2 embeds the query and links it to phrase and passage nodes via Dense-Sparse Integration, preparing seed candidates in the Knowledge Graph.

  2. Deeper Contextualization

    HippoRAG 2 applies Deeper Contextualization by performing query-to-triple retrieval, aligning the full query with contextual triples instead of isolated entities.

  3. Recognition Memory

    HippoRAG 2 invokes Recognition Memory, using an LLM to filter the top-k retrieved triples and keep only those relevant as seed phrase nodes.

  4. Personalized PageRank Graph Search

    HippoRAG 2 runs Personalized PageRank on the Knowledge Graph with weighted phrase and passage seeds, then feeds the top passages to the QA reader for answering.
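The graph-search step above can be illustrated with a self-contained power-iteration version of Personalized PageRank. This is a sketch under simplifying assumptions (an unweighted, undirected toy graph and a single seed phrase), not the paper's implementation, which uses weighted phrase and passage seeds over the full Knowledge Graph.

```python
def personalized_pagerank(adj, seeds, alpha=0.5, iters=100):
    """Power-iteration Personalized PageRank over {node: [neighbors]};
    random restarts land on the seed nodes in proportion to their weight."""
    total = sum(seeds.values())
    reset = {n: seeds.get(n, 0.0) / total for n in adj}
    rank = dict(reset)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * reset[n] for n in adj}
        for n, nbrs in adj.items():
            if nbrs:
                share = alpha * rank[n] / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
        rank = nxt
    return rank

# toy memory graph: context edges tie phrase nodes to their passages
adj = {
    "passage:0": ["Stanford", "California"],
    "passage:1": ["Alan Turing", "Cambridge"],
    "Stanford": ["passage:0"],
    "California": ["passage:0"],
    "Alan Turing": ["passage:1"],
    "Cambridge": ["passage:1"],
}
# seed = a recognition-memory-approved phrase node for the query
scores = personalized_pagerank(adj, {"Cambridge": 1.0})
top_passages = sorted((n for n in adj if n.startswith("passage:")),
                      key=scores.get, reverse=True)
```

Because restarts concentrate probability mass near the seeds, passages contextually linked to the approved phrases outrank unrelated ones, which is what lets the walk assemble multi-hop evidence.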

KEY CONTRIBUTIONS

Key Contributions

  • Dense-Sparse Integration

    HippoRAG 2 augments phrase-based Knowledge Graph nodes with passage nodes and context edges, enabling dense-sparse integration and achieving 78.2 average Recall@5 versus 73.4 for NV-Embed-v2.

  • Deeper Contextualization and Recognition Memory

    HippoRAG 2 introduces query-to-triple linking plus Recognition Memory triple filtering, improving multi-hop Recall@5 on MuSiQue from 69.7 to 74.7 over NV-Embed-v2.

  • Non-Parametric Continual Learning Evaluation

    HippoRAG 2 is evaluated across factual, multi-hop, and discourse tasks, reaching 59.8 average F1 versus 57.0 for NV-Embed-v2 with Llama 3.3 70B as the QA reader.
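The Recognition Memory filtering named in the second contribution can be approximated as follows. In the actual system an LLM judges each retrieved triple against the query; here a simple token-overlap score stands in for that LLM call so the sketch is runnable, and the query, triples, and function name are illustrative only.

```python
def recognition_memory_filter(query, triples, keep=2):
    """Stand-in for LLM-based recognition memory: score each retrieved
    triple by token overlap with the query and keep the most relevant.
    The real system prompts an LLM to accept or reject each triple."""
    q_tokens = set(query.lower().split())
    def overlap(triple):
        return len(q_tokens & set(" ".join(triple).lower().split()))
    return sorted(triples, key=overlap, reverse=True)[:keep]

query = "Where did Alan Turing study?"
candidates = [
    ("Alan Turing", "studied at", "Cambridge"),
    ("Stanford", "located in", "California"),
    ("Alan Turing", "born in", "London"),
]
kept = recognition_memory_filter(query, candidates)
```

The surviving triples supply the seed phrase nodes for the Personalized PageRank step, so the filter's precision directly shapes where the graph walk starts.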

RESULTS

By the Numbers

Avg F1

59.8

+2.8 over NV-Embed-v2

2Wiki F1

71.0

+9.5 over NV-Embed-v2

MuSiQue Recall@5

74.7

+5.0 over NV-Embed-v2

2Wiki Recall@5

90.4

+13.9 over NV-Embed-v2

On the joint RAG benchmark suite spanning NaturalQuestions, PopQA, MuSiQue, 2Wiki, HotpotQA, LV-Eval, and NarrativeQA, HippoRAG 2 demonstrates stronger retrieval and QA than NV-Embed-v2. These results show HippoRAG 2 can maintain factual, associative, and sense-making performance in a single non-parametric continual learning system.

BENCHMARK

QA performance (F1 scores) on RAG benchmarks using Llama 3.3 70B Instruct as the QA reader

Average F1 score across seven RAG benchmarks with Llama 3.3 70B Instruct QA reader.
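The F1 scores reported here are token-level QA F1. A standard SQuAD-style definition (assumed here; not necessarily the authors' exact evaluation script, which may also apply answer normalization) is:

```python
from collections import Counter

def qa_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the cat" against gold "cat" gives precision 0.5 and recall 1.0, hence F1 of 2/3.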

BENCHMARK

Retrieval performance (passage recall@5) on RAG benchmarks

Average passage Recall@5 across NQ, PopQA, MuSiQue, 2Wiki, and HotpotQA.
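Passage Recall@5 has a simple generic definition, assumed here to match the paper's metric: the fraction of gold supporting passages that appear among the top five retrieved.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold passages found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)
```

On multi-hop datasets like MuSiQue and 2Wiki this metric only saturates when all supporting passages for a question are retrieved together, which is why it rewards the graph-based evidence assembly described above.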

KEY INSIGHT

The Counterintuitive Finding

Despite adding graph structure and LLM filtering, HippoRAG 2 loses nothing on simple QA, reaching 63.3 F1 on NQ versus 61.9 for NV-Embed-v2.

This is surprising because earlier structure-augmented RAG systems like RAPTOR and GraphRAG often degraded factual QA, yet HippoRAG 2 gains +1.4 F1 on NQ while still improving multi-hop tasks.

WHY IT MATTERS

What this unlocks for the field

HippoRAG 2 shows that dense-sparse Knowledge Graph integration and Recognition Memory can give LLMs a more human-like, task-robust non-parametric memory.

Builders can now design RAG systems that handle factual lookup, multi-hop reasoning, and long-narrative sense-making in one architecture, without sacrificing performance on any single regime.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026

Multi-Agent Memory Architecture organizes **Agent IO Layer**, **Agent Cache Layer**, and **Agent Memory Layer** plus **Agent Cache Sharing** and **Agent Memory Access** protocols into a unified architectural framing for multi-agent systems. As a position paper, it reports no main benchmark result or numeric comparison against any baseline.