From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi et al.

ICML 2025

TL;DR

HippoRAG 2 integrates dense passage nodes, query-to-triple Personalized PageRank, and LLM-based recognition memory to reach 59.8 average F1 vs 57.0 for NV-Embed-v2 across seven RAG benchmarks.

THE PROBLEM

RAG lacks human-like memory robustness across tasks: the strong embedding baseline NV-Embed-v2 tops out at 57.0 average F1, versus 59.8 for HippoRAG 2.

Existing structure-augmented RAG systems drop below strong embedding-based RAG on all three benchmark types, despite being designed for sense-making or associativity.

This means systems like GraphRAG, RAPTOR, LightRAG, and HippoRAG fail to maintain factual memory, sense-making, and associativity simultaneously, limiting continual learning.

HOW IT WORKS

HippoRAG 2: dense-sparse integration plus recognition memory

HippoRAG 2 centers on Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory to build long-term memory.

You can think of HippoRAG 2 as a hippocampus plus neocortex: phrase nodes are sparse concept codes, while passage nodes are dense episodic traces, tied together by Personalized PageRank.

This design lets HippoRAG 2 retrieve multi-hop, context-rich evidence that a plain context window or pure vector retriever cannot assemble reliably.
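The offline side of this design can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it stubs out the LLM-based OpenIE step with hard-coded triples and stores the schema-less Knowledge Graph as a plain adjacency map, with sparse phrase nodes linked to their dense passage nodes by context edges.

```python
from collections import defaultdict

def index_corpus(triples_per_passage):
    """Sketch of offline indexing: phrase nodes come from OpenIE triples
    (sparse, concept-like); each passage gets its own node (dense, episodic)
    connected to its phrases by context edges."""
    adj = defaultdict(set)
    for pid, triples in enumerate(triples_per_passage):
        pnode = f"passage:{pid}"
        adj[pnode]  # ensure the passage node exists even with no triples
        for subj, rel, obj in triples:
            adj[subj].add(obj); adj[obj].add(subj)      # relation edge
            adj[pnode].add(subj); adj[subj].add(pnode)  # context edge
            adj[pnode].add(obj); adj[obj].add(pnode)    # context edge
    return {n: sorted(nbrs) for n, nbrs in adj.items()}

# toy corpus; in the real system an LLM extracts these triples
triples = [[("Stanford", "located in", "California")],
           [("Alan Turing", "studied at", "Cambridge")]]
graph = index_corpus(triples)
```

The key design point this captures is that passages and phrases live in one graph, so a later graph walk can move between episodic context and conceptual structure.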

DIAGRAM

Online retrieval pipeline with recognition memory

Sequence of how HippoRAG 2 processes a query through query-to-triple retrieval, LLM-based triple filtering, graph search, and QA reading.

DIAGRAM

Evaluation pipeline across factual, associative, and sense-making tasks

Flowchart of how HippoRAG 2 is evaluated on simple QA, multi-hop QA, and discourse understanding benchmarks using shared retrieval and QA readers.

PROCESS

How HippoRAG 2 Handles a Query in Online Retrieval

  1. Dense-Sparse Integration

    HippoRAG 2 embeds the query and links it to phrase and passage nodes via Dense-Sparse Integration, preparing seed candidates in the Knowledge Graph.

  2. Deeper Contextualization

    HippoRAG 2 applies Deeper Contextualization by performing query-to-triple retrieval, aligning the full query with contextual triples instead of isolated entities.

  3. Recognition Memory

    HippoRAG 2 invokes Recognition Memory, using an LLM to filter the top-k retrieved triples and keep only those relevant as seed phrase nodes.

  4. Personalized PageRank Graph Search

    HippoRAG 2 runs Personalized PageRank on the Knowledge Graph with weighted phrase and passage seeds, then feeds the top passages to the QA reader for answering.
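The graph-search step above can be illustrated with a self-contained power-iteration version of Personalized PageRank. This is a sketch under simplifying assumptions (an unweighted, undirected toy graph and a single seed phrase), not the paper's implementation, which uses weighted phrase and passage seeds over the full Knowledge Graph.

```python
def personalized_pagerank(adj, seeds, alpha=0.5, iters=100):
    """Power-iteration Personalized PageRank over {node: [neighbors]};
    random restarts land on the seed nodes in proportion to their weight."""
    total = sum(seeds.values())
    reset = {n: seeds.get(n, 0.0) / total for n in adj}
    rank = dict(reset)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * reset[n] for n in adj}
        for n, nbrs in adj.items():
            if nbrs:
                share = alpha * rank[n] / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
        rank = nxt
    return rank

# toy memory graph: context edges tie phrase nodes to their passages
adj = {
    "passage:0": ["Stanford", "California"],
    "passage:1": ["Alan Turing", "Cambridge"],
    "Stanford": ["passage:0"],
    "California": ["passage:0"],
    "Alan Turing": ["passage:1"],
    "Cambridge": ["passage:1"],
}
# seed = a recognition-memory-approved phrase node for the query
scores = personalized_pagerank(adj, {"Cambridge": 1.0})
top_passages = sorted((n for n in adj if n.startswith("passage:")),
                      key=scores.get, reverse=True)
```

Because restarts concentrate probability mass near the seeds, passages contextually linked to the approved phrases outrank unrelated ones, which is what lets the walk assemble multi-hop evidence.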

KEY CONTRIBUTIONS

Key Contributions

  • Dense-Sparse Integration

    HippoRAG 2 augments phrase-based Knowledge Graph nodes with passage nodes and context edges, enabling dense-sparse integration and achieving 78.2 average Recall@5 versus 73.4 for NV-Embed-v2.

  • Deeper Contextualization and Recognition Memory

    HippoRAG 2 introduces query-to-triple linking plus Recognition Memory triple filtering, improving multi-hop Recall@5 on MuSiQue from 69.7 to 74.7 over NV-Embed-v2.

  • Non-Parametric Continual Learning Evaluation

    HippoRAG 2 is evaluated across factual, multi-hop, and discourse tasks, reaching 59.8 average F1 versus 57.0 for NV-Embed-v2 with Llama 3.3 70B as the QA reader.
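The Recognition Memory filtering named in the second contribution can be approximated as follows. In the actual system an LLM judges each retrieved triple against the query; here a simple token-overlap score stands in for that LLM call so the sketch is runnable, and the query, triples, and function name are illustrative only.

```python
def recognition_memory_filter(query, triples, keep=2):
    """Stand-in for LLM-based recognition memory: score each retrieved
    triple by token overlap with the query and keep the most relevant.
    The real system prompts an LLM to accept or reject each triple."""
    q_tokens = set(query.lower().split())
    def overlap(triple):
        return len(q_tokens & set(" ".join(triple).lower().split()))
    return sorted(triples, key=overlap, reverse=True)[:keep]

query = "Where did Alan Turing study?"
candidates = [
    ("Alan Turing", "studied at", "Cambridge"),
    ("Stanford", "located in", "California"),
    ("Alan Turing", "born in", "London"),
]
kept = recognition_memory_filter(query, candidates)
```

The surviving triples supply the seed phrase nodes for the Personalized PageRank step, so the filter's precision directly shapes where the graph walk starts.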

RESULTS

By the Numbers

Avg F1

59.8

+2.8 over NV-Embed-v2

2Wiki F1

71.0

+9.5 over NV-Embed-v2

MuSiQue Recall@5

74.7

+5.0 over NV-Embed-v2

2Wiki Recall@5

90.4

+13.9 over NV-Embed-v2

On the joint RAG benchmark suite spanning NaturalQuestions, PopQA, MuSiQue, 2Wiki, HotpotQA, LV-Eval, and NarrativeQA, HippoRAG 2 demonstrates stronger retrieval and QA than NV-Embed-v2. These results show HippoRAG 2 can maintain factual, associative, and sense-making performance in a single non-parametric continual learning system.

BENCHMARK

QA performance (F1 scores) on RAG benchmarks using Llama 3.3 70B Instruct as the QA reader

Average F1 score across seven RAG benchmarks with Llama 3.3 70B Instruct QA reader.
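The F1 scores reported here are token-level QA F1. A standard SQuAD-style definition (assumed here; not necessarily the authors' exact evaluation script, which may also apply answer normalization) is:

```python
from collections import Counter

def qa_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the cat" against gold "cat" gives precision 0.5 and recall 1.0, hence F1 of 2/3.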

BENCHMARK

Retrieval performance (passage recall@5) on RAG benchmarks

Average passage Recall@5 across NQ, PopQA, MuSiQue, 2Wiki, and HotpotQA.
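Passage Recall@5 has a simple generic definition, assumed here to match the paper's metric: the fraction of gold supporting passages that appear among the top five retrieved.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold passages found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)
```

On multi-hop datasets like MuSiQue and 2Wiki this metric only saturates when all supporting passages for a question are retrieved together, which is why it rewards the graph-based evidence assembly described above.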

KEY INSIGHT

The Counterintuitive Finding

Despite adding graph structure and LLM filtering, HippoRAG 2 loses nothing on simple QA, reaching 63.3 F1 on NQ versus 61.9 for NV-Embed-v2.

This is surprising because earlier structure-augmented RAG systems like RAPTOR and GraphRAG often degraded factual QA, yet HippoRAG 2 gains +1.4 F1 on NQ while still improving multi-hop tasks.

WHY IT MATTERS

What this unlocks for the field

HippoRAG 2 shows that dense-sparse Knowledge Graph integration and Recognition Memory can give LLMs a more human-like, task-robust non-parametric memory.

Builders can now design RAG systems that handle factual lookup, multi-hop reasoning, and long-narrative sense-making in one architecture, without sacrificing performance on any single regime.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026

Multi-Agent Memory Architecture organizes **Agent IO Layer**, **Agent Cache Layer**, and **Agent Memory Layer** plus **Agent Cache Sharing** and **Agent Memory Access** protocols into a unified architectural framing for multi-agent systems. As a position paper, it reports no main benchmark result or numeric comparison against any baseline.