HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu et al.

2024

TL;DR

HippoRAG uses an LLM-built knowledge graph plus Personalized PageRank to do single-step multi-hop retrieval, reaching 89.5 R@5 on 2WikiMultiHopQA (+21.3 over ColBERTv2).

THE PROBLEM

RAG fails at cross-passage knowledge integration (multi-hop R@5 stuck at 68.2)

Current RAG encodes each passage in isolation, so it cannot easily integrate new knowledge across passage boundaries for multi-hop reasoning.

On 2WikiMultiHopQA, the strong retriever ColBERTv2 reaches only 68.2 R@5, leaving many multi-hop questions unanswered and limiting downstream QA performance.

HOW IT WORKS

HippoRAG — hippocampal indexing with KGs and Personalized PageRank

HippoRAG’s core mechanism links Offline Indexing, Online Retrieval, a schemaless knowledge graph, retrieval encoders, and node specificity into a hippocampal-style memory system.

You can think of HippoRAG as a brain-inspired design in which an LLM “neocortex” writes facts into a graph index and a hippocampus-like PageRank search recalls associations.

Running Personalized PageRank over the knowledge graph lets HippoRAG perform multi-hop retrieval in a single step, beyond what a plain context window or dense retriever can do.
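
To make the single-step association concrete, here is a minimal runnable sketch using networkx. The toy graph mirrors the paper’s Stanford/Alzheimer’s motivating example; the seed weights and damping factor are illustrative assumptions, not the paper’s actual settings.

```python
# Minimal sketch of single-step multi-hop retrieval via Personalized
# PageRank (PPR). Toy graph, seed weights, and alpha are assumptions.
import networkx as nx

# Toy schemaless KG: noun-phrase nodes linked by extracted relations.
G = nx.Graph()
G.add_edge("Stanford", "Thomas Südhof")     # "... is a professor at Stanford"
G.add_edge("Thomas Südhof", "Alzheimer's")  # "... researches Alzheimer's"

# Seed PPR with the query's named entities. Probability mass flows to
# nodes associated with *both* seeds, so the bridging entity surfaces
# in one graph search, with no iterative retrieval rounds.
seeds = {"Stanford": 0.5, "Alzheimer's": 0.5}
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)
print(max(scores, key=scores.get))  # -> Thomas Südhof
```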

DIAGRAM

Query-time retrieval flow in HippoRAG

This diagram shows how HippoRAG processes a user query through entity extraction, graph seeding, Personalized PageRank, and passage ranking during online retrieval.

DIAGRAM

Evaluation pipeline and baselines for HippoRAG

This diagram shows how HippoRAG is evaluated across datasets, compared against baselines, and fed into the QA reader.

PROCESS

How HippoRAG Handles a Multi-Hop Question

  1. Offline Indexing

    HippoRAG uses Offline Indexing with an instruction-tuned LLM to extract OpenIE triples and build the schemaless knowledge graph over passages.

  2. Synonymy Edges with Retrieval Encoders

    HippoRAG applies retrieval encoders like Contriever or ColBERTv2 to add synonymy edges between noun-phrase nodes when cosine similarity exceeds threshold τ (see the indexing sketch after this list).

  3. Online Retrieval

    During Online Retrieval, HippoRAG extracts the query’s named entities, links them to KG nodes via the retrieval encoders, and seeds Personalized PageRank with these query nodes.

  4. Node Specificity and Passage Ranking

    HippoRAG scales query node probabilities by node specificity, runs PPR, aggregates node scores over passages, and returns ranked passages to the QA reader (see the retrieval sketch after this list).
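
A minimal sketch of steps 1 and 2, assuming a hypothetical llm_extract_triples wrapper for the paper’s OpenIE prompt and a mean-pooled Contriever encoder loaded via sentence-transformers as a stand-in for the retrieval encoder; the threshold default and helper names are assumptions, not the paper’s released code.

```python
# Sketch of Offline Indexing (steps 1-2). `llm_extract_triples` is a
# hypothetical placeholder for the instruction-tuned LLM's OpenIE prompt.
import itertools
import networkx as nx
from sentence_transformers import SentenceTransformer, util

def llm_extract_triples(passage: str) -> list[tuple[str, str, str]]:
    """Placeholder: prompt the LLM for (subject, relation, object) triples."""
    raise NotImplementedError

def build_kg(passages: list[str], tau: float = 0.8) -> nx.Graph:
    G = nx.Graph()
    for pid, passage in enumerate(passages):
        for subj, rel, obj in llm_extract_triples(passage):
            G.add_edge(subj, obj, relation=rel)  # noun phrases become nodes
            for node in (subj, obj):             # remember source passages
                G.nodes[node].setdefault("passages", set()).add(pid)

    # Synonymy edges: connect noun-phrase nodes whose embeddings exceed
    # cosine-similarity threshold tau (Contriever/ColBERTv2 in the paper;
    # a mean-pooled Contriever checkpoint is used here as a stand-in).
    encoder = SentenceTransformer("facebook/contriever")
    nodes = list(G.nodes)
    emb = encoder.encode(nodes, convert_to_tensor=True,
                         normalize_embeddings=True)
    for i, j in itertools.combinations(range(len(nodes)), 2):
        if util.cos_sim(emb[i], emb[j]).item() > tau:
            G.add_edge(nodes[i], nodes[j], relation="synonym")
    return G
```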
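
And a matching sketch of steps 3 and 4, continuing from build_kg above; llm_extract_entities and link_to_kg_nodes are hypothetical placeholders for the paper’s LLM prompt and encoder-based node linking.

```python
# Sketch of Online Retrieval (steps 3-4), reusing the graph built above.
import networkx as nx

def llm_extract_entities(query: str) -> list[str]:
    """Placeholder: prompt the LLM for the query's named entities."""
    raise NotImplementedError

def link_to_kg_nodes(entities: list[str], G: nx.Graph) -> list[str]:
    """Placeholder: map each entity to its nearest KG node via the encoder."""
    raise NotImplementedError

def retrieve(query: str, G: nx.Graph, num_passages: int,
             top_k: int = 5) -> list[int]:
    query_nodes = link_to_kg_nodes(llm_extract_entities(query), G)

    # Node specificity: weight each seed by the inverse of the number of
    # passages that mention it, an IDF-like signal computed per node.
    seeds = {n: 1.0 / len(G.nodes[n]["passages"]) for n in query_nodes}

    # One PPR pass spreads probability from the seeds across the KG,
    # reaching bridging entities for multi-hop questions.
    ppr = nx.pagerank(G, alpha=0.85, personalization=seeds)

    # Aggregate node scores over each node's source passages, then hand
    # the top-ranked passages to the QA reader.
    passage_scores = [0.0] * num_passages
    for node, score in ppr.items():
        for pid in G.nodes[node].get("passages", ()):
            passage_scores[pid] += score
    return sorted(range(num_passages),
                  key=passage_scores.__getitem__, reverse=True)[:top_k]
```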

KEY CONTRIBUTIONS

Key Contributions

  • Neurobiologically inspired HippoRAG framework

    HippoRAG instantiates hippocampal indexing with Offline Indexing, Online Retrieval, a schemaless knowledge graph, and Personalized PageRank to enable single-step multi-hop retrieval.

  • Strong multi-hop QA gains

    HippoRAG with Contriever reaches 89.5 R@5 on 2WikiMultiHopQA, improving over ColBERTv2’s 68.2 R@5 by 21.3 points in single-step retrieval.

  • Complementary to IRCoT

    HippoRAG used inside IRCoT boosts 2WikiMultiHopQA R@5 from 74.4 to 93.9 and lifts average QA F1 to 51.7, while remaining 6 to 13 times faster than IRCoT alone (sketched below).
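
A rough sketch of that combination, assuming HippoRAG simply replaces IRCoT’s retriever and reusing retrieve from the sketch above; llm_generate_thought and the termination check are hypothetical simplifications of IRCoT’s actual prompting.

```python
# Hypothetical sketch of HippoRAG as IRCoT's retriever: each reasoning
# step retrieves again with the running chain of thought as the query.
def llm_generate_thought(question: str, thoughts: list[str],
                         passage_ids: set[int]) -> str:
    """Placeholder for IRCoT's chain-of-thought step over retrieved passages."""
    raise NotImplementedError

def ircot_with_hipporag(question: str, G, num_passages: int,
                        max_steps: int = 4):
    thoughts, passage_ids = [], set()
    for _ in range(max_steps):
        query = " ".join([question] + thoughts)
        passage_ids.update(retrieve(query, G, num_passages))  # sketch above
        thought = llm_generate_thought(question, thoughts, passage_ids)
        thoughts.append(thought)
        if "answer is" in thought.lower():  # IRCoT-style termination check
            break
    return thoughts, passage_ids
```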

RESULTS

By the Numbers

  • R@5 = 89.5 (+21.3 over ColBERTv2), single-step retrieval on 2WikiMultiHopQA

  • R@2 = 71.5 (+12.3 over ColBERTv2), single-step retrieval on 2WikiMultiHopQA

  • F1 = 62.7 for IRCoT + HippoRAG vs 45.1 for IRCoT alone, on 2WikiMultiHopQA

  • R@5 = 57.6 for IRCoT + HippoRAG vs 53.7 for IRCoT alone, on MuSiQue

These numbers come from MuSiQue and 2WikiMultiHopQA, which stress multi-hop reasoning over multiple passages. The gains show that HippoRAG’s graph-based retrieval substantially improves both recall and QA F1 over strong dense and iterative baselines.

BENCHMARK

Single-step retrieval on 2WikiMultiHopQA (R@5 from Table 2)

R@5 on 2WikiMultiHopQA dev for single-step retrieval methods.

BENCHMARK

QA F1 on 2WikiMultiHopQA (Table 4)

F1 on 2WikiMultiHopQA dev for different retrievers feeding the same QA reader.

KEY INSIGHT

The Counterintuitive Finding

Single-step HippoRAG matches or beats IRCoT’s QA performance while being 10 to 30 times cheaper and 6 to 13 times faster at query time.

This is surprising because iterative retrieval methods like IRCoT were assumed necessary for multi-hop reasoning, yet HippoRAG’s single-step graph search achieves similar or better accuracy at far lower cost.

WHY IT MATTERS

What this unlocks for the field

HippoRAG unlocks scalable, continually updated long-term memory in which new knowledge is integrated by adding graph nodes and edges rather than by retraining the LLM.

Builders can now design RAG systems that handle complex path-finding multi-hop queries efficiently, making tasks like literature review and legal reasoning more practical at scale.
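
Concretely, under the indexing sketch above (and reusing the hypothetical llm_extract_triples helper), integrating a new passage is just a graph update, with no training run:

```python
# Continual knowledge integration: index a new passage by adding nodes
# and edges to the existing graph; no LLM retraining, no re-indexing of
# old passages. Reuses the hypothetical `llm_extract_triples` above.
def add_passage(G, passage: str, pid: int) -> None:
    for subj, rel, obj in llm_extract_triples(passage):
        G.add_edge(subj, obj, relation=rel)
        for node in (subj, obj):
            G.nodes[node].setdefault("passages", set()).add(pid)
```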


