Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Authors: Chulun Zhou, Chunkang Zhang, Guoxin Yu et al.

2025

TL;DR

HGMEM uses hypergraph-based memory with update, insertion, and merging to boost multi-step RAG, reaching 73.81% accuracy on Prelude vs 72.22% for HippoRAG v2 (+1.59 points).



THE PROBLEM

Multi-step RAG misses high-order correlations in long contexts

HGMEM addresses the problem that existing working memories act as passive storage, overlooking crucial high-order correlations among primitive facts and limiting global sense-making capacity.

This failure of multi-step RAG at long-context global sense making leads to fragmented reasoning, weak guidance for subquery generation, and poor complex relational modeling over extended documents.

HOW IT WORKS

HGMEM — Hypergraph-based Memory Mechanism

HGMEM introduces Hypergraph-based Memory Storage, Adaptive Memory-based Evidence Retrieval, and Dynamic of Memory Evolving to structure working memory as a hypergraph of expressive memory points.

You can think of HGMEM like a card catalog where each card links many related books at once, and cards themselves can be merged into higher level summaries as understanding deepens.

By explicitly merging memory points into high-order hyperedges, HGMEM enables complex relational modeling that a flat context window of primitive facts cannot support.
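To make this concrete, here is a minimal Python sketch of a hypergraph-style working memory with the three evolution operations (update, insertion, merging). The class and method names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a hypergraph-style working memory.
# Class and method names are assumptions, not HGMEM's actual code.
from dataclasses import dataclass, field


@dataclass
class MemoryPoint:
    """A hyperedge: one memory point linking several entities to a summary."""
    entities: frozenset[str]                         # the n entities this hyperedge connects
    summary: str                                     # natural-language content of the memory
    sources: set[str] = field(default_factory=set)   # ids of supporting source chunks


class HypergraphMemory:
    def __init__(self) -> None:
        self.points: list[MemoryPoint] = []

    def insert(self, point: MemoryPoint) -> None:
        """Insertion: add a new memory point for newly retrieved evidence."""
        self.points.append(point)

    def update(self, idx: int, new_summary: str, new_sources: set[str]) -> None:
        """Update: refine an existing memory point with new evidence."""
        self.points[idx].summary = new_summary
        self.points[idx].sources |= new_sources

    def merge(self, idxs: list[int], merged_summary: str) -> MemoryPoint:
        """Merging: fuse several memory points into one higher-order hyperedge."""
        drop = set(idxs)
        merged = MemoryPoint(
            entities=frozenset().union(*(self.points[i].entities for i in idxs)),
            summary=merged_summary,
            sources=set().union(*(self.points[i].sources for i in idxs)),
        )
        # Replace the merged points with the new higher-order memory point.
        self.points = [p for i, p in enumerate(self.points) if i not in drop] + [merged]
        return merged
```

The key point is that merge fuses several memory points into a single hyperedge spanning the union of their entities, which is how higher-order correlations are represented explicitly rather than left implicit in a flat list of facts.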

DIAGRAM

Multi-step interaction and memory evolution in HGMEM

This diagram shows how HGMEM interleaves subquery generation, retrieval, and hypergraph memory evolution across multiple interaction steps.

DIAGRAM

Evaluation pipeline and ablation design for HGMEM

This diagram shows how HGMEM is evaluated across datasets and ablations on retrieval strategy and memory evolution operations.

PROCESS

How HGMEM Handles a Multi-step RAG Query

  1. Problem Formulation

    HGMEM formalizes the document D, graph G, and query q̂, linking entities and relationships to source chunks for later Hypergraph-based Memory Storage.

  2. Multi-step RAG System with Memory

    HGMEM runs iterative interaction steps in which the LLM decides whether the current memory M(t) suffices or new subqueries Q(t) are needed for further retrieval (see the control-loop sketch after this list).

  3. Adaptive Memory-based Evidence Retrieval

    HGMEM uses Adaptive Memory-based Evidence Retrieval to perform Local Investigation around existing hyperedges or Global Exploration over unseen graph regions.

  4. Dynamic of Memory Evolving

    HGMEM applies Dynamic of Memory Evolving with update, insertion, and merging operations to build higher-order hyperedges before Memory-enhanced Response Generation.
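
The four steps above can be read as a single control loop. Below is a minimal, hedged sketch of that loop, with the memory, retrieval, and evolution components passed in as callables; all names, signatures, and the max_steps budget are illustrative assumptions rather than HGMEM's actual interface.

```python
# Illustrative multi-step RAG loop with an evolving working memory.
# The callables stand in for the LLM decision, subquery generation,
# adaptive retrieval, and memory evolution described in the steps above.
from typing import Any, Callable


def answer_with_hypergraph_memory(
    query: str,
    memory: Any,                                            # hypergraph working memory M(t)
    is_sufficient: Callable[[str, Any], bool],              # LLM judges if memory answers the query
    propose_subqueries: Callable[[str, Any], list[str]],    # generate subqueries Q(t)
    retrieve: Callable[[str, Any], list[str]],              # Local Investigation / Global Exploration
    evolve: Callable[[Any, list[str]], None],               # update / insert / merge memory points
    generate_answer: Callable[[str, Any], str],             # memory-enhanced response generation
    max_steps: int = 5,
) -> str:
    """Iterate retrieval and memory evolution until memory suffices, then answer."""
    for _ in range(max_steps):
        # Step 1: the LLM inspects the current memory and decides whether to stop.
        if is_sufficient(query, memory):
            break
        # Step 2: otherwise it proposes subqueries targeting what is still missing.
        for subquery in propose_subqueries(query, memory):
            # Step 3: adaptive retrieval, either locally around existing hyperedges
            # or by exploring unseen regions of the graph.
            evidence = retrieve(subquery, memory)
            # Step 4: evolve the hypergraph memory with the new evidence.
            evolve(memory, evidence)
    # Step 5: generate the final response conditioned on the evolved memory.
    return generate_answer(query, memory)
```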

KEY CONTRIBUTIONS

Key Contributions

  • Hypergraph-based memory mechanism

    HGMEM introduces Hypergraph-based Memory Storage where each hyperedge is a memory point connecting multiple entities, enabling flexible modeling of n-ary relations beyond binary graphs.

  • Adaptive memory-based evidence retrieval

    HGMEM proposes Adaptive Memory-based Evidence Retrieval that combines Local Investigation and Global Exploration, improving comprehensiveness from 61.38 to 64.18 on Longbench with Qwen2.5-32B-Instruct.

  • Dynamic memory evolving with merging

    HGMEM designs Dynamic of Memory Evolving with update, insertion, and merging, where removing merging drops Prelude accuracy from 70.63 to 61.11 with Qwen2.5-32B-Instruct.

RESULTS

By the Numbers

Comprehensiveness

69.74

+3.76 over DeepRAG (65.98) with GPT-4o on Longbench

Diversity

55.00

+1.00 over ComoRAG (54.00) with GPT-4o on NoCha

Acc (%)

73.81

+1.59 over HippoRAG v2 (72.22) with GPT-4o on Prelude

Acc (%)

64.18

+2.73 over DeepRAG (61.45) with Qwen2.5-32B-Instruct on Longbench

On the Longbench generative sense-making QA benchmark and the long narrative understanding benchmarks NarrativeQA, NoCha, and Prelude, HGMEM is evaluated with GPT-4o and Qwen2.5-32B-Instruct backbones. These headline numbers show that HGMEM consistently improves multi-step RAG over NaiveRAG, GraphRAG, LightRAG, HippoRAG v2, DeepRAG, and ComoRAG under matched retrieval budgets.

BENCHMARK

Overall results on Prelude with GPT-4o

Acc (%) on the Prelude long narrative understanding benchmark with a GPT-4o backbone.

BENCHMARK

Ablation on memory merging for Prelude with Qwen2.5-32B-Instruct

Acc (%) on Prelude when removing update or merging operations in HGMEM.

KEY INSIGHT

The Counterintuitive Finding

HGMEM without merging achieves 70.00% accuracy on primitive NarrativeQA queries, matching full HGMEM despite a lower average number of entities per hyperedge.

This is surprising because one might expect high-order correlations to always help, but for primitive queries extra associations can introduce redundancy without improving entailment.

WHY IT MATTERS

What this unlocks for the field

HGMEM unlocks dynamic high-order working memory for multi-step RAG, letting LLMs build integrated, situated knowledge structures over long contexts.

With HGMEM, builders can create agents that adaptively explore graphs, evolve hypergraph memories, and answer global sense-making questions that previously overwhelmed flat retrieval pipelines.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
