HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Authors: Yijie Zhong, Yunfan Gao, Haofen Wang

2026

TL;DR

HingeMem uses boundary-guided hyperedge memory plus query-adaptive retrieval to reach 63.9 F1 on LOCOMO, +7.0 over Zep, without category templates.



THE PROBLEM

Fixed Top-k Retrieval Fails on Diverse Long-Term Dialogue Queries (≈30% Drop Without Categories)

HingeMem targets long-term dialogue memory where performance drops around 30% when query categories are unspecified and fixed Top-k retrieval is used.

In such settings, systems like MemoryBank and Zep lose 30–40% of their performance, producing unstable answers and noisy, inefficient retrieval over ultra-long histories.

HOW IT WORKS

HingeMem — Boundary Guided Memory with Query Adaptive Retrieval

HingeMem is built from six components: Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop. Together they segment the dialogue into events and plan how each query is retrieved.

You can think of HingeMem like a hippocampus plus cortex: the cortex marks event boundaries, and the hippocampus stores hyperedges as an indexable card catalog of experiences.

This design lets HingeMem decide both what to retrieve and how much to retrieve, instead of stuffing a plain context window with a fixed Top-k list of memories.
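The contrast with fixed Top-k can be made concrete. Below is a minimal sketch (function names and thresholds are hypothetical, not from the paper): selection walks a ranked candidate list and stops when scores fall below a floor, so the amount retrieved adapts to how much strong evidence the query actually has.

```python
# Minimal sketch of adaptive-stop selection vs. fixed Top-k.
# All names and thresholds here are illustrative assumptions.

def adaptive_select(scored, min_score=0.5, max_items=10):
    """scored: list of (memory, score) pairs sorted by score descending."""
    selected = []
    for memory, score in scored:
        if score < min_score or len(selected) >= max_items:
            break  # adaptive stop: weak evidence adds noise, not recall
        selected.append(memory)
    return selected

ranked = [("hyperedge-a", 0.91), ("hyperedge-b", 0.74),
          ("hyperedge-c", 0.48), ("hyperedge-d", 0.22)]
print(adaptive_select(ranked))  # stops before the low-scoring tail
```

A fixed Top-k=4 would return all four candidates here, padding the context with the weak tail; the adaptive stop keeps only the two strong matches.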

DIAGRAM

Query Adaptive Retrieval Pipeline in HingeMem

This diagram shows how HingeMem analyzes a query, plans retrieval, reranks hyperedges, and adaptively stops to select memory.

DIAGRAM

LOCOMO Evaluation and Ablation Design for HingeMem

This diagram shows how HingeMem is evaluated on LOCOMO and how the ablations compare boundary memory and adaptive retrieval.

PROCESS

How HingeMem Handles a LOCOMO Question Over Long Dialogues

  1. Dialogue Boundary Extraction

    HingeMem uses Dialogue Boundary Extraction to segment each session whenever person, time, location, or topic changes, producing element nodes and segment reasons.

  2. Memory Construction

    HingeMem runs Memory Construction to merge nodes, compute salience scores, cluster topics, and build hyperedges into Boundary Guided Long-Term Memory.

  3. Query Adaptive Retrieval

    Given a question, HingeMem applies Query Adaptive Retrieval to infer the query type, select relevant elements, and generate a retrieval plan with element priorities.

  4. Hyperedge Rerank and Adaptive Stop

    HingeMem feeds candidate hyperedges into Hyperedge Rerank and Adaptive Stop, then passes the selected hyperedges as context for answer generation.
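The four steps above can be sketched end to end. Everything in this sketch is illustrative: the data structures, keyword matching, and element-overlap scoring are simple stand-ins for the paper's LLM-driven extraction and planning.

```python
# Hedged sketch of the four-step pipeline (all heuristics are assumptions).
from dataclasses import dataclass, field

@dataclass
class Segment:
    person: str
    topic: str
    turns: list = field(default_factory=list)

def extract_boundaries(turns):
    """Step 1: start a new segment whenever person or topic changes."""
    segments = []
    for person, topic, text in turns:
        if not segments or (segments[-1].person, segments[-1].topic) != (person, topic):
            segments.append(Segment(person, topic))
        segments[-1].turns.append(text)
    return segments

def build_hyperedges(segments):
    """Step 2: one hyperedge per segment, linking its element nodes."""
    return [{"elements": {s.person, s.topic}, "content": " ".join(s.turns)}
            for s in segments]

def plan_retrieval(query, elements):
    """Step 3: pick the query-relevant elements (toy keyword match)."""
    return {e for e in elements if e.lower() in query.lower()}

def rerank_and_stop(query_elems, hyperedges, max_items=2):
    """Step 4: rank hyperedges by element overlap, stop after max_items."""
    scored = sorted(hyperedges,
                    key=lambda h: len(h["elements"] & query_elems),
                    reverse=True)
    return [h["content"] for h in scored[:max_items]
            if h["elements"] & query_elems]

turns = [("Alice", "travel", "I visited Kyoto in May."),
         ("Alice", "work", "The project ships next week."),
         ("Bob", "travel", "Kyoto temples are lovely.")]
edges = build_hyperedges(extract_boundaries(turns))
plan = plan_retrieval("What did Alice say about travel?",
                      {"Alice", "Bob", "travel", "work"})
print(rerank_and_stop(plan, edges))
```

The point of the sketch is the flow, not the scoring: segments are written only at boundaries, each segment becomes one hyperedge, and retrieval is routed by the elements the query mentions rather than by a fixed Top-k over raw turns.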

KEY CONTRIBUTIONS

Key Contributions

  • Boundary Guided Long-Term Memory

    HingeMem introduces Boundary Guided Long-Term Memory, which writes hyperedges whenever person, time, location, or topic changes, preserving details while avoiding continuous summarization.

  • Query Adaptive Retrieval Mechanism

    HingeMem proposes Query Adaptive Retrieval, which predicts Recall Priority, Precision Priority, or Judgment queries and plans element-aware routing over the boundary memory.

  • Efficient Long-Term Memory on LOCOMO

    HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge on LOCOMO, with 68% lower question-answering token cost than HippoRAG2, while not using category-specific templates.
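As a toy illustration of the query typing in the second contribution, the sketch below maps each predicted query type to its own retrieval budget. The classification rules and budgets are invented stand-ins; the paper infers the type with an LLM rather than keyword rules.

```python
# Illustrative query-type routing (rules and budgets are assumptions).

def classify_query(query):
    q = query.lower()
    if q.startswith(("did ", "is ", "was ", "does ")):
        return "judgment"   # yes/no check: verify against memory
    if "when" in q or "where" in q:
        return "precision"  # one exact fact: retrieve few, precisely
    return "recall"         # open question: retrieve broadly

ROUTING = {"judgment": dict(top_n=3, stop_threshold=0.7),
           "precision": dict(top_n=2, stop_threshold=0.6),
           "recall": dict(top_n=8, stop_threshold=0.3)}

for q in ("Did Alice visit Kyoto?", "When did the project ship?",
          "What has Alice been up to?"):
    kind = classify_query(q)
    print(q, "->", kind, ROUTING[kind])
```

The idea is that a Judgment query needs a tight, high-confidence check, a Precision Priority query needs one exact fact, and a Recall Priority query needs broad coverage, so each type gets a different stopping rule instead of one fixed Top-k.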

RESULTS

By the Numbers

Overall F1

63.9

+7.0 over Zep (56.9 F1 without category format)

Overall J

75.1

+5.5 over HippoRAG2 (69.6 J with category format, approximated from its reported 70.6)

BLEU-1

0.404

+0.012 over Zep (0.392 BLEU-1 with category format)

Multi-Hop F1

53.6

+12.5 over HippoRAG2 (41.1 F1 with category format)

On the ultra-long dialogue benchmark LOCOMO, which averages 15,965.8 tokens per conversation and 1,986 questions across five categories, HingeMem demonstrates that boundary-guided memory plus query-adaptive retrieval can handle diverse query types without category templates. These results show that HingeMem scales across LLM sizes while improving both accuracy and efficiency for long-term conversational memory.


BENCHMARK

Overall F1 on LOCOMO Without Category-Specific QA Formats

Overall F1 scores on LOCOMO when query categories are not provided to the systems.

BENCHMARK

Ablation: Overall F1 for Boundary and Retrieval Variants

Overall F1 for RAG with text memory versus HingeMem boundary memory and adaptive retrieval variants.

KEY INSIGHT

The Counterintuitive Finding

HingeMem, without any category-specific question templates, reaches 63.9 overall F1, while Zep with templates reaches only 56.9 F1.

This is surprising because template-aware baselines should have an advantage, yet HingeMem’s boundary-guided memory and adaptive retrieval outperform them without extra category hints.

WHY IT MATTERS

What this unlocks for the field

HingeMem shows that boundary-triggered hyperedges plus query-adaptive retrieval can support scalable, interpretable long-term memory across ultra-long multi-session dialogues.

Builders can now deploy assistants that remember months of interaction, adapt retrieval depth per query type, and keep token costs manageable for web and edge applications.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
