HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Authors: Yijie Zhong, Yunfan Gao, Haofen Wang

2026

TL;DR

HingeMem uses boundary-guided hyperedge memory plus query-adaptive retrieval to reach 63.9 F1 on LOCOMO, +7.0 over Zep, without category templates.



THE PROBLEM

Fixed Top-k Retrieval Fails on Diverse Long-Term Dialogue Queries (≈30% Drop Without Categories)

HingeMem targets long-term dialogue memory where performance drops around 30% when query categories are unspecified and fixed Top-k retrieval is used.

In such settings, systems like MemoryBank and Zep lose 30–40% of their performance, producing unstable answers and noisy, inefficient retrieval over ultra-long histories.

HOW IT WORKS

HingeMem — Boundary Guided Memory with Query Adaptive Retrieval

HingeMem is built from six components: Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop. Together they segment the dialogue into events and plan how each query is retrieved.

You can think of HingeMem like a hippocampus plus cortex: the cortex marks event boundaries, and the hippocampus stores hyperedges as an indexable card catalog of experiences.

This design lets HingeMem decide both what to retrieve and how much to retrieve, instead of stuffing a plain context window with a fixed Top-k list of memories.
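The contrast with fixed Top-k can be made concrete. Below is a minimal sketch (function names and thresholds are hypothetical, not from the paper): selection walks a ranked candidate list and stops when scores fall below a floor, so the amount retrieved adapts to how much strong evidence the query actually has.

```python
# Minimal sketch of adaptive-stop selection vs. fixed Top-k.
# All names and thresholds here are illustrative assumptions.

def adaptive_select(scored, min_score=0.5, max_items=10):
    """scored: list of (memory, score) pairs sorted by score descending."""
    selected = []
    for memory, score in scored:
        if score < min_score or len(selected) >= max_items:
            break  # adaptive stop: weak evidence adds noise, not recall
        selected.append(memory)
    return selected

ranked = [("hyperedge-a", 0.91), ("hyperedge-b", 0.74),
          ("hyperedge-c", 0.48), ("hyperedge-d", 0.22)]
print(adaptive_select(ranked))  # stops before the low-scoring tail
```

A fixed Top-k=4 would return all four candidates here, padding the context with the weak tail; the adaptive stop keeps only the two strong matches.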

DIAGRAM

Query Adaptive Retrieval Pipeline in HingeMem

This diagram shows how HingeMem analyzes a query, plans retrieval, reranks hyperedges, and adaptively stops to select memory.

DIAGRAM

LOCOMO Evaluation and Ablation Design for HingeMem

This diagram shows how HingeMem is evaluated on LOCOMO and how the ablations compare boundary memory and adaptive retrieval.

PROCESS

How HingeMem Handles a LOCOMO Question Over Long Dialogues

  1. Dialogue Boundary Extraction

    HingeMem uses Dialogue Boundary Extraction to segment each session whenever person, time, location, or topic changes, producing element nodes and segment reasons.

  2. Memory Construction

    HingeMem runs Memory Construction to merge nodes, compute salience scores, cluster topics, and build hyperedges into Boundary Guided Long-Term Memory.

  3. Query Adaptive Retrieval

    Given a question, HingeMem applies Query Adaptive Retrieval to infer the query type, select relevant elements, and generate a retrieval plan with element priorities.

  4. Hyperedge Rerank and Adaptive Stop

    HingeMem feeds candidate hyperedges into Hyperedge Rerank and Adaptive Stop, then passes the selected hyperedges as context for answer generation.
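The four steps above can be sketched end to end. Everything in this sketch is illustrative: the data structures, keyword matching, and element-overlap scoring are simple stand-ins for the paper's LLM-driven extraction and planning.

```python
# Hedged sketch of the four-step pipeline (all heuristics are assumptions).
from dataclasses import dataclass, field

@dataclass
class Segment:
    person: str
    topic: str
    turns: list = field(default_factory=list)

def extract_boundaries(turns):
    """Step 1: start a new segment whenever person or topic changes."""
    segments = []
    for person, topic, text in turns:
        if not segments or (segments[-1].person, segments[-1].topic) != (person, topic):
            segments.append(Segment(person, topic))
        segments[-1].turns.append(text)
    return segments

def build_hyperedges(segments):
    """Step 2: one hyperedge per segment, linking its element nodes."""
    return [{"elements": {s.person, s.topic}, "content": " ".join(s.turns)}
            for s in segments]

def plan_retrieval(query, elements):
    """Step 3: pick the query-relevant elements (toy keyword match)."""
    return {e for e in elements if e.lower() in query.lower()}

def rerank_and_stop(query_elems, hyperedges, max_items=2):
    """Step 4: rank hyperedges by element overlap, stop after max_items."""
    scored = sorted(hyperedges,
                    key=lambda h: len(h["elements"] & query_elems),
                    reverse=True)
    return [h["content"] for h in scored[:max_items]
            if h["elements"] & query_elems]

turns = [("Alice", "travel", "I visited Kyoto in May."),
         ("Alice", "work", "The project ships next week."),
         ("Bob", "travel", "Kyoto temples are lovely.")]
edges = build_hyperedges(extract_boundaries(turns))
plan = plan_retrieval("What did Alice say about travel?",
                      {"Alice", "Bob", "travel", "work"})
print(rerank_and_stop(plan, edges))
```

The point of the sketch is the flow, not the scoring: segments are written only at boundaries, each segment becomes one hyperedge, and retrieval is routed by the elements the query mentions rather than by a fixed Top-k over raw turns.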

KEY CONTRIBUTIONS

Key Contributions

  • Boundary Guided Long-Term Memory

    HingeMem introduces Boundary Guided Long-Term Memory, which writes hyperedges whenever person, time, location, or topic changes, preserving details while avoiding continuous summarization.

  • Query Adaptive Retrieval Mechanism

    HingeMem proposes Query Adaptive Retrieval, which predicts Recall Priority, Precision Priority, or Judgment queries and plans element-aware routing over the boundary memory.

  • Efficient Long-Term Memory on LOCOMO

    HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge on LOCOMO, with 68% lower question-answering token cost than HippoRAG2, while not using category-specific templates.
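As a toy illustration of the query typing in the second contribution, the sketch below maps each predicted query type to its own retrieval budget. The classification rules and budgets are invented stand-ins; the paper infers the type with an LLM rather than keyword rules.

```python
# Illustrative query-type routing (rules and budgets are assumptions).

def classify_query(query):
    q = query.lower()
    if q.startswith(("did ", "is ", "was ", "does ")):
        return "judgment"   # yes/no check: verify against memory
    if "when" in q or "where" in q:
        return "precision"  # one exact fact: retrieve few, precisely
    return "recall"         # open question: retrieve broadly

ROUTING = {"judgment": dict(top_n=3, stop_threshold=0.7),
           "precision": dict(top_n=2, stop_threshold=0.6),
           "recall": dict(top_n=8, stop_threshold=0.3)}

for q in ("Did Alice visit Kyoto?", "When did the project ship?",
          "What has Alice been up to?"):
    kind = classify_query(q)
    print(q, "->", kind, ROUTING[kind])
```

The idea is that a Judgment query needs a tight, high-confidence check, a Precision Priority query needs one exact fact, and a Recall Priority query needs broad coverage, so each type gets a different stopping rule instead of one fixed Top-k.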

RESULTS

By the Numbers

Overall F1

63.9

+7.0 over Zep (56.9 F1 without category format)

Overall J

75.1

+5.5 over HippoRAG2 (69.6 J with category format, approximated from its reported 70.6)

BLEU-1

0.404

+0.012 over Zep (0.392 BLEU-1 with category format)

Multi-Hop F1

53.6

+12.5 over HippoRAG2 (41.1 F1 with category format)

On the ultra-long dialogue benchmark LOCOMO, which averages 15,965.8 tokens per conversation and 1,986 questions across five categories, HingeMem demonstrates that boundary-guided memory plus query-adaptive retrieval can handle diverse query types without category templates. These results show that HingeMem scales across LLM sizes while improving both accuracy and efficiency for long-term conversational memory.


BENCHMARK

Overall F1 on LOCOMO Without Category-Specific QA Formats

Overall F1 scores on LOCOMO when query categories are not provided to the systems.

BENCHMARK

Ablation: Overall F1 for Boundary and Retrieval Variants

Overall F1 for RAG with text memory versus HingeMem boundary memory and adaptive retrieval variants.

KEY INSIGHT

The Counterintuitive Finding

HingeMem, without any category-specific question templates, reaches 63.9 overall F1, while Zep with templates reaches only 56.9 F1.

This is surprising because template-aware baselines should have an advantage, yet HingeMem’s boundary-guided memory and adaptive retrieval outperform them without extra category hints.

WHY IT MATTERS

What this unlocks for the field

HingeMem shows that boundary-triggered hyperedges plus query-adaptive retrieval can support scalable, interpretable long-term memory across ultra-long multi-session dialogues.

Builders can now deploy assistants that remember months of interaction, adapt retrieval depth per query type, and keep token costs manageable for web and edge applications.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
