GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

Authors: Zhaofen Wu, Hanrong Zhang, Fulin Lin et al.

2026

TL;DR

GAM uses state-based hierarchical graph memory with semantic-event-triggered consolidation to reach 40.00 F1 on LoCoMo vs 35.38 for Mem0 (+4.62).


THE PROBLEM

Unified streams cause memory loss and semantic drift in long dialogs

GAM targets unified stream-based memory systems where continuous updates cause Memory Loss and Semantic Drift, corrupting long-term knowledge during ongoing interactions.

These failures break long-horizon LLM agents on datasets like LoCoMo and LongDialQA, where agents must maintain coherent long-term interactions across evolving multi-session dialogues.

HOW IT WORKS

GAM — Hierarchical Graph-based Agentic Memory

GAM’s core mechanism combines a Hierarchical Graph Memory Architecture, Topic Associative Network, Event Progression Graphs, State-Based Memory Consolidation, and Graph-Guided Multi-Factor Retrieval to separate rapid encoding from stable storage.

You can think of GAM like a brain with a short-term hippocampus buffer and a long-term cortical knowledge graph, plus a librarian that files only complete stories into the archive.

This design lets GAM update memory only at semantically complete boundaries, enabling precise, low-interference recall that a plain context window or flat vector store cannot match.
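As a rough sketch of this two-layer design, the Python below models a local Event Progression Graph (fast episodic encoding) and a global Topic Associative Network (stable storage) as separate structures that only touch at consolidation time. All class and field names here are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class EventUnit:
    """Atomic event extracted from one utterance (local layer)."""
    text: str
    speaker: str
    turn: int

@dataclass
class EventProgressionGraph:
    """Local graph that buffers events for the current episode."""
    events: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (i, j) progression links

    def append(self, unit: EventUnit) -> None:
        # Link each new event to its predecessor to record progression.
        if self.events:
            self.edges.append((len(self.events) - 1, len(self.events)))
        self.events.append(unit)

@dataclass
class TopicNode:
    """Consolidated episode stored in the global layer."""
    summary: str                 # abstractive summary of the episode
    raw: str                     # concatenated raw content
    neighbors: list = field(default_factory=list)

class TopicAssociativeNetwork:
    """Global layer: only modified when an episode is consolidated."""
    def __init__(self):
        self.nodes = []

    def integrate(self, node: TopicNode) -> None:
        # Similarity-based linking to existing topics is omitted here.
        self.nodes.append(node)
```

Because incoming utterances only mutate the local graph, the global network stays untouched during rapid encoding, which is the structural separation the architecture relies on.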

DIAGRAM

Graph-Guided Multi-Factor Retrieval Pipeline

This diagram shows how GAM performs graph-guided multi-factor retrieval from the Topic Associative Network and archived Event Progression Graphs to answer a query.

DIAGRAM

State-Based Memory Consolidation Lifecycle

This diagram shows how GAM switches between Episodic Buffering State and Semantic Consolidation State using semantic boundary detection.

PROCESS

How GAM Handles a Dialogue Session — State-Based Memory Consolidation

  1. Episodic Buffering Phase

    GAM builds Event Progression Graphs from incoming utterances, appending atomic event units and edges while isolating updates from the Topic Associative Network.

  2. Semantic Boundary Detection

    GAM uses State-Based Memory Consolidation with an LLM discriminator and a 2048-token episodic buffer to detect semantic divergence and set the boundary marker b_t when topics shift.

  3. Semantic Consolidation Phase

    GAM summarizes the buffered Event Progression Graph into a dual-representation node v_new holding both a summary c_sum and the raw content c_raw, then integrates it into the Topic Associative Network.

  4. Graph-Guided Multi-Factor Retrieval

    During question answering, GAM runs Graph-Guided Multi-Factor Retrieval over the hierarchical graph, using temporal, confidence, and role factors to rank candidate memories.
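The buffering-and-consolidation loop in steps 1–3 can be sketched as a small two-state machine. This is a minimal illustration only: the cosine-similarity boundary test stands in for the paper's LLM discriminator, and `embed`, `summarize`, and the divergence threshold are assumptions; only the 2048-token buffer limit comes from the paper.

```python
BUFFER_LIMIT = 2048  # episodic buffer size in tokens (from the paper)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def handle_session(utterances, embed, summarize, network,
                   divergence_threshold=0.35):
    """Alternate between episodic buffering and semantic consolidation."""
    buffer, buffer_tokens, topic_vec = [], 0, None
    for u in utterances:
        v = embed(u)
        # Boundary detection: consolidate on topic shift or buffer overflow.
        diverged = (topic_vec is not None
                    and cosine(v, topic_vec) < divergence_threshold)
        if buffer and (diverged or buffer_tokens >= BUFFER_LIMIT):
            # Consolidation state: fold the buffered episode into the
            # global network as a summary plus its raw content.
            network.append({"c_sum": summarize(buffer),
                            "c_raw": list(buffer)})
            buffer, buffer_tokens, topic_vec = [], 0, None
        # Buffering state: updates stay isolated from the global network.
        buffer.append(u)
        buffer_tokens += len(u.split())
        topic_vec = v if topic_vec is None else [
            0.8 * t + 0.2 * x for t, x in zip(topic_vec, v)]
    if buffer:  # flush the final episode at session end
        network.append({"c_sum": summarize(buffer), "c_raw": list(buffer)})
    return network
```

The key property is that the global store grows only at semantically complete boundaries, never mid-episode.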

KEY CONTRIBUTIONS

Key Contributions

  • Hierarchical Graph Memory Architecture

    GAM introduces a Hierarchical Graph Memory Architecture that separates the global Topic Associative Network from local Event Progression Graphs, structurally mitigating interference between rapid encoding and long-term retention.

  • State-Based Memory Consolidation mechanism

    GAM designs a State-Based Memory Consolidation mechanism that switches between Episodic Buffering State and Semantic Consolidation State using semantic divergence detection instead of arbitrary token or time thresholds.

  • Graph-Guided Multi-Factor Retrieval strategy

    GAM proposes a Graph-Guided Multi-Factor Retrieval strategy that combines graph traversal with temporal, confidence, and role-based boosting factors β_time = 1.4, β_role = 1.4, and β_conf = 1.2 for precise recall.
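A minimal sketch of the boosted scoring, assuming a simple multiplicative form: the β values come from the paper, but the recency function, the boost thresholds, and the candidate fields below are illustrative assumptions.

```python
import math

BETA_TIME, BETA_ROLE, BETA_CONF = 1.4, 1.4, 1.2

def score(candidate, query, now):
    """Boost a graph-retrieved candidate by time, role, and confidence."""
    s = candidate["similarity"]  # base relevance, e.g. embedding similarity
    # Temporal factor: boost recent memories (assumed exponential recency).
    recency = math.exp(-(now - candidate["timestamp"]) / 100.0)
    if recency > 0.5:
        s *= BETA_TIME
    # Role factor: boost memories whose speaker matches the query subject.
    if candidate["speaker"] == query.get("subject"):
        s *= BETA_ROLE
    # Confidence factor: boost high-confidence consolidated nodes.
    if candidate["confidence"] > 0.8:
        s *= BETA_CONF
    return s

def retrieve(candidates, query, now, k=3):
    """Rank graph-traversal candidates by the boosted multi-factor score."""
    return sorted(candidates, key=lambda c: score(c, query, now),
                  reverse=True)[:k]
```

In this form, a memory from the right speaker at the right time can outrank a slightly more similar but stale or off-role memory, which is the point of the multi-factor design.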

RESULTS

By the Numbers

Average F1

40.00

+4.62 over Mem0 on LoCoMo with Qwen 2.5-7B

Average BLEU-1

32.99

+4.32 over Mem0 on LoCoMo with Qwen 2.5-7B

Average F1 LongDialQA

12.55

+5.79 over MemoryOS with Qwen 2.5-7B

Tokens per Query

1370.18

11% fewer tokens than Mem0’s 1533.94 on LoCoMo

On LoCoMo, which tests multi-hop, temporal, open-domain, and single-hop reasoning, GAM reaches 40.00 Average F1 with Qwen 2.5-7B compared to Mem0’s 35.38. On LongDialQA’s multi-party TV-script dialogues, GAM achieves 12.55 Average F1 with Qwen 2.5-7B versus MemoryOS at 6.76, showing that GAM’s hierarchical memory substantially improves long-context reasoning accuracy.


BENCHMARK

LoCoMo Average F1 with Qwen 2.5-7B

Average F1 on LoCoMo across memory systems using the Qwen 2.5-7B-Instruct backbone.

BENCHMARK

LongDialQA Average F1 with Qwen 2.5-7B

Average F1 on LongDialQA for different memory systems using Qwen 2.5-7B-Instruct.

KEY INSIGHT

The Counterintuitive Finding

On LoCoMo, GAM uses only 1370.18 tokens per query yet reaches 40.00 Average F1, while Mem0 needs 1533.94 tokens for 35.38 F1.

This is surprising because unified stream systems like Mem0 expose more raw context, but GAM’s selective consolidation and retrieval show that less, well-structured memory can yield higher accuracy.

WHY IT MATTERS

What this unlocks for the field

GAM unlocks agentic memory that can rapidly encode live dialogue while protecting long-term knowledge from contamination using hierarchical graphs and semantic-event-triggered consolidation.

Builders can now design LLM agents that sustain multi-session, multi-party conversations with stable, inspectable memory graphs instead of opaque, ever-growing context windows.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
