GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

Authors: Zhaofen Wu, Hanrong Zhang, Fulin Lin et al.

2026

TL;DR

GAM uses state-based hierarchical graph memory with semantic-event-triggered consolidation to reach 40.00 F1 on LoCoMo vs 35.38 for Mem0 (+4.62).


THE PROBLEM

Unified streams cause memory loss and semantic drift in long dialogs

GAM targets unified stream-based memory systems where continuous updates cause Memory Loss and Semantic Drift, corrupting long-term knowledge during ongoing interactions.

These failures break long-horizon LLM agents on datasets like LoCoMo and LongDialQA, where agents must maintain coherent long-term interactions across evolving multi-session dialogues.

HOW IT WORKS

GAM — Hierarchical Graph-based Agentic Memory

GAM’s core mechanism combines a Hierarchical Graph Memory Architecture, Topic Associative Network, Event Progression Graphs, State-Based Memory Consolidation, and Graph-Guided Multi-Factor Retrieval to separate rapid encoding from stable storage.

You can think of GAM like a brain with a short-term hippocampus buffer and a long-term cortical knowledge graph, plus a librarian that files only complete stories into the archive.

This design lets GAM update memory only at semantically complete boundaries, enabling precise, low-interference recall that a plain context window or flat vector store cannot match.
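As a rough sketch of this two-layer design, the Python below models a local Event Progression Graph (fast episodic encoding) and a global Topic Associative Network (stable storage) as separate structures that only touch at consolidation time. All class and field names here are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class EventUnit:
    """Atomic event extracted from one utterance (local layer)."""
    text: str
    speaker: str
    turn: int

@dataclass
class EventProgressionGraph:
    """Local graph that buffers events for the current episode."""
    events: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (i, j) progression links

    def append(self, unit: EventUnit) -> None:
        # Link each new event to its predecessor to record progression.
        if self.events:
            self.edges.append((len(self.events) - 1, len(self.events)))
        self.events.append(unit)

@dataclass
class TopicNode:
    """Consolidated episode stored in the global layer."""
    summary: str                 # abstractive summary of the episode
    raw: str                     # concatenated raw content
    neighbors: list = field(default_factory=list)

class TopicAssociativeNetwork:
    """Global layer: only modified when an episode is consolidated."""
    def __init__(self):
        self.nodes = []

    def integrate(self, node: TopicNode) -> None:
        # Similarity-based linking to existing topics is omitted here.
        self.nodes.append(node)
```

Because incoming utterances only mutate the local graph, the global network stays untouched during rapid encoding, which is the structural separation the architecture relies on.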

DIAGRAM

Graph-Guided Multi-Factor Retrieval Pipeline

This diagram shows how GAM performs graph-guided multi-factor retrieval from the Topic Associative Network and archived Event Progression Graphs to answer a query.

DIAGRAM

State-Based Memory Consolidation Lifecycle

This diagram shows how GAM switches between Episodic Buffering State and Semantic Consolidation State using semantic boundary detection.

PROCESS

How GAM Handles a Dialogue Session — State-Based Memory Consolidation

  1. Episodic Buffering Phase

    GAM builds Event Progression Graphs from incoming utterances, appending atomic event units and edges while isolating updates from the Topic Associative Network.

  2. Semantic Boundary Detection

    GAM uses State-Based Memory Consolidation with an LLM discriminator and a 2048-token episodic buffer to detect semantic divergence and set the boundary marker b_t when topics shift.

  3. Semantic Consolidation Phase

    GAM summarizes the buffered Event Progression Graph into a dual-representation node v_new holding both a summary c_sum and the raw content c_raw, then integrates it into the Topic Associative Network.

  4. Graph-Guided Multi-Factor Retrieval

    During question answering, GAM runs Graph-Guided Multi-Factor Retrieval over the hierarchical graph, using temporal, confidence, and role factors to rank candidate memories.
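The buffering-and-consolidation loop in steps 1–3 can be sketched as a small two-state machine. This is a minimal illustration only: the cosine-similarity boundary test stands in for the paper's LLM discriminator, and `embed`, `summarize`, and the divergence threshold are assumptions; only the 2048-token buffer limit comes from the paper.

```python
BUFFER_LIMIT = 2048  # episodic buffer size in tokens (from the paper)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def handle_session(utterances, embed, summarize, network,
                   divergence_threshold=0.35):
    """Alternate between episodic buffering and semantic consolidation."""
    buffer, buffer_tokens, topic_vec = [], 0, None
    for u in utterances:
        v = embed(u)
        # Boundary detection: consolidate on topic shift or buffer overflow.
        diverged = (topic_vec is not None
                    and cosine(v, topic_vec) < divergence_threshold)
        if buffer and (diverged or buffer_tokens >= BUFFER_LIMIT):
            # Consolidation state: fold the buffered episode into the
            # global network as a summary plus its raw content.
            network.append({"c_sum": summarize(buffer),
                            "c_raw": list(buffer)})
            buffer, buffer_tokens, topic_vec = [], 0, None
        # Buffering state: updates stay isolated from the global network.
        buffer.append(u)
        buffer_tokens += len(u.split())
        topic_vec = v if topic_vec is None else [
            0.8 * t + 0.2 * x for t, x in zip(topic_vec, v)]
    if buffer:  # flush the final episode at session end
        network.append({"c_sum": summarize(buffer), "c_raw": list(buffer)})
    return network
```

The key property is that the global store grows only at semantically complete boundaries, never mid-episode.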

KEY CONTRIBUTIONS

Key Contributions

  • Hierarchical Graph Memory Architecture

    GAM introduces a Hierarchical Graph Memory Architecture that separates the global Topic Associative Network from local Event Progression Graphs, structurally mitigating interference between rapid encoding and long-term retention.

  • State-Based Memory Consolidation mechanism

    GAM designs a State-Based Memory Consolidation mechanism that switches between Episodic Buffering State and Semantic Consolidation State using semantic divergence detection instead of arbitrary token or time thresholds.

  • Graph-Guided Multi-Factor Retrieval strategy

    GAM proposes a Graph-Guided Multi-Factor Retrieval strategy that combines graph traversal with temporal, confidence, and role-based boosting factors β_time = 1.4, β_role = 1.4, and β_conf = 1.2 for precise recall.
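A minimal sketch of the boosted scoring, assuming a simple multiplicative form: the β values come from the paper, but the recency function, the boost thresholds, and the candidate fields below are illustrative assumptions.

```python
import math

BETA_TIME, BETA_ROLE, BETA_CONF = 1.4, 1.4, 1.2

def score(candidate, query, now):
    """Boost a graph-retrieved candidate by time, role, and confidence."""
    s = candidate["similarity"]  # base relevance, e.g. embedding similarity
    # Temporal factor: boost recent memories (assumed exponential recency).
    recency = math.exp(-(now - candidate["timestamp"]) / 100.0)
    if recency > 0.5:
        s *= BETA_TIME
    # Role factor: boost memories whose speaker matches the query subject.
    if candidate["speaker"] == query.get("subject"):
        s *= BETA_ROLE
    # Confidence factor: boost high-confidence consolidated nodes.
    if candidate["confidence"] > 0.8:
        s *= BETA_CONF
    return s

def retrieve(candidates, query, now, k=3):
    """Rank graph-traversal candidates by the boosted multi-factor score."""
    return sorted(candidates, key=lambda c: score(c, query, now),
                  reverse=True)[:k]
```

In this form, a memory from the right speaker at the right time can outrank a slightly more similar but stale or off-role memory, which is the point of the multi-factor design.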

RESULTS

By the Numbers

Average F1

40.00

+4.62 over Mem0 on LoCoMo with Qwen 2.5-7B

Average BLEU-1

32.99

+4.32 over Mem0 on LoCoMo with Qwen 2.5-7B

Average F1 LongDialQA

12.55

+5.79 over MemoryOS with Qwen 2.5-7B

Tokens per Query

1370.18

11% fewer tokens than Mem0’s 1533.94 on LoCoMo

On LoCoMo, which tests multi-hop, temporal, open-domain, and single-hop reasoning, GAM reaches 40.00 Average F1 with Qwen 2.5-7B compared to Mem0’s 35.38. On LongDialQA’s multi-party TV-script dialogues, GAM achieves 12.55 Average F1 with Qwen 2.5-7B versus MemoryOS at 6.76, showing that GAM’s hierarchical memory substantially improves long-context reasoning accuracy.


BENCHMARK

LoCoMo Average F1 with Qwen 2.5-7B

Average F1 on LoCoMo across memory systems using the Qwen 2.5-7B-Instruct backbone.

BENCHMARK

LongDialQA Average F1 with Qwen 2.5-7B

Average F1 on LongDialQA for different memory systems using Qwen 2.5-7B-Instruct.

KEY INSIGHT

The Counterintuitive Finding

On LoCoMo, GAM uses only 1370.18 tokens per query yet reaches 40.00 Average F1, while Mem0 needs 1533.94 tokens for 35.38 F1.

This is surprising because unified stream systems like Mem0 expose more raw context, but GAM’s selective consolidation and retrieval show that less, well-structured memory can yield higher accuracy.

WHY IT MATTERS

What this unlocks for the field

GAM unlocks agentic memory that can rapidly encode live dialogue while protecting long-term knowledge from contamination using hierarchical graphs and semantic-event-triggered consolidation.

Builders can now design LLM agents that sustain multi-session, multi-party conversations with stable, inspectable memory graphs instead of opaque, ever-growing context windows.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
