SGMem: Sentence Graph Memory for Long-Term Conversational Agents

Authors: Yaxiong Wu, Yongyue Zhang, Sheng Liang, Yong Liu

2025

TL;DR

SGMem builds sentence graph memory over chunked dialogue to align raw history with generated memory, reaching 0.700 Accuracy (Top 5) on LongMemEval vs 0.676 for RAG-SMFI.



THE PROBLEM

Memory fragmentation in long-term conversational QA across raw and generated histories

Long-term conversational agents face memory overload: stored content exceeds their ability to manage and retrieve it effectively, causing fragmented context.

Existing RAG-based QA systems scatter information across raw dialogue and generated memory, leading to memory fragmentation that undermines temporal reasoning and coherent, user-tailored responses.

HOW IT WORKS

Sentence Graph Memory framework in SGMem

SGMem comprises two stages, SGMem Construction and Management and SGMem Usage, and builds sentence-level graphs that connect sessions, rounds, turns, sentences, summaries, facts, and insights.

You can think of SGMem as a conversational card catalog: each sentence is an index card linked to related cards and to higher-level folders such as sessions and rounds.

This sentence graph memory lets SGMem traverse multi-hop associations and align raw dialogue with generated memory, providing coherent context that a plain context window cannot assemble.
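The card-catalog structure can be sketched in code. The following is a minimal illustration, not the paper's implementation: the `SentenceGraph` class, the toy chunk IDs, and the Jaccard word-overlap `similarity` function (standing in for real embedding similarity) are all assumptions of this sketch.

```python
# Minimal sketch of a sentence graph memory: chunk nodes (raw turns or
# generated facts/summaries) contain sentence nodes, and each sentence
# is linked to its k most similar peers. Jaccard word overlap stands in
# for real embedding similarity.

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

class SentenceGraph:
    def __init__(self, k=1):
        self.k = k
        self.parent = {}     # sentence id -> containing chunk id
        self.sentences = {}  # sentence id -> text
        self.neighbors = {}  # sentence id -> k most similar sentence ids

    def add_chunk(self, chunk_id, sentences):
        for i, text in enumerate(sentences):
            sid = f"{chunk_id}/s{i}"
            self.parent[sid] = chunk_id   # containment edge
            self.sentences[sid] = text

    def build_edges(self):
        # k-nearest-neighbor similarity edges between sentences
        for sid, text in self.sentences.items():
            scored = sorted(
                ((similarity(text, other), oid)
                 for oid, other in self.sentences.items() if oid != sid),
                reverse=True)
            self.neighbors[sid] = [oid for _, oid in scored[:self.k]]

g = SentenceGraph(k=1)
g.add_chunk("session1/turn1", ["I adopted a puppy last spring.",
                               "Her name is Luna."])
g.add_chunk("fact1", ["The user adopted a puppy named Luna."])
g.build_edges()
```

In the paper's setup, the similarity function would be cosine distance over Sentence-BERT embeddings, with the index tables in a vector database and the edges in a graph database.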

DIAGRAM

Query-time retrieval and expansion in SGMem Usage

This diagram shows how SGMem Usage retrieves heterogeneous memories, expands sentences via graph traversal, ranks chunks, and feeds context to the LLM for QA.

DIAGRAM

Evaluation pipeline and ablation design for SGMem

This diagram shows how SGMem is evaluated on LongMemEval and LoCoMo with different RAG and SGMem variants and hyperparameter ablations.

PROCESS

How SGMem Handles a Long-Term Conversational Question-Answering Session

  1. Processing Conversations

    SGMem decomposes sessions into rounds, turns, and sentences, and uses an LLM to generate summaries, facts, and insights that complement the raw dialogue.

  2. Indexing

    SGMem embeds sessions, rounds, turns, sentences, summaries, facts, and insights with Sentence-BERT, building seven searchable index tables in a vector database.

  3. Constructing Sentence Graph Memory

    SGMem links chunk nodes to their sentences and connects sentences via k-nearest-neighbor similarity edges, forming a sentence graph stored in a graph database.

  4. SGMem Usage

    SGMem retrieves heterogeneous memories, expands sentences via h-hop graph traversal, ranks parent chunks, aggregates the relevant context C_relevant, and conditions the LLM to generate the final answer.
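The retrieve-expand-rank-aggregate loop of the final step can be sketched as follows. This is a toy illustration under assumed names (`answer_context`, the hand-built `sentences` and `edges` tables, Jaccard word overlap in place of embeddings), not the paper's code.

```python
from collections import Counter

# Toy query-time usage: retrieve seed sentences by similarity to the
# query, expand h hops along precomputed similarity edges, rank parent
# chunks by how many reached sentences they contain, and aggregate the
# top chunks' sentences into the context handed to the LLM.

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# sentence id -> (parent chunk id, text); edges: id -> similar ids
sentences = {
    "s1/t1/0": ("s1/t1", "i adopted a puppy last spring"),
    "s1/t1/1": ("s1/t1", "her name is luna"),
    "fact/0":  ("fact",  "the user adopted a puppy named luna"),
}
edges = {"s1/t1/0": ["fact/0"], "s1/t1/1": ["fact/0"],
         "fact/0": ["s1/t1/0", "s1/t1/1"]}

def answer_context(query, seeds=1, hops=1, top_chunks=1):
    # 1. retrieve seed sentences most similar to the query
    ranked = sorted(sentences, reverse=True,
                    key=lambda sid: similarity(query, sentences[sid][1]))
    frontier = set(ranked[:seeds])
    # 2. h-hop expansion over similarity edges
    reached = set(frontier)
    for _ in range(hops):
        frontier = {n for sid in frontier for n in edges.get(sid, [])}
        reached |= frontier
    # 3. rank parent chunks by number of reached sentences
    counts = Counter(sentences[sid][0] for sid in reached)
    best = [chunk for chunk, _ in counts.most_common(top_chunks)]
    # 4. aggregate their sentences into the final context
    return [text for sid, (chunk, text) in sentences.items()
            if chunk in best]

ctx = answer_context("what is the puppy named")
```

Note how the expansion step lets a generated fact pull in the raw dialogue turns it was distilled from, which is the alignment between raw and generated memory that the summary describes.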

KEY CONTRIBUTIONS

Key Contributions

  • Sentence Graph Memory Construction and Management

    SGMem introduces hierarchical sentence graph memory that links chunk nodes and sentences via similarity edges, mitigating memory fragmentation across turns, rounds, and sessions.

  • Sentence Graph Memory Usage

    SGMem designs a multi-hop retrieval mechanism over sentence graphs that integrates raw dialogue with summaries, facts, and insights into a coherent relevant-context set C_relevant for QA.

  • Comprehensive Evaluation

    SGMem is extensively evaluated on LongMemEval and LoCoMo, achieving up to 0.700 Accuracy on LongMemEval and 0.526 on LoCoMo, surpassing strong memory and RAG baselines.

RESULTS

By the Numbers

Accuracy Top 5 LongMemEval

0.700

+0.024 over RAG-SMFI

Accuracy Top 10 LongMemEval

0.730

+0.050 over RAG-SMFI

Accuracy Top 5 LoCoMo

0.526

+0.016 over RAG-SMFI

Accuracy Top 10 LoCoMo

0.532

+0.004 over RAG-SMFI

Table 1 reports Accuracy on LongMemEval and LoCoMo, which test long-term conversational QA with multi-session, temporal, and knowledge-update queries. The gains show that SGMem leverages sentence graph memory to retrieve more coherent context than RAG-SMFI despite using the same underlying LLM.

BENCHMARK


Performance comparison on LongMemEval using Accuracy Top 5

Accuracy Top 5 on LongMemEval for SGMem and strong RAG baselines.

KEY INSIGHT

The Counterintuitive Finding

On LoCoMo, SGMem SF reaches 0.522 Accuracy Top 5, nearly matching SGMem SMFI at 0.526 while using fewer generated memory types.

This is surprising because adding more summaries, facts, and insights is expected to help consistently, but SGMem shows that the marginal gain is small and that excessive generated memory can introduce noise in very long conversations.

WHY IT MATTERS

What this unlocks for the field

SGMem unlocks sentence-level graph memory that aligns raw dialogue and generated memory without expensive entity extraction, enabling coherent retrieval across long, multi-session histories.

Builders can now deploy long-term conversational agents that scale beyond context windows while preserving fine-grained temporal and personalization cues that were previously lost or fragmented.


