AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

2026

TL;DR

AgenticAI-DialogGen uses LLM agents plus topic-specific knowledge graphs to generate the TopicGuidedChat (TGC) dataset; a Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1 on memory QA.



THE PROBLEM

LLMs lack topic continuity and memory-grounded data for long conversations

AgenticAI-DialogGen highlights that existing datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation, limiting memory evaluation.

Without structured short- and long-term context, LLM-based conversational systems struggle to maintain persona-consistent behavior and realistic topic-guided dialogue over extended interactions.

HOW IT WORKS

AgenticAI-DialogGen — modular agents plus topic-specific knowledge graphs

AgenticAI-DialogGen orchestrates ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to transform unstructured conversations into topic-guided, persona-grounded data.

You can think of AgenticAI-DialogGen as a pipeline that first builds a card catalog of facts and topics, then uses them as a script for simulated speakers with persistent traits.

This design lets AgenticAI-DialogGen encode long-term memory in compact knowledge graphs while generating new short-term conversations and QA pairs that a plain context window alone cannot support.
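One way to picture this orchestration: each agent reads a shared state and adds its own output before handing off to the next module. The skeleton below is a hypothetical sketch, not the authors' implementation; the agent functions and state keys are illustrative stand-ins for the modules named above.

```python
# Hypothetical orchestration skeleton: each "agent" is a function that
# reads the shared state dict and adds its output under a new key.
def chat_preprocessor(state):
    """Stand-in for the ChatPreprocessor module: normalize raw turns."""
    state["turns"] = [t.strip() for t in state["raw_chat"]]
    return state

def knowledge_extractor(state):
    """Stand-in for the KnowledgeExtractor module.
    A real agent would prompt an LLM for triples; this placeholder
    just tags each turn with a dummy relation."""
    state["triples"] = [("speaker_A", "mentions", t) for t in state["turns"]]
    return state

# The full pipeline would continue with TopicAnalyzer,
# KnowledgeGraphBuilder, PersonaGenerator, and so on.
PIPELINE = [chat_preprocessor, knowledge_extractor]

def run_pipeline(raw_chat):
    state = {"raw_chat": raw_chat}
    for agent in PIPELINE:
        state = agent(state)
    return state

result = run_pipeline(["I adopted a dog ", "What breed?"])
```

The point of the shared-state design is that each module stays independently testable while downstream agents can read everything produced upstream.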

DIAGRAM

Topic-guided conversation simulation between personas

This diagram shows how AgenticAI-DialogGen uses persona profiles and topic-specific knowledge to simulate multi-turn topic-guided conversations with turn alternation.

DIAGRAM

Generation and evaluation pipeline for TGC and memory QA

This diagram shows how AgenticAI-DialogGen builds the TGC dataset and evaluates LLMs on memory-grounded QA using structured and unstructured long-term context.

PROCESS

How AgenticAI-DialogGen Handles TopicGuidedChat Generation

  1. 01

    ChatPreprocessor Module

    AgenticAI-DialogGen uses ChatPreprocessor to assign consistent speaker identifiers, normalize text, and produce an ordered sequence of turns T for each speaker pair.

  2. 02

    KnowledgeExtractor Module

    AgenticAI-DialogGen feeds the cleaned conversation into KnowledgeExtractor, which extracts subject-relation-object triples K with turn indices and filters malformed or low-confidence triples.

  3. 03

    TopicAnalyzer Module

    AgenticAI-DialogGen runs TopicAnalyzer to cluster triples into topic groups T_topics with names, keyword sets, and importance scores, then selects the top N topics per speaker pair.

  4. 04

    KnowledgeGraphBuilder Module

    AgenticAI-DialogGen invokes KnowledgeGraphBuilder to convert topic triples into MultiDiGraph knowledge graphs G_j and derive speaker-specific knowledge sets Kn_j for downstream persona and QA generation.
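The four steps above can be sketched as plain functions. This is a deliberately minimal, hypothetical reduction: in the paper each module is an LLM agent, whereas here extraction is a toy pattern match, topics are grouped by relation, and a dict-based adjacency structure stands in for the MultiDiGraph.

```python
from collections import defaultdict

def preprocess(raw_turns):
    """ChatPreprocessor: assign speaker ids and an ordered turn index."""
    return [
        {"turn": i, "speaker": f"S{i % 2 + 1}", "text": t.strip()}
        for i, t in enumerate(raw_turns)
    ]

def extract_triples(turns):
    """KnowledgeExtractor: (subject, relation, object, turn) facts.
    A real extractor prompts an LLM; this toy version only matches
    three-word 'X likes Y' turns."""
    triples = []
    for t in turns:
        words = t["text"].split()
        if len(words) == 3 and words[1] == "likes":
            triples.append((words[0], "likes", words[2], t["turn"]))
    return triples

def group_topics(triples):
    """TopicAnalyzer: cluster triples into topic groups (here: by relation)."""
    topics = defaultdict(list)
    for s, r, o, turn in triples:
        topics[r].append((s, r, o, turn))
    return dict(topics)

def build_graph(topic_triples):
    """KnowledgeGraphBuilder: a MultiDiGraph-like structure G_j, keyed
    (subject, object) -> list of (relation, turn) edges, so parallel
    edges between the same node pair are preserved."""
    graph = defaultdict(list)
    for s, r, o, turn in topic_triples:
        graph[(s, o)].append((r, turn))
    return dict(graph)

turns = preprocess(["Alice likes hiking", "Nice!", "Bob likes chess"])
topics = group_topics(extract_triples(turns))
g = build_graph(topics["likes"])
```

Keeping the turn index on every edge is what lets downstream QA generation ask when a fact was introduced, not just whether it holds.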

KEY CONTRIBUTIONS

Key Contributions

  • 01

    AgenticAI-DialogGen framework

    AgenticAI-DialogGen introduces a modular agent-based framework combining KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, and DuelingChat Agent to generate persona-grounded, topic-guided conversations without human supervision.

  • 02

    TopicGuidedChat dataset

    AgenticAI-DialogGen releases the TGC dataset with 1,001 persona pairs, 2,950 topics, 88,500 conversational turns, and 59,000 memory QA pairs, encoding long-term knowledge graphs and short-term topic-guided conversations.

  • 03

    Memory-grounded QA evaluation

    AgenticAI-DialogGen demonstrates that Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1 on memory QA, surpassing GPT-4 zero-shot at 83.77 F1 and showing the benefit of structured long-term memory.

RESULTS

By the Numbers

F1

87.36

+3.59 over GPT-4 zero-shot on TGC / KG

Precision

85.51

vs GPT-4 precision 82.10 on TGC / KG

Recall

89.29

higher recall than GPT-4 recall 85.50 on TGC / KG

Exact Match

38.38

+2.51 over Claude 3.5 Sonnet exact match 35.87 on TGC / KG

These metrics come from the memory QA evaluation on the TGC / KG benchmark, which probes short- and long-term recall grounded in knowledge graphs and simulated conversations. The results show that AgenticAI-DialogGen enables lightweight Mistral-7B to surpass larger zero-shot models on structured memory retrieval.
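The page does not reproduce the paper's metric code, but QA precision, recall, F1, and exact match are conventionally computed at the token level (SQuAD-style). The sketch below shows that standard computation under that assumption; the example strings are invented.

```python
from collections import Counter

def qa_scores(prediction: str, gold: str) -> dict:
    """Token-level precision/recall/F1 and exact match for one QA pair."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts shared tokens, honoring repeats.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        p = r = f1 = 0.0
    else:
        p = overlap / len(pred_toks)
        r = overlap / len(gold_toks)
        f1 = 2 * p * r / (p + r)
    return {
        "precision": p,
        "recall": r,
        "f1": f1,
        "exact_match": float(pred_toks == gold_toks),
    }

# Partial overlap: a verbose prediction containing the full gold answer.
scores = qa_scores("she moved to sydney in 2019", "moved to sydney")
```

Corpus-level numbers like the 87.36 F1 above are then averages of these per-question scores across the benchmark.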


BENCHMARK

Performance comparison of LLMs on memory QA evaluation (zero-shot and few-shot settings)

F1 on memory-grounded QA for TGC / KG across zero-shot GPT-4 and fine-tuned lightweight models.

KEY INSIGHT

The Counterintuitive Finding

AgenticAI-DialogGen shows that Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1, beating GPT-4 zero-shot at 83.77 F1 on the same memory QA task.

This is surprising because many practitioners assume larger proprietary LLMs always dominate, yet structured knowledge graphs plus topic-guided training data let a smaller model close the gap and surpass them.

WHY IT MATTERS

What this unlocks for the field

AgenticAI-DialogGen unlocks scalable generation of topic-guided, persona-grounded datasets with explicit short- and long-term memory, including 59,000 QA pairs grounded in knowledge graphs and conversations.

Builders can now fine-tune lightweight LLMs to exhibit strong memory-aware behavior using TGC, making long-context, persona-consistent conversational agents practical without relying solely on massive context windows.


