AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

2026

TL;DR

AgenticAI-DialogGen uses LLM agents plus topic-specific knowledge graphs to generate the TopicGuidedChat (TGC) dataset; a Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1 on memory QA.



THE PROBLEM

LLMs lack topic continuity and memory-grounded data for long conversations

AgenticAI-DialogGen highlights that existing datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation, limiting memory evaluation.

Without structured short- and long-term context, LLM-based conversational systems struggle to maintain persona-consistent behavior and realistic topic-guided dialogue over extended interactions.

HOW IT WORKS

AgenticAI-DialogGen — modular agents plus topic-specific knowledge graphs

AgenticAI-DialogGen orchestrates ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to transform unstructured conversations into topic-guided, persona-grounded data.

You can think of AgenticAI-DialogGen as a pipeline that first builds a card catalog of facts and topics, then uses them as a script for simulated speakers with persistent traits.

This design lets AgenticAI-DialogGen encode long-term memory in compact knowledge graphs while generating new short-term conversations and QA pairs that a plain context window alone cannot support.
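One way to picture this orchestration: each agent reads a shared state and adds its own output before handing off to the next module. The skeleton below is a hypothetical sketch, not the authors' implementation; the agent functions and state keys are illustrative stand-ins for the modules named above.

```python
# Hypothetical orchestration skeleton: each "agent" is a function that
# reads the shared state dict and adds its output under a new key.
def chat_preprocessor(state):
    """Stand-in for the ChatPreprocessor module: normalize raw turns."""
    state["turns"] = [t.strip() for t in state["raw_chat"]]
    return state

def knowledge_extractor(state):
    """Stand-in for the KnowledgeExtractor module.
    A real agent would prompt an LLM for triples; this placeholder
    just tags each turn with a dummy relation."""
    state["triples"] = [("speaker_A", "mentions", t) for t in state["turns"]]
    return state

# The full pipeline would continue with TopicAnalyzer,
# KnowledgeGraphBuilder, PersonaGenerator, and so on.
PIPELINE = [chat_preprocessor, knowledge_extractor]

def run_pipeline(raw_chat):
    state = {"raw_chat": raw_chat}
    for agent in PIPELINE:
        state = agent(state)
    return state

result = run_pipeline(["I adopted a dog ", "What breed?"])
```

The point of the shared-state design is that each module stays independently testable while downstream agents can read everything produced upstream.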

DIAGRAM

Topic-guided conversation simulation between personas

This diagram shows how AgenticAI-DialogGen uses persona profiles and topic-specific knowledge to simulate multi-turn topic-guided conversations with turn alternation.

DIAGRAM

Generation and evaluation pipeline for TGC and memory QA

This diagram shows how AgenticAI-DialogGen builds the TGC dataset and evaluates LLMs on memory-grounded QA using structured and unstructured long-term context.

PROCESS

How AgenticAI-DialogGen Handles TopicGuidedChat Generation

  1. 01

    ChatPreprocessor Module

    AgenticAI-DialogGen uses ChatPreprocessor to assign consistent speaker identifiers, normalize text, and produce an ordered sequence of turns T for each speaker pair.

  2. 02

    KnowledgeExtractor Module

    AgenticAI-DialogGen feeds the cleaned conversation into KnowledgeExtractor, which extracts subject-relation-object triples K with turn indices and filters malformed or low-confidence triples.

  3. 03

    TopicAnalyzer Module

    AgenticAI-DialogGen runs TopicAnalyzer to cluster triples into topic groups T_topics with names, keyword sets, and importance scores, then selects the top N topics per speaker pair.

  4. 04

    KnowledgeGraphBuilder Module

    AgenticAI-DialogGen invokes KnowledgeGraphBuilder to convert topic triples into MultiDiGraph knowledge graphs G_j and derive speaker-specific knowledge sets Kn_j for downstream persona and QA generation.
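The four steps above can be sketched as plain functions. This is a deliberately minimal, hypothetical reduction: in the paper each module is an LLM agent, whereas here extraction is a toy pattern match, topics are grouped by relation, and a dict-based adjacency structure stands in for the MultiDiGraph.

```python
from collections import defaultdict

def preprocess(raw_turns):
    """ChatPreprocessor: assign speaker ids and an ordered turn index."""
    return [
        {"turn": i, "speaker": f"S{i % 2 + 1}", "text": t.strip()}
        for i, t in enumerate(raw_turns)
    ]

def extract_triples(turns):
    """KnowledgeExtractor: (subject, relation, object, turn) facts.
    A real extractor prompts an LLM; this toy version only matches
    three-word 'X likes Y' turns."""
    triples = []
    for t in turns:
        words = t["text"].split()
        if len(words) == 3 and words[1] == "likes":
            triples.append((words[0], "likes", words[2], t["turn"]))
    return triples

def group_topics(triples):
    """TopicAnalyzer: cluster triples into topic groups (here: by relation)."""
    topics = defaultdict(list)
    for s, r, o, turn in triples:
        topics[r].append((s, r, o, turn))
    return dict(topics)

def build_graph(topic_triples):
    """KnowledgeGraphBuilder: a MultiDiGraph-like structure G_j, keyed
    (subject, object) -> list of (relation, turn) edges, so parallel
    edges between the same node pair are preserved."""
    graph = defaultdict(list)
    for s, r, o, turn in topic_triples:
        graph[(s, o)].append((r, turn))
    return dict(graph)

turns = preprocess(["Alice likes hiking", "Nice!", "Bob likes chess"])
topics = group_topics(extract_triples(turns))
g = build_graph(topics["likes"])
```

Keeping the turn index on every edge is what lets downstream QA generation ask when a fact was introduced, not just whether it holds.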

KEY CONTRIBUTIONS

Key Contributions

  • 01

    AgenticAI-DialogGen framework

    AgenticAI-DialogGen introduces a modular agent-based framework combining KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, and DuelingChat Agent to generate persona-grounded, topic-guided conversations without human supervision.

  • 02

    TopicGuidedChat dataset

    AgenticAI-DialogGen releases the TGC dataset with 1,001 persona pairs, 2,950 topics, 88,500 conversational turns, and 59,000 memory QA pairs, encoding long-term knowledge graphs and short-term topic-guided conversations.

  • 03

    Memory-grounded QA evaluation

    AgenticAI-DialogGen demonstrates that Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1 on memory QA, surpassing GPT-4 zero-shot at 83.77 F1 and showing the benefit of structured long-term memory.

RESULTS

By the Numbers

F1

87.36

+3.59 over GPT-4 zero-shot on TGC / KG

Precision

85.51

vs GPT-4 precision 82.10 on TGC / KG

Recall

89.29

higher recall than GPT-4 recall 85.50 on TGC / KG

Exact Match

38.38

+2.51 over Claude 3.5 Sonnet exact match 35.87 on TGC / KG

These metrics come from the memory QA evaluation on the TGC / KG benchmark, which probes short- and long-term recall grounded in knowledge graphs and simulated conversations. The results show that AgenticAI-DialogGen enables lightweight Mistral-7B to surpass larger zero-shot models on structured memory retrieval.
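The page does not reproduce the paper's metric code, but QA precision, recall, F1, and exact match are conventionally computed at the token level (SQuAD-style). The sketch below shows that standard computation under that assumption; the example strings are invented.

```python
from collections import Counter

def qa_scores(prediction: str, gold: str) -> dict:
    """Token-level precision/recall/F1 and exact match for one QA pair."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts shared tokens, honoring repeats.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        p = r = f1 = 0.0
    else:
        p = overlap / len(pred_toks)
        r = overlap / len(gold_toks)
        f1 = 2 * p * r / (p + r)
    return {
        "precision": p,
        "recall": r,
        "f1": f1,
        "exact_match": float(pred_toks == gold_toks),
    }

# Partial overlap: a verbose prediction containing the full gold answer.
scores = qa_scores("she moved to sydney in 2019", "moved to sydney")
```

Corpus-level numbers like the 87.36 F1 above are then averages of these per-question scores across the benchmark.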


BENCHMARK

Performance comparison of LLMs on memory QA evaluation (zero-shot and few-shot settings)

F1 on memory-grounded QA for TGC / KG across zero-shot GPT-4 and fine-tuned lightweight models.

KEY INSIGHT

The Counterintuitive Finding

AgenticAI-DialogGen shows that Mistral-7B fine-tuned on TGC / KG reaches 87.36 F1, beating GPT-4 zero-shot at 83.77 F1 on the same memory QA task.

This is surprising because many practitioners assume larger proprietary LLMs always dominate, yet structured knowledge graphs plus topic-guided training data let a smaller model close the gap and surpass them.

WHY IT MATTERS

What this unlocks for the field

AgenticAI-DialogGen unlocks scalable generation of topic-guided, persona-grounded datasets with explicit short- and long-term memory, including 59,000 QA pairs grounded in knowledge graphs and conversations.

Builders can now fine-tune lightweight LLMs to exhibit strong memory-aware behavior using TGC, making long-context, persona-consistent conversational agents practical without relying solely on massive context windows.


