PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering

Authors: Yiming Du, Hongru Wang, Zhengyi Zhao et al.

2024

TL;DR

PerLTQA runs Memory Classification, Memory Retrieval, and Memory Synthesis over an 8,593-QA personal long-term memory benchmark, showing that BERT-base reaches 0.957 weighted F1 on memory type classification.

THE PROBLEM

Personal QA lacks a benchmark that integrates semantic and episodic long-term memory

Existing QA and dialogue datasets rarely combine profiles, social relationships, events, and dialogues into a single benchmark; PerLTQA does, with 8,593 questions grounded in the long-term memories of 30 characters.

Without unified semantic and episodic memory, personal QA systems cannot reliably answer questions grounded in long-term histories, harming personalization and factual consistency.

HOW IT WORKS

PerLTQA framework — Memory Classification, Retrieval, and Synthesis

PerLTQA routes questions through Memory Classification, Memory Retrieval, and Memory Synthesis over a structured memory database of profiles, social relationships, events, and dialogues.

You can think of PerLTQA as a brain that first decides which memory drawer to open, then pulls the relevant cards, and finally writes a coherent answer from those cards.

This staged design lets PerLTQA integrate targeted long-term memories beyond a fixed context window, enabling precise, memory-grounded answers instead of generic or hallucinated replies.
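To make this staged design concrete, here is a minimal runnable sketch of the query-time flow in Python. Everything in it, from the toy memory database to the keyword classifier and word-overlap retriever, is an illustrative stand-in for the paper's actual components (a BERT-base classifier and BM25/DPR/Contriever retrievers), not released code.

```python
# Toy sketch of PerLTQA's staged query-time flow (classify -> retrieve ->
# rescore -> synthesize). All names and data are illustrative stand-ins.

MEMORY_DB = {
    "semantic": ["Alice is a pediatrician.", "Alice's brother is Bob."],
    "episodic": ["Last May, Alice ran the city marathon.",
                 "Alice told Bob she dislikes coffee."],
}

def classify_memory_type(question):
    # Stand-in for the BERT-base classifier: a keyword heuristic that
    # returns a probability per memory type.
    episodic_cues = ("when", "last", "told", "happened")
    p_epi = 0.9 if any(c in question.lower() for c in episodic_cues) else 0.2
    return {"semantic": 1.0 - p_epi, "episodic": p_epi}

def retrieve(question, memories, k=2):
    # Stand-in for BM25/DPR/Contriever: rank memories by word overlap.
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(m.lower().split())), m) for m in memories]
    return sorted(scored, reverse=True)[:k]

def answer(question, k=2):
    probs = classify_memory_type(question)        # Stage 1: classification
    candidates = []                               # Stage 2: k per type -> 2k total
    for mem_type, p in probs.items():
        for score, mem in retrieve(question, MEMORY_DB[mem_type], k):
            candidates.append((p * score, mem))   # crude stand-in for rescoring
    top = [m for _, m in sorted(candidates, reverse=True)[:k]]
    # Stage 3: synthesis would prompt an LLM with `top`; here we just return it.
    return top

print(answer("When did Alice run the marathon?"))
```

Running this returns the marathon event first, illustrating how classification probabilities steer the system toward episodic memories before any answer is written.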

DIAGRAM

PerLTQA query-time flow from question to memory-grounded answer

This diagram shows how PerLTQA processes a single evaluation question through memory classification, retrieval, rescoring, and synthesis.

DIAGRAM

PerLTQA dataset generation pipeline for personal long-term memory

This diagram shows the six-step pipeline PerLTQA uses to construct profiles, relationships, events, dialogues, and QA pairs.

PROCESS

How PerLTQA Handles a Memory Question Answering Task

  1. Memory Classification

    PerLTQA first runs the question through Memory Classification, often with a BERT-base classifier, to decide between semantic and episodic memory types.

  2. Memory Retrieval

    PerLTQA then uses retrieval models like BM25, DPR, or Contriever to fetch k memories per type from the PerLT Memory database.

  3. Rescoring with Classification Probabilities

    PerLTQA rescores each candidate as α·P(π|m_i) + β·sigmoid(s_i), where P(π|m_i) is the Memory Classification probability for memory m_i and s_i is its raw retrieval score, re-ranking the 2k candidates (a runnable sketch follows this list).

  4. Memory Synthesis

    Finally, PerLTQA feeds the re-ranked memories and question into Memory Synthesis with LLMs such as gpt-3.5-turbo to generate answers under 50 words.
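Step 3's rescoring rule is simple enough to write out directly. Below is a minimal Python sketch; the candidate tuple format, the α and β values, and the function names are assumptions for illustration, since the paper tunes its own weights.

```python
import math

def rescore(candidates, alpha=0.5, beta=0.5):
    """Re-rank retrieved memories by alpha * P(pi|m_i) + beta * sigmoid(s_i).

    `candidates` is a list of (memory, class_prob, retrieval_score) triples;
    the probabilities come from Memory Classification and the raw scores from
    the retriever (BM25/DPR/Contriever). The alpha/beta values here are
    illustrative defaults, not the paper's tuned settings.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    rescored = [(alpha * p + beta * sigmoid(s), mem) for mem, p, s in candidates]
    # Highest combined score first; the top entries feed Memory Synthesis.
    return [mem for _, mem in sorted(rescored, key=lambda t: t[0], reverse=True)]

# Example: 2k candidates (k per memory type) with classifier probs and raw scores.
cands = [("event: marathon", 0.9, 2.3), ("profile: pediatrician", 0.1, 3.1)]
print(rescore(cands))
```

The sigmoid squashes heterogeneous retriever scores into [0, 1] so they can be combined with classification probabilities on a common scale.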

KEY CONTRIBUTIONS

Key Contributions

  • PerLTQA dataset introduction

    PerLTQA releases a memory database with 141 profiles, 1,339 social relationships, 4,501 events, 3,409 dialogues, and 8,593 memory-related questions for 30 characters.

  • Three subtasks for memory utilization

    PerLTQA defines Memory Classification, Memory Retrieval, and Memory Synthesis subtasks to systematically evaluate how LLMs use personal long-term memory.

  • Benchmarking LLMs and retrievers

    PerLTQA benchmarks five LLMs and three retrievers, showing BERT-base reaches 0.957 F1 in Memory Classification and gpt-3.5-turbo attains MAP 0.756 in Memory Synthesis.

RESULTS

By the Numbers

Weighted F1 (classification): 0.957, +0.242 over ChatGLM2-6B few-shot (0.715)

Accuracy (classification): 0.956, +0.144 over gpt-3.5-turbo with instruction prompting (0.812, approximated from the P/R/F1 trend)

MAP (synthesis, W-MC+R): 0.756, +0.052 over Baichuan2-7B (0.704)

Recall@1 (retrieval): 0.705 with BM25, +0.103 over DPR (0.602)

PerLTQA evaluates Memory Classification, Retrieval, and Synthesis on its 8,593 QA benchmark, focusing on semantic and episodic personal memories. The results show PerLTQA’s pipeline lets BERT-base and gpt-3.5-turbo achieve strong memory typing and memory-grounded answering compared to open-source LLMs and unsupervised retrievers.

BENCHMARK

Memory classification performance on PerLTQA

Weighted F1 for memory type classification on the PerLTQA test set.

BENCHMARK

Memory synthesis MAP with and without retrieval on PerLTQA

MAP of memory anchors for different LLMs on PerLTQA under the W-MC+R setting (with Memory Classification and Retrieval).

KEY INSIGHT

The Counterintuitive Finding

PerLTQA shows that gpt-3.5-turbo’s few-shot Memory Classification actually drops compared to instruction-only, with recall falling from 0.668 to 0.511.

This is surprising because few-shot prompting is usually assumed to help LLMs, yet PerLTQA reveals that extra examples can confuse memory-type decisions.

WHY IT MATTERS

What this unlocks for the field

PerLTQA unlocks a controlled way to test how systems integrate semantic and episodic personal memories across 8,593 QA pairs with dense anchor annotations.

Builders can now prototype memory-augmented agents, swap classifiers and retrievers, and quantify gains in memory-grounded answering rather than relying on ad-hoc long-context prompts.
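As a sketch of that swap-and-measure workflow, the snippet below defines interchangeable classifier and retriever plug-in points and a Recall@1 evaluation loop. The interface and names are hypothetical, not an official PerLTQA API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical plug-in points for a PerLTQA-style harness: any classifier
# mapping a question to memory-type probabilities, and any retriever mapping
# a question plus a memory pool to a ranked list, can share one benchmark loop.
Classifier = Callable[[str], Dict[str, float]]
Retriever = Callable[[str, List[str]], List[str]]

def recall_at_1(retriever: Retriever,
                qa_pairs: List[Tuple[str, str]],
                memory_pool: List[str]) -> float:
    """Fraction of questions whose gold memory is ranked first."""
    hits = 0
    for question, gold_memory in qa_pairs:
        ranked = retriever(question, memory_pool)
        hits += int(bool(ranked) and ranked[0] == gold_memory)
    return hits / len(qa_pairs)

# Swapping components is then a one-line change, e.g. comparing
# recall_at_1(bm25_retrieve, qa_pairs, pool) against
# recall_at_1(dpr_retrieve, qa_pairs, pool) on the same QA pairs.
```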

Related papers

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al.

arXiv 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
