Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Authors: Nick Alonso, Tomás Figliolia, Anthony Ndirango, Beren Millidge

2024

TL;DR

Toward Conversational Agents with Context and Time Sensitive Long-term Memory combines Chain-of-Tables metadata search with semantic retrieval to reach 90.32 average recall, versus 31.93 for the best semantic baseline.



THE PROBLEM

Conversational RAG fails on metadata and ambiguity

Toward Conversational Agents with Context and Time Sensitive Long-term Memory shows that standard RAG systems handle ambiguous and time-based questions poorly, especially for long-form dialogues.

When conversational agents cannot retrieve by time, session, or speaker, they fail on realistic queries like “what did we discuss yesterday morning,” breaking long-term assistant behavior.

HOW IT WORKS

Chain-of-tables plus semantic retrieval for conversational memory

Toward Conversational Agents with Context and Time Sensitive Long-term Memory combines a Tabular Chat Database, Classifying Query Type, Chain-of-Tables for Meta-Data Retrieval, and Combining Meta-Data and Semantic Retrieval into one retrieval pipeline.

You can think of the tabular side as a card catalog keyed by timestamps and speakers, while semantic embeddings act like a content-based search engine layered on top.

This hybrid mechanism lets Toward Conversational Agents with Context and Time Sensitive Long-term Memory answer queries like “that thing last Tuesday” that a plain context window or pure vector search cannot reliably resolve.
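A minimal sketch of this hybrid idea, assuming a toy schema: the turn fields, the 2-dimensional embeddings, and the `hybrid_retrieve` helper are illustrative assumptions, not the paper's implementation. Metadata acts as a hard filter (the card catalog), and semantic similarity ranks whatever survives the filter:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Turn:
    speaker: str
    date: str       # ISO date of the session
    session: int
    text: str
    emb: list       # toy embedding; a real system would use a sentence encoder

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def hybrid_retrieve(turns, query_emb, k=2, date=None, speaker=None):
    """Metadata filter first, then semantic top-k over the survivors."""
    pool = [t for t in turns
            if (date is None or t.date == date)
            and (speaker is None or t.speaker == speaker)]
    return sorted(pool, key=lambda t: cosine(t.emb, query_emb), reverse=True)[:k]

log = [
    Turn("Alice", "2024-05-07", 1, "Let's plan the hiking trip.", [1.0, 0.0]),
    Turn("Bob",   "2024-05-07", 1, "I booked the campsite.",      [0.9, 0.2]),
    Turn("Alice", "2024-05-09", 2, "The demo slides are ready.",  [0.0, 1.0]),
]

# "What did we discuss on May 7 about the trip?" -> filter by date, rank by content
hits = hybrid_retrieve(log, query_emb=[1.0, 0.1], k=1, date="2024-05-07")
print(hits[0].text)  # → Let's plan the hiking trip.
```

Note that a pure vector search over all three turns could surface the May 9 turn for a vague query; the date filter makes that impossible, which is the point of the tabular side.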

DIAGRAM

Ambiguous conversational query resolution flow

This diagram shows how Toward Conversational Agents with Context and Time Sensitive Long-term Memory rewrites ambiguous questions using preceding context before routing them to meta-data and semantic retrieval.

DIAGRAM

Evaluation pipeline on LoCoMo temporal benchmark

This diagram shows how Toward Conversational Agents with Context and Time Sensitive Long-term Memory is evaluated on LoCoMo-derived time, ambiguous, and time plus content questions using recall and F2.

PROCESS

How Toward Conversational Agents with Context and Time Sensitive Long-term Memory Handles a Conversational Query

  1. 01

    Tabular Chat Database

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory first logs each response as a row with speaker, date, time, session number, and an index into a Content column.

  2. 02

    Classifying Query Type

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory uses an LLM classifier to decide whether the query needs metadata retrieval, semantic retrieval, or both.

  3. 03

    Chain-of-Tables for Meta-Data Retrieval

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory applies f_value and f_between chains to the Tabular Chat Database to filter rows by time, date, or speaker.

  4. 04

    Combining Meta-Data and Semantic Retrieval

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory runs semantic top-k retrieval over the filtered rows or the full table, then returns the relevant responses for answer generation.
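The metadata path in steps 01 and 03 can be sketched as chained filters over the chat table. `f_value` and `f_between` are the operation names the paper uses; their exact signatures here, and the toy rows, are assumptions for illustration:

```python
from datetime import time

# Toy Tabular Chat Database: one row per response, as in step 01.
rows = [
    {"speaker": "Alice", "date": "2024-05-07", "time": time(9, 15),
     "session": 1, "content_id": 0},
    {"speaker": "Bob",   "date": "2024-05-07", "time": time(14, 30),
     "session": 1, "content_id": 1},
    {"speaker": "Alice", "date": "2024-05-09", "time": time(10, 0),
     "session": 2, "content_id": 2},
]

def f_value(rows, column, value):
    """Keep rows whose column equals value (e.g. a specific date or speaker)."""
    return [r for r in rows if r[column] == value]

def f_between(rows, column, lo, hi):
    """Keep rows whose column falls in [lo, hi] (e.g. a morning time window)."""
    return [r for r in rows if lo <= r[column] <= hi]

# "What did Alice say on the morning of May 7?" as a chain of filters:
step1 = f_value(rows, "date", "2024-05-07")
step2 = f_between(step1, "time", time(6, 0), time(12, 0))
step3 = f_value(step2, "speaker", "Alice")
print([r["content_id"] for r in step3])  # → [0]
```

The surviving content IDs are exactly the rows that step 04 would then hand to semantic top-k retrieval, so an ambiguous phrase like "that thing" only has to be resolved within a small, correctly-dated pool.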

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Dataset of ambiguous and time-based questions

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory builds on LoCoMo by generating 11 time-based query types plus ambiguous and time-plus-content questions, each with an explicit list of relevant responses.

  • 02

    Combined CoTable plus semantic retrieval model

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory integrates Chain-of-Tables for Meta-Data Retrieval with semantic vector search and a metadata-vs-semantic query classifier to handle conversational metadata queries.

  • 03

    Improved recall on temporal conversational tasks

    Toward Conversational Agents with Context and Time Sensitive Long-term Memory reaches 90.32 average recall and 55.27 F2, compared to 31.93 recall and 7.99 F2 for the best Semantic w MetaD baseline.
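The recall and F2 numbers above use standard retrieval metrics: F2 is the F-beta score with beta = 2, which weights recall four times as heavily as precision, fitting a setting where missing a relevant response is costlier than returning an extra one. A small sketch with illustrative example numbers (not the paper's data):

```python
def recall_at(retrieved, relevant):
    """Fraction of the labeled relevant responses that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at(retrieved, relevant):
    """Fraction of the retrieved responses that were actually relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 (F2) weights recall 4x as heavily as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative example: retriever returns 4 responses, 2 of 3 relevant ones found.
retrieved = [3, 7, 9, 12]
relevant = [3, 7, 8]
p = precision_at(retrieved, relevant)   # 2/4
r = recall_at(retrieved, relevant)      # 2/3
print(round(f_beta(p, r), 3))           # → 0.625
```

Because F2 leans toward recall, 0.625 sits closer to the 0.667 recall than to the 0.5 precision, which is why the paper's large recall gains also show up strongly in its F2 numbers.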

RESULTS

By the Numbers

Average Recall

90.32

+58.39 over Semantic w MetaD (k=30)

Average F2

55.27

+47.28 over Semantic w MetaD (k=30)

Time Qs Recall

90.47

+83.00 over Semantic w MetaD (k=30)

Time+Content Qs Recall

90.17

+33.77 over Semantic w MetaD (k=30)

On LoCoMo-derived time-based and time-plus-content questions, Toward Conversational Agents with Context and Time Sensitive Long-term Memory is compared against Semantic and Semantic w MetaD baselines. The main result shows that combining Chain-of-Tables and semantic retrieval dramatically boosts recall on temporal conversational memory tasks.

BENCHMARK


Average Recall and F2 across time-based and time-plus-content questions

Average Recall on LoCoMo-derived temporal queries.

KEY INSIGHT

The Counterintuitive Finding

Toward Conversational Agents with Context and Time Sensitive Long-term Memory shows that Semantic w MetaD reaches only 31.93 average recall, despite embedding metadata directly into the retrieved text.

This is surprising because many practitioners assume concatenating timestamps and speakers into chunks makes pure vector search sufficient, but the numbers show that assumption fails badly for temporal queries.

WHY IT MATTERS

What this unlocks for the field

Toward Conversational Agents with Context and Time Sensitive Long-term Memory unlocks conversational agents that can answer questions keyed by time, session order, and ambiguous pronouns over long histories.

Builders can now design assistants that remember "what we talked about last Tuesday afternoon" without massive context windows, by combining tabular metadata search with semantic retrieval.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
