Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

2026

TL;DR

Chronos combines dual calendars, dynamic prompting, and structured temporal event retrieval to reach 92.60% (Low) and 95.60% (High) accuracy on LongMemEvalS. The Low configuration is a 7.67% relative improvement over EmergenceMem Internal at 86.00%.

THE PROBLEM

Long-term agents lack temporal grounding: EmergenceMem Internal reaches 86.00% where Chronos Low reaches 92.60%

Existing conversational memory systems either overbuild global knowledge graphs or rely on shallow turn retrieval, failing on time-sensitive multi-session queries.

On LongMemEvalS, EmergenceMem Internal reaches only 86.00% accuracy, showing that long-term assistants still mishandle temporal reasoning and knowledge updates across months of interaction.

HOW IT WORKS

Chronos — Dual Calendars and Dynamic Prompting for Temporal Memory

Chronos centers on Event Extraction, Dynamic Prompting, Initial Retrieval, and the Chronos Agent, backed by a structured event calendar and raw turn calendar.

You can think of Chronos like a calendar plus diary: the event calendar is a timestamped index, while the turn calendar is the full narrative notebook.

By explicitly structuring temporal events while keeping full dialogue, Chronos enables precise time filtering and multi-hop reasoning that a plain context window cannot support.
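To make the calendar-plus-diary picture concrete, here is a minimal sketch of the two stores and a time-window filter. The `Event` fields, the `turn_id` back-pointer, the sample dialogue, and the `events_in_range` helper are all illustrative assumptions, not the paper's schema or API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One event-calendar entry: an SVO tuple with a datetime range."""
    subject: str
    verb: str
    obj: str
    start: datetime
    end: datetime
    turn_id: int  # hypothetical back-pointer into the turn calendar


def events_in_range(events, start, end):
    """Return events whose [start, end] range overlaps the query window."""
    return [e for e in events if e.start <= end and e.end >= start]


# Turn calendar: raw dialogue keyed by turn id (the "diary"). Sample data is invented.
turn_calendar = {
    0: "I just booked flights to Lisbon for May 3rd to May 10th.",
    1: "Started my new job at the bakery on May 13th.",
}

# Event calendar: structured, timestamped index over those turns.
event_calendar = [
    Event("user", "visit", "Lisbon",
          datetime.fromisoformat("2025-05-03T00:00:00"),
          datetime.fromisoformat("2025-05-10T23:59:59"), turn_id=0),
    Event("user", "start", "bakery job",
          datetime.fromisoformat("2025-05-13T00:00:00"),
          datetime.fromisoformat("2025-05-13T23:59:59"), turn_id=1),
]

# Filter on time first, then follow the pointer back to the full narrative.
hits = events_in_range(
    event_calendar,
    datetime.fromisoformat("2025-05-11T00:00:00"),
    datetime.fromisoformat("2025-05-17T23:59:59"),
)
for e in hits:
    print(e.subject, e.verb, e.obj, "->", turn_calendar[e.turn_id])
```

The point of the split is that the filter runs over small structured records, while the answer can still be grounded in the original wording of the matching turn.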

DIAGRAM

Chronos Query-Time Memory Retrieval Loop

Sequence of interactions between the user, Chronos Agent, and the dual calendars during query-time tool-calling.

DIAGRAM

Chronos Evaluation and Ablation Pipeline

Flow of evaluating Chronos configurations and ablations on the LongMemEvalS benchmark.

PROCESS

How Chronos Handles a Long-Term Memory Query

  1. Event Extraction

    Chronos runs the Event Extraction pipeline over conversation turns, producing subject-verb-object (SVO) tuples with ISO 8601 datetime ranges and lexical aliases, which it stores in the event calendar.

  2. Dynamic Prompting

    For each new question, Chronos uses Dynamic Prompting to analyze the query and generate retrieval guidance bullets describing targets and temporal constraints.

  3. Initial Retrieval

    Chronos performs Initial Retrieval over the turn calendar using dense search, Cohere Rerank v3, and context expansion around the top 15 turns.

  4. Chronos Agent

    The Chronos Agent runs a ReAct loop, calling vector and grep tools over the event calendar and turn calendar until it can answer with temporally grounded reasoning.
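Two of the steps above lend themselves to a small sketch: context expansion around the top-ranked turns, and a grep-style tool over the event calendar. This is an illustration under stated assumptions. The scores are made-up stand-ins for reranker output (no dense search or Cohere call is made here), and `expand_context`, `grep_events`, the window size, and the serialized event format are hypothetical.

```python
def expand_context(top_turn_ids, num_turns, window=1):
    """Return sorted turn ids covering each hit plus `window` neighbours on each side."""
    keep = set()
    for tid in top_turn_ids:
        lo = max(0, tid - window)
        hi = min(num_turns - 1, tid + window)
        keep.update(range(lo, hi + 1))
    return sorted(keep)


def grep_events(event_lines, pattern):
    """Case-insensitive substring match over serialized event lines."""
    needle = pattern.lower()
    return [line for line in event_lines if needle in line.lower()]


# Hypothetical reranker scores per turn id.
scores = {0: 0.12, 4: 0.91, 5: 0.40, 9: 0.77}
top_k = 2  # the summary describes the top 15 turns; 2 keeps the example small
top = sorted(scores, key=scores.get, reverse=True)[:top_k]
context_ids = expand_context(top, num_turns=12, window=1)
print(context_ids)

# Hypothetical serialized event-calendar lines with lexical aliases.
event_lines = [
    "2025-05-03/2025-05-10 | user | visit | Lisbon | aliases: trip, vacation",
    "2025-05-13/2025-05-13 | user | start | bakery job | aliases: work",
]
matches = grep_events(event_lines, "Vacation")
print(matches)
```

In an agent loop, calls like `grep_events` would be exposed as tools the model invokes repeatedly until it has enough evidence. The aliases matter because the user may say "vacation" where the extracted object is "Lisbon".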

KEY CONTRIBUTIONS

Key Contributions

  • Chronos Architecture

    Chronos introduces dual event calendar and turn calendar stores plus the Chronos Agent, achieving 92.60% and 95.60% accuracy on LongMemEvalS under Low and High configurations.

  • Dynamic Prompting for Memory

    Chronos extends Dynamic Prompting to long-term memory, generating per-question retrieval guidance instead of static query rewriting across temporal, preference, and aggregation tasks.

  • Structured Event Retrieval Gains

    Chronos ablations show the event calendar yields a 58.9% relative gain over the configuration without it (58.6% to 93.1% on the 116-question ablation subset), while other components each add between 15.5% and 22.3% accuracy improvements.
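As a rough illustration of per-question retrieval guidance, the sketch below only builds the prompt that would ask an LLM for guidance bullets. The template wording, the `build_guidance_prompt` name, and the `question_date` field are assumptions made for this example. The paper's actual prompt is not reproduced here, and no model is called.

```python
# Hypothetical template; the real Chronos prompt may differ substantially.
GUIDANCE_TEMPLATE = """You are preparing retrieval guidance for a long-term memory agent.
Question: {question}
Question date: {question_date}

Write 3 to 5 short bullets describing:
- which entities or events to look for
- any temporal constraints (dates, ranges, ordering)
- whether the answer needs counting or the most recent value
"""


def build_guidance_prompt(question: str, question_date: str) -> str:
    """Fill the template for one question; the result would be sent to an LLM."""
    return GUIDANCE_TEMPLATE.format(
        question=question.strip(),
        question_date=question_date,
    )


prompt = build_guidance_prompt(
    "  What did I do the week after my vacation?  ",
    "2025-06-01",
)
print(prompt)
```

Unlike static query rewriting, the guidance is generated anew for each question, so a counting question and a "most recent value" question would steer retrieval differently.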

RESULTS

By the Numbers

Overall

92.60%

+6.60 points (7.67% relative) over EmergenceMem Internal

Knowledge Update

96.15%

+12.82 over EmergenceMem Internal

Multi Session

91.73%

+10.53 over EmergenceMem Internal

Temporal Reasoning

90.23%

+4.52 over EmergenceMem Internal

Chronos is evaluated on the LongMemEvalS benchmark with 500 questions across six categories, testing knowledge updates, aggregation, and temporal reasoning. The 92.60% and 95.60% overall scores show Chronos handles long-horizon, time-grounded memory substantially better than EmergenceMem Internal and Mastra.

BENCHMARK

Comparison of Chronos Low with State-of-the-Art Systems on LongMemEvalS

Overall accuracy (%) on LongMemEvalS across practical conversational memory systems.

BENCHMARK

High-Configuration Accuracy on LongMemEvalS

Overall accuracy (%) for Chronos High and strong baselines under advanced LLM configurations.

KEY INSIGHT

The Counterintuitive Finding

Removing the event calendar cuts Chronos Low’s accuracy by more than a third, dropping it from 93.1% to 58.6% on the 116-question ablation subset.

This is surprising because many systems assume dense turn retrieval is sufficient, yet Chronos shows structured temporal events dominate gains while using relatively simple SVO tuples.
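The ablation numbers can be read three ways, and the short calculation below relates them. It uses only the two accuracies quoted above; the variable names are illustrative.

```python
full = 93.1      # Chronos Low accuracy on the ablation subset (%)
ablated = 58.6   # same configuration without the event calendar (%)

point_drop = round(full - ablated, 1)                       # absolute percentage points
relative_gain = round((full - ablated) / ablated * 100, 1)  # gain measured from the ablated score
relative_drop = round((full - ablated) / full * 100, 1)     # loss measured from the full score

print(point_drop, relative_gain, relative_drop)  # 34.5 58.9 37.1
```

So a 58.9% gain and a 34.5-point drop describe the same comparison. The first is measured relative to the ablated configuration, the second in absolute points.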

WHY IT MATTERS

What this unlocks for the field

Chronos unlocks reliable, time-aware conversational memory, letting agents answer questions like “What did I do the week after my vacation?” months later.
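A question like this reduces to resolving a relative expression into a concrete date range and then filtering events by it. The sketch below assumes "the week after" means the seven days starting the day after the vacation ends. That interpretation, the sample dates, and the `week_after` helper are illustrative assumptions.

```python
from datetime import date, timedelta


def week_after(event_end: date):
    """Return the inclusive 7-day window starting the day after `event_end`."""
    start = event_end + timedelta(days=1)
    return start, start + timedelta(days=6)


# Hypothetical events as (ISO 8601 date, description) pairs.
events = [
    ("2025-05-10", "flew home from Lisbon"),
    ("2025-05-13", "started bakery job"),
    ("2025-05-20", "dentist appointment"),
]

vacation_end = date.fromisoformat("2025-05-10")  # assumed end of the vacation event
start, end = week_after(vacation_end)
in_window = [
    desc for day, desc in events
    if start <= date.fromisoformat(day) <= end
]
print(start.isoformat(), end.isoformat(), in_window)
```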

Builders can now design assistants that track evolving preferences, knowledge updates, and cross-session event counts without heavyweight knowledge graphs or full-context replay.
