LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Authors: Di Wu, Hongwei Wang, Wenhao Yu et al.

ICLR 2025

TL;DR

LongMemEval pairs a benchmark of long-term memory questions with a three-stage indexing–retrieval–reading memory framework, using fact-augmented keys and time-aware queries, and exposes 30–60% accuracy drops in long-context LLMs on 115k-token chat histories.

THE PROBLEM

Chat assistants forget across 115k-token histories and lose 30–60% of their accuracy

LongMemEval shows that long-context LLMs drop 30–60% in accuracy on LongMemEval_S when reading full 115k-token histories instead of only the evidence sessions.

This means commercial chat assistants and long-context LLMs fail at personalized QA, breaking tasks like counseling, planning, and secretarial support over sustained interactions.

HOW IT WORKS

LongMemEval — indexing, retrieval, and reading over key, value, and query control points

LongMemEval structures long-term memory into three stages (indexing, retrieval, and reading), controlled by four design points: value, key, query, and reading strategy.

You can think of LongMemEval like a library: indexing is cataloging books, retrieval is searching the catalog, and reading is a researcher skimming and synthesizing notes.

By decomposing sessions into rounds, expanding keys with user facts, and applying time-aware query expansion, LongMemEval enables behaviors that a plain context window cannot, especially temporal reasoning and multi-session aggregation.
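The four control points can be pictured as a small configuration object. This is a minimal Python sketch; the field names and option strings are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass

# Illustrative configuration of the four control points described above.
# Field names and option strings are assumptions, not the paper's API.
@dataclass
class MemoryDesign:
    value_granularity: str   # what gets stored: "session" or "round"
    key_expansion: str       # index keys: "raw", "summary", or "user_facts"
    query_expansion: str     # search queries: "none" or "time_aware"
    reading_strategy: str    # reader prompting: "plain" or "chain_of_note"

# One strong combination per the summary above, expressed in this sketch:
design = MemoryDesign(
    value_granularity="round",
    key_expansion="user_facts",
    query_expansion="time_aware",
    reading_strategy="chain_of_note",
)
```

A plain long-context window corresponds to doing none of this: one giant value, no keys, no query expansion, raw reading.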

DIAGRAM

Interactive memory use across sessions in LongMemEval

This diagram shows how LongMemEval feeds multi-session chat histories into a memory-augmented assistant during online evaluation.

DIAGRAM

LongMemEval data and evaluation pipeline

This diagram shows how LongMemEval constructs attribute-based questions, evidence sessions, and long chat histories for evaluation.

PROCESS

How LongMemEval Handles a Long-Term Interactive Memory Evaluation

  1. Indexing

    LongMemEval runs indexing by converting each timestamped chat session into key-value items, at session or round granularity, with optional summaries or user facts as keys.

  2. Retrieval

    LongMemEval performs retrieval with dense encoders over the keys, optionally using fact-augmented key expansion and time-aware query expansion to narrow the search space.

  3. Reading

    LongMemEval invokes a reader LLM such as GPT-4o or Llama 3.1, with Chain-of-Note prompting and a structured JSON format, to extract notes and reason over the retrieved items.

  4. Question Answering

    LongMemEval evaluates question answering by scoring assistant responses with a GPT-4o evaluator, measuring accuracy across information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
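The indexing, retrieval, and reading steps above can be sketched end to end. This is a toy illustration under stated assumptions, not the paper's implementation: the bag-of-words `embed` stands in for a dense encoder, the hand-written facts stand in for LLM-extracted user facts, and the printed items are what a reader LLM such as GPT-4o would receive:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "encoder"; a real system uses a dense retriever.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1) Indexing: each timestamped round becomes a (key, value) item,
#    with the key expanded by an extracted user fact (hand-written here).
history = [
    ("2023-05-01", "I adopted a beagle last week; his name is Rex.",
     "user has a dog named rex"),
    ("2023-06-10", "Can you explain Python decorators?",
     "user is learning python"),
    ("2023-07-02", "Rex just finished obedience training.",
     "user's dog rex completed training"),
]
index = [{"key": embed(text + " " + fact), "value": text, "time": ts}
         for ts, text, fact in history]

# 2) Retrieval: rank the stored keys against the query embedding.
query = "what is the name of the user's dog"
ranked = sorted(index, key=lambda it: cosine(embed(query), it["key"]),
                reverse=True)
top = ranked[:2]

# 3) Reading: a reader LLM would take notes over `top` and answer;
#    here we only show what it would be given.
for item in top:
    print(item["time"], item["value"])
```

Because the keys were expanded with user facts, the two Rex-related rounds outrank the unrelated decorators round even though the question shares few words with the raw messages.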

KEY CONTRIBUTIONS

  • LongMemEval benchmark design

    LongMemEval introduces 500 manually curated questions over about 50k sessions, covering information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention, with freely extensible histories of up to 1.5M tokens.

  • Unified memory framework

    LongMemEval formulates long-term memory as three stages (indexing, retrieval, and reading) with four control points over value, key, query, and reading strategy, a design space that subsumes nine existing memory-augmented systems.

  • Memory design optimizations

    LongMemEval proposes session decomposition into rounds, fact-augmented key expansion, and time-aware query expansion, improving recall@5 by up to 9.4% and boosting downstream accuracy by 5.4 absolute points.
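Of these optimizations, time-aware query expansion is the easiest to illustrate: infer the time window a question refers to, then restrict retrieval to sessions inside it. In this sketch, `time_aware_filter` and the hard-coded window are hypothetical stand-ins for what an LLM would extract from the question:

```python
from datetime import date

# Toy timestamped session store.
sessions = [
    {"time": date(2023, 5, 1), "text": "booked a flight to Tokyo"},
    {"time": date(2023, 6, 10), "text": "asked about Python decorators"},
    {"time": date(2023, 7, 2), "text": "planned a hiking trip"},
]

def time_aware_filter(sessions, start, end):
    """Keep only sessions inside the window the question refers to."""
    return [s for s in sessions if start <= s["time"] <= end]

# For "What did I plan in early July?", an LLM would infer this window;
# here it is hard-coded for illustration.
window = (date(2023, 7, 1), date(2023, 7, 31))
candidates = time_aware_filter(sessions, *window)
print([s["text"] for s in candidates])  # ['planned a hiking trip']
```

Narrowing the candidate set this way is what lifts recall@5 on temporal-reasoning questions, where lexically similar but wrongly dated sessions would otherwise crowd out the evidence.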

RESULTS

By the Numbers

Accuracy

0.9184

Offline Reading GPT-4o, +0.3411 over ChatGPT's memory mode (0.5773) in the commercial assistant comparison

Accuracy

0.5773

ChatGPT memory mode, vs 0.9184 for Offline Reading GPT-4o

Accuracy

0.3299

Coze GPT-4o, a 64% relative drop vs 0.9184 for Offline Reading GPT-4o

Accuracy

0.6400

GPT-4o on LongMemEval_S with Chain-of-Note, vs 0.9240 with oracle evidence only

LongMemEval reports these metrics on LongMemEval_S and a 97-question subset for commercial systems, testing long-term interactive memory under realistic multi-session histories. The main result: long-context LLMs and commercial assistants lose 30–60% accuracy when forced to read full 115k-token histories instead of oracle evidence, showing that naive long context is insufficient for robust memory.
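The relative drops quoted in these cards follow directly from the reported scores. A quick sanity check, assuming each drop is measured relative to the oracle or offline-reading score:

```python
def relative_drop(oracle, full):
    """Relative accuracy loss going from oracle evidence to full history."""
    return (oracle - full) / oracle

# Coze GPT-4o vs Offline Reading GPT-4o on the commercial subset:
print(round(relative_drop(0.9184, 0.3299), 2))  # 0.64
# GPT-4o with Chain-of-Note on LongMemEval_S vs oracle evidence sessions:
print(round(relative_drop(0.9240, 0.6400), 2))  # 0.31
```

These reproduce the "64% drop" in the Coze card and the roughly 30% loss highlighted in the key-insight section below.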

BENCHMARK

Pilot study of commercial memory-augmented chat assistants on LongMemEval

Accuracy on LongMemEval-style questions with short multi-session histories and offline reading.

BENCHMARK

Long-context LLM performance on LongMemEval_S with and without Chain-of-Note

Accuracy on LongMemEval_S under full-history reading vs oracle evidence sessions.

KEY INSIGHT

The Counterintuitive Finding

LongMemEval shows that GPT-4o drops from 0.924 accuracy with oracle evidence sessions to 0.640 on LongMemEval_S when reading full 115k-token histories.

This is surprising because long-context LLMs are marketed as solving context length issues, yet LongMemEval reveals they still lose around 30% accuracy purely from context sprawl.

WHY IT MATTERS

What this unlocks for the field

LongMemEval unlocks a realistic way to stress-test long-term interactive memory, including temporal reasoning, knowledge updates, and abstention, over histories up to 1.5M tokens.

Builders can now benchmark and iterate memory systems with explicit control over value granularity, fact-augmented keys, and time-aware queries, rather than relying on opaque long-context behavior.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.