LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Authors: Di Wu, Hongwei Wang, Wenhao Yu et al.

ICLR 2025

TL;DR

LongMemEval pairs a benchmark of long-term memory questions with a three-stage indexing–retrieval–reading memory framework, using fact-augmented keys and time-aware queries, and exposes 30–60% accuracy drops in long-context LLMs on 115k-token chat histories.

THE PROBLEM

Chat assistants forget across 115k-token histories and lose 30–60% of their accuracy

LongMemEval shows that long-context LLMs drop 30–60% in accuracy on LongMemEval_S when reading full 115k-token histories instead of only the evidence sessions.

This means commercial chat assistants and long-context LLMs fail at personalized QA, breaking tasks like counseling, planning, and secretarial support over sustained interactions.

HOW IT WORKS

LongMemEval — indexing, retrieval, and reading over key, value, and query control points

LongMemEval structures long-term memory into three stages (indexing, retrieval, and reading), controlled by four design points: value, key, query, and reading strategy.

You can think of LongMemEval like a library: indexing is cataloging books, retrieval is searching the catalog, and reading is a researcher skimming and synthesizing notes.

By decomposing sessions into rounds, expanding keys with user facts, and applying time-aware query expansion, LongMemEval enables behaviors that a plain context window cannot, especially temporal reasoning and multi-session aggregation.
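The four control points can be pictured as a small configuration object. This is a minimal Python sketch; the field names and option strings are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass

# Illustrative configuration of the four control points described above.
# Field names and option strings are assumptions, not the paper's API.
@dataclass
class MemoryDesign:
    value_granularity: str   # what gets stored: "session" or "round"
    key_expansion: str       # index keys: "raw", "summary", or "user_facts"
    query_expansion: str     # search queries: "none" or "time_aware"
    reading_strategy: str    # reader prompting: "plain" or "chain_of_note"

# One strong combination per the summary above, expressed in this sketch:
design = MemoryDesign(
    value_granularity="round",
    key_expansion="user_facts",
    query_expansion="time_aware",
    reading_strategy="chain_of_note",
)
```

A plain long-context window corresponds to doing none of this: one giant value, no keys, no query expansion, raw reading.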

DIAGRAM

Interactive memory use across sessions in LongMemEval

This diagram shows how LongMemEval feeds multi-session chat histories into a memory-augmented assistant during online evaluation.

DIAGRAM

LongMemEval data and evaluation pipeline

This diagram shows how LongMemEval constructs attribute-based questions, evidence sessions, and long chat histories for evaluation.

PROCESS

How LongMemEval Handles a Long-Term Interactive Memory Evaluation

  1. Indexing

    LongMemEval runs indexing by converting each timestamped chat session into key-value items, at session or round granularity, with optional summaries or user facts as keys.

  2. Retrieval

    LongMemEval performs retrieval with dense encoders over the keys, optionally using fact-augmented key expansion and time-aware query expansion to narrow the search space.

  3. Reading

    LongMemEval invokes a reader LLM such as GPT-4o or Llama 3.1, with Chain-of-Note prompting and a structured JSON format, to extract notes and reason over the retrieved items.

  4. Question Answering

    LongMemEval evaluates question answering by scoring assistant responses with a GPT-4o evaluator, measuring accuracy across information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
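The indexing, retrieval, and reading steps above can be sketched end to end. This is a toy illustration under stated assumptions, not the paper's implementation: the bag-of-words `embed` stands in for a dense encoder, the hand-written facts stand in for LLM-extracted user facts, and the printed items are what a reader LLM such as GPT-4o would receive:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "encoder"; a real system uses a dense retriever.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1) Indexing: each timestamped round becomes a (key, value) item,
#    with the key expanded by an extracted user fact (hand-written here).
history = [
    ("2023-05-01", "I adopted a beagle last week; his name is Rex.",
     "user has a dog named rex"),
    ("2023-06-10", "Can you explain Python decorators?",
     "user is learning python"),
    ("2023-07-02", "Rex just finished obedience training.",
     "user's dog rex completed training"),
]
index = [{"key": embed(text + " " + fact), "value": text, "time": ts}
         for ts, text, fact in history]

# 2) Retrieval: rank the stored keys against the query embedding.
query = "what is the name of the user's dog"
ranked = sorted(index, key=lambda it: cosine(embed(query), it["key"]),
                reverse=True)
top = ranked[:2]

# 3) Reading: a reader LLM would take notes over `top` and answer;
#    here we only show what it would be given.
for item in top:
    print(item["time"], item["value"])
```

Because the keys were expanded with user facts, the two Rex-related rounds outrank the unrelated decorators round even though the question shares few words with the raw messages.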

KEY CONTRIBUTIONS

  • LongMemEval benchmark design

    LongMemEval introduces 500 manually curated questions over about 50k sessions, covering information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention, with freely extensible histories of up to 1.5M tokens.

  • Unified memory framework

    LongMemEval formulates long-term memory as three stages (indexing, retrieval, and reading) with four control points over value, key, query, and reading strategy, a design space that subsumes nine existing memory-augmented systems.

  • Memory design optimizations

    LongMemEval proposes session decomposition into rounds, fact-augmented key expansion, and time-aware query expansion, improving recall@5 by up to 9.4% and boosting downstream accuracy by 5.4 absolute points.
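Of these optimizations, time-aware query expansion is the easiest to illustrate: infer the time window a question refers to, then restrict retrieval to sessions inside it. In this sketch, `time_aware_filter` and the hard-coded window are hypothetical stand-ins for what an LLM would extract from the question:

```python
from datetime import date

# Toy timestamped session store.
sessions = [
    {"time": date(2023, 5, 1), "text": "booked a flight to Tokyo"},
    {"time": date(2023, 6, 10), "text": "asked about Python decorators"},
    {"time": date(2023, 7, 2), "text": "planned a hiking trip"},
]

def time_aware_filter(sessions, start, end):
    """Keep only sessions inside the window the question refers to."""
    return [s for s in sessions if start <= s["time"] <= end]

# For "What did I plan in early July?", an LLM would infer this window;
# here it is hard-coded for illustration.
window = (date(2023, 7, 1), date(2023, 7, 31))
candidates = time_aware_filter(sessions, *window)
print([s["text"] for s in candidates])  # ['planned a hiking trip']
```

Narrowing the candidate set this way is what lifts recall@5 on temporal-reasoning questions, where lexically similar but wrongly dated sessions would otherwise crowd out the evidence.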

RESULTS

By the Numbers

Accuracy

0.9184

Offline Reading GPT-4o, +0.3411 over ChatGPT's memory mode (0.5773) in the commercial assistant comparison

Accuracy

0.5773

ChatGPT memory mode, vs 0.9184 for Offline Reading GPT-4o

Accuracy

0.3299

Coze GPT-4o, a 64% relative drop vs 0.9184 for Offline Reading GPT-4o

Accuracy

0.6400

GPT-4o on LongMemEval_S with Chain-of-Note, vs 0.9240 with oracle evidence only

LongMemEval reports these metrics on LongMemEval_S and a 97-question subset for commercial systems, testing long-term interactive memory under realistic multi-session histories. The main result: long-context LLMs and commercial assistants lose 30–60% accuracy when forced to read full 115k-token histories instead of oracle evidence, showing that naive long context is insufficient for robust memory.
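The relative drops quoted in these cards follow directly from the reported scores. A quick sanity check, assuming each drop is measured relative to the oracle or offline-reading score:

```python
def relative_drop(oracle, full):
    """Relative accuracy loss going from oracle evidence to full history."""
    return (oracle - full) / oracle

# Coze GPT-4o vs Offline Reading GPT-4o on the commercial subset:
print(round(relative_drop(0.9184, 0.3299), 2))  # 0.64
# GPT-4o with Chain-of-Note on LongMemEval_S vs oracle evidence sessions:
print(round(relative_drop(0.9240, 0.6400), 2))  # 0.31
```

These reproduce the "64% drop" in the Coze card and the roughly 30% loss highlighted in the key-insight section below.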

BENCHMARK

Pilot study of commercial memory-augmented chat assistants on LongMemEval

Accuracy on LongMemEval-style questions with short multi-session histories and offline reading.

BENCHMARK

Long-context LLM performance on LongMemEval_S with and without Chain-of-Note

Accuracy on LongMemEval_S under full-history reading vs oracle evidence sessions.

KEY INSIGHT

The Counterintuitive Finding

LongMemEval shows that GPT-4o drops from 0.924 accuracy with oracle evidence sessions to 0.640 on LongMemEval_S when reading full 115k-token histories.

This is surprising because long-context LLMs are marketed as solving context length issues, yet LongMemEval reveals they still lose around 30% accuracy purely from context sprawl.

WHY IT MATTERS

What this unlocks for the field

LongMemEval unlocks a realistic way to stress-test long-term interactive memory, including temporal reasoning, knowledge updates, and abstention, over histories up to 1.5M tokens.

Builders can now benchmark and iterate memory systems with explicit control over value granularity, fact-augmented keys, and time-aware queries, rather than relying on opaque long-context behavior.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.