Evaluating Long-Term Memory for Long-Context Question Answering

Authors: Alessandra Terranova, Björn Ross, Alexandra Birch

2025

TL;DR

The paper shows that combining RAG-based semantic memory with episodic in-context reflections cuts prompt tokens by over 90% while matching or beating Full Context F1 on LoCoMo.



THE PROBLEM

Long-context agents waste 23k tokens per query and still fail adversarial QA

The paper shows that Full Context prompting consumes about 23,000 tokens per query on LoCoMo while still missing adversarial questions.

In long conversational QA, this bloated context makes LLM agents inefficient, vulnerable to the lost-in-the-middle effect, and unreliable on multi-hop and adversarial questions.

HOW IT WORKS

Semantic, episodic, and procedural memory under one evaluation framework

The paper compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem as interchangeable memory modules for long-context QA.

You can think of RAG and A-Mem as a shared knowledge disk, episodic reflections as a diary of past mistakes, and prompt optimization as updating the agent’s procedural manual.

This unified setup lets the authors test which memory mix best replaces a huge context window while preserving reasoning over temporal, multi-hop, and adversarial questions.
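To make that framing concrete, here is a minimal runnable sketch of the three memory types as interchangeable modules. Every name here, and the word-overlap scoring that stands in for embedding retrieval, is our illustrative assumption rather than the authors' code:

```python
# Illustrative sketch only: class names, build_prompt, and the word-overlap
# scoring are assumptions standing in for the paper's embedding-based stack.

def score(query: str, text: str) -> int:
    """Toy relevance score: word overlap stands in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))

class SemanticMemory:
    """'Shared knowledge disk': conversation utterances retrieved per query (RAG / A-Mem)."""
    def __init__(self, utterances):
        self.utterances = utterances

    def retrieve(self, question, k=10):
        return sorted(self.utterances, key=lambda u: -score(question, u))[:k]

class EpisodicMemory:
    """'Diary of past mistakes': (question, prediction, label, reflection) tuples."""
    def __init__(self):
        self.episodes = []

    def retrieve(self, question, k=3):
        return sorted(self.episodes, key=lambda e: -score(question, e["question"]))[:k]

class ProceduralMemory:
    """'Procedural manual': QA instructions that PromptOpt periodically rewrites."""
    instructions = "Answer from the retrieved context; say so if the answer is absent."

def build_prompt(question, sem, ep, proc):
    context = "\n".join(sem.retrieve(question))
    examples = "\n".join(f"Q: {e['question']} -> {e['reflection']}" for e in ep.retrieve(question))
    return f"{proc.instructions}\n\n{examples}\n\nContext:\n{context}\n\nQ: {question}\nA:"
```

Swapping one module in or out reproduces the paper's configurations: RAG alone uses only SemanticMemory, RAG+EpMem adds EpisodicMemory, and RAG+PromptOpt rewrites ProceduralMemory.instructions.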

DIAGRAM

Episodic memory through in-context learning for QA

This diagram shows how the paper builds and uses EpMem by reflecting on QA performance and retrieving similar past experiences as in-context examples.
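A hedged sketch of that write/read cycle, with a plain Python `reflect` stand-in where the paper uses LLM-generated reflections and embedding search:

```python
# Sketch of the EpMem loop: store a reflection after each answered question,
# retrieve similar past experiences before the next one. `reflect` is a
# stand-in for an LLM-generated reflection.

episodic_store = []

def reflect(question, prediction, label):
    verdict = "correct" if prediction.strip().lower() == label.strip().lower() else "wrong"
    return f"Prediction was {verdict}; check the retrieved evidence against '{label}'."

def write_episode(question, prediction, label):
    episodic_store.append({
        "question": question,
        "prediction": prediction,
        "label": label,
        "reflection": reflect(question, prediction, label),
    })

def read_episodes(question, k=3):
    def overlap(e):
        return len(set(question.lower().split()) & set(e["question"].lower().split()))
    return sorted(episodic_store, key=overlap, reverse=True)[:k]
```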

DIAGRAM

Evaluation pipeline across LoCoMo and QMSum

This diagram shows how each memory strategy is run on LoCoMo and QMSum, scored with F1 and token counts, and summarised with average rankings.
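The ranking step is simple enough to spell out: within each reasoning category, strategies are ranked by F1 (1 = best) and the ranks are averaged per strategy. A sketch with made-up numbers:

```python
# Sketch of the ranking computation the diagram implies: rank strategies by F1
# within each category, then average the ranks per strategy. The F1 values
# below are invented purely for illustration.

f1_by_category = {
    "temporal":    {"FullContext": 40.0, "RAG": 35.0, "RAG+EpMem": 42.0},
    "multi-hop":   {"FullContext": 30.0, "RAG": 28.0, "RAG+EpMem": 33.0},
    "adversarial": {"FullContext": 70.0, "RAG": 84.0, "RAG+EpMem": 78.0},
}

def average_ranks(scores):
    totals = {}
    for per_strategy in scores.values():
        ordered = sorted(per_strategy, key=per_strategy.get, reverse=True)
        for rank, strategy in enumerate(ordered, start=1):
            totals.setdefault(strategy, []).append(rank)
    return {s: sum(r) / len(r) for s, r in totals.items()}

print(average_ranks(f1_by_category))
# e.g. {'RAG+EpMem': 1.33..., 'FullContext': 2.33..., 'RAG': 2.33...}
```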

PROCESS

How the Evaluated Pipeline Handles a LoCoMo QA Query

  1. Dataset and evaluation

    The pipeline takes a LoCoMo question, its long conversation, and its reasoning-category label, then scores each answer by F1 and token count.

  2. RAG semantic memory

    It embeds the conversation with BGE-M3 and retrieves the top-10 utterances, with their timestamps, as semantic memory.

  3. EpMem episodic memory

    It retrieves the top-3 past (question, prediction, label, reflection) tuples from episodic memory as in-context examples; see the retrieval sketch after this list.

  4. PromptOpt procedural memory

    It optionally rewrites the QA prompt using LangMem-style classification and optimisation over batches of 5 examples.
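The sketch below strings steps 02 and 03 together: BGE-M3 embeddings rank utterances, the top-10 become semantic context, the top-3 episodic examples become in-context demonstrations, and the assembled prompt goes to a chat model. Loading BGE-M3 through sentence-transformers, the `top_k` helper, and the prompt layout are our assumptions; the paper does not publish this exact code.

```python
# Hedged sketch of steps 02-03: BGE-M3 retrieval for semantic memory plus
# episodic in-context examples. The retrieval stack and prompt layout here
# are assumptions on our part.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def top_k(query, texts, k):
    """Rank texts by cosine similarity of normalized embeddings, best first."""
    emb = model.encode([query] + texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]
    return [texts[i] for i in np.argsort(-sims)[:k]]

def answer_query(question, utterances, episodic_examples, llm):
    context = top_k(question, utterances, k=10)         # step 02: semantic memory
    examples = top_k(question, episodic_examples, k=3)  # step 03: episodic memory
    prompt = (
        "Use the conversation excerpts and past experiences to answer.\n\n"
        "Past experiences:\n" + "\n".join(examples) +
        "\n\nConversation (with timestamps):\n" + "\n".join(context) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)  # any chat-completion callable, e.g. GPT-4o mini
```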

KEY CONTRIBUTIONS

Key Contributions

  • Systematic evaluation of memory strategies

    The paper compares Full Context, RAG, A-Mem, RAG+PromptOpt, RAG+EpMem, and RAG+PromptOpt+EpMem across five LoCoMo reasoning categories.

  • Token-efficient long-context QA

    RAG-based memory reduces prompt tokens by over 90%, from about 23,000 to 600–1,500 per query, while keeping F1 competitive.

  • Episodic memory for metacognition

    RAG+EpMem reaches average F1 rankings of 1.83 on Llama 3.2-3B Instruct and 1.80 on GPT-4o mini while helping models recognise the limits of their knowledge.

RESULTS

By the Numbers

Average F1 ranking (RAG+EpMem, Llama 3.2-3B Instruct)

1.83

2.67 ranks better than RAG (4.50)

Average tokens per query (RAG+EpMem, GPT-4o mini)

969.26

22,163.23 fewer tokens than Full Context (23,132.49)

Adversarial F1 (RAG+EpMem, GPT-4o mini)

77.64

6.64 F1 below RAG (84.28), while Multi-Hop improves by +8.98

QMSum F1 (Full Context, GPT-4o mini)

27.70

+7.67 F1 over RAG (20.03)

The paper reports F1 and token usage on LoCoMo and QMSum, showing that RAG+EpMem matches or beats Full Context rankings with over 90% fewer tokens. This suggests that memory-augmented QA can replace massive context windows for long conversational reasoning.
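For reference, the F1 used throughout is token-overlap QA F1. A minimal implementation, assuming normalisation is reduced to lowercasing and whitespace splitting (the paper may apply fuller SQuAD-style normalisation):

```python
# Minimal token-overlap QA F1, the metric reported above. Normalisation here
# is just lowercasing and whitespace splitting; that detail is our assumption.
from collections import Counter

def qa_f1(prediction, label):
    pred, gold = prediction.lower().split(), label.lower().split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(qa_f1("Caroline moved in May 2023", "May 2023"), 2))  # 0.57
```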


BENCHMARK

Performance comparison of memory-augmentation approaches across instruction-tuned language models on LoCoMo

Average F1 ranking across categories for Llama 3.2-3B Instruct on LoCoMo.

KEY INSIGHT

The Counterintuitive Finding

The paper finds that RAG+EpMem uses around 969 tokens per query on GPT-4o mini, versus about 23,132 for Full Context.

Despite this roughly 24x reduction (23,132 / 969 ≈ 23.9), RAG+EpMem achieves a better average F1 ranking, 1.80 versus 3.40 for Full Context, contradicting the assumption that more context always helps.

WHY IT MATTERS

What this unlocks for the field

The paper shows that small open-weight models can handle 9,000-token conversations using compact semantic and episodic memories.

Builders can deploy long-term conversational agents over LoCoMo-scale histories on resource-constrained hardware without paying for a 20,000-token prompt on every question.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LoCoMo, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
