Improving Factuality with Explicit Working Memory

Authors: Mingda Chen, Yang Li, Karthik Padthe et al.

2024

TL;DR

Ewe augments generation with an explicit working memory of KV-cache units, refreshed with real-time retrieval and fact-checking feedback, raising VeriScore F1 by up to 12.6 points over Llama-3.1 70B on Biography.



THE PROBLEM

Long-form fact-seeking answers still hallucinate despite RAG

LLMs still generate factually inaccurate content in long-form answers, even with RAG; Ewe targets hallucinations in knowledge-intensive generation.

On four fact-seeking datasets, Llama-3.1 70B with retrieval augmentation only reaches 37.1–72.7 VeriScore F1, leaving many unsupported claims and unreliable responses.
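VeriScore F1 balances the precision of a response's extracted claims against a length-capped recall. The sketch below assumes the supported-claim counts are already available; in the real metric an LLM extracts verifiable claims and checks them against retrieved evidence, and the function name and the recall cap `k` here are illustrative.

```python
def veriscore_f1(num_supported: int, num_claims: int, k: int) -> float:
    """VeriScore-style F1 over extracted claims (illustrative sketch).

    precision = supported claims / extracted claims
    recall    = supported claims / k, capped at 1.0, where k is a
                per-dataset target number of claims.
    """
    if num_claims == 0:
        return 0.0
    precision = num_supported / num_claims
    recall = min(num_supported / k, 1.0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a response with 10 extracted claims, 8 of them supported, scores 0.8 F1 when the target claim count is also 10.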

HOW IT WORKS

Ewe — Explicit Working Memory with real-time feedback

Ewe’s core mechanism is an explicit working memory: fact-checking outcomes and relevant retrieved knowledge are stored as KV-cache memory units and refreshed continuously during generation.

You can think of Ewe like CPU RAM that is continuously refreshed: retrieval and fact-checkers stream in new “pages” while old, less relevant ones are evicted via FIFO.

This explicit working memory lets Ewe correct sentences mid-generation and reuse past KV caches, enabling dynamic factual updates that a fixed context window cannot provide.
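The RAM analogy above can be sketched as a bounded FIFO buffer. Each "unit" below is an opaque object standing in for a passage's precomputed KV cache; the class and method names are illustrative, not Ewe's actual API.

```python
from collections import deque


class WorkingMemory:
    """Minimal sketch of a FIFO working memory of KV-cache units.

    New units from retrieval and fact-checking are appended; when the
    buffer is full, the oldest unit is evicted automatically.
    """

    def __init__(self, capacity: int):
        self.units = deque(maxlen=capacity)  # FIFO eviction via maxlen

    def refresh(self, new_units):
        """Insert freshly retrieved or fact-checked units."""
        for unit in new_units:
            self.units.append(unit)

    def context(self):
        """Return the current memory units, oldest first."""
        return list(self.units)
```

With capacity 3, refreshing with four units leaves only the three newest in memory, mirroring the "old pages evicted" behavior described above.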

DIAGRAM

Sentence-level feedback loop in Ewe

This diagram shows how Ewe pauses at sentence boundaries to retrieve passages, run VeriScore-style fact-checking, and then refresh working memory before regenerating.

DIAGRAM

Evaluation pipeline and ablations for Ewe

This diagram shows how Ewe is evaluated on four datasets, with VeriScore and AlpacaEval plus memory configuration ablations.

PROCESS

How Ewe handles fact-seeking long-form generation

  1. Real-time Feedback

    Ewe periodically pauses decoding when a new sentence is generated and triggers real-time feedback from retrieval and fact-checking modules to assess factuality.

  2. Fact-checking Outcomes

    Using VeriScore-style claim extraction and verification, Ewe collects fact-checking outcomes that refute inaccurate claims and provide corrected factual statements.

  3. Relevant Knowledge

    Ewe queries Contriever over C4 and Wikipedia to gather relevant knowledge passages whose retrieval scores exceed a threshold before encoding them into memory units.

  4. Refreshing Working Memories

    Through refreshing working memories, Ewe encodes new feedback as KV caches, updates FIFO memory units, deletes incorrect sentences, and regenerates with improved factual context.
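The four steps above can be sketched as one sentence-level loop. Everything here is a hedged reconstruction, not the authors' code: `model.next_sentence`, `retriever.search`, and `fact_checker.verify` are hypothetical interfaces standing in for Ewe's generator, Contriever retrieval, and VeriScore-style verification.

```python
def generate_with_ewe(prompt, model, retriever, fact_checker, memory,
                      max_sentences=20):
    """Sketch of Ewe's pause-verify-refresh loop (illustrative interfaces)."""
    answer = []
    for _ in range(max_sentences):
        # Decode one sentence conditioned on the current working memory.
        sentence = model.next_sentence(prompt, answer, memory.context())
        if sentence is None:  # generation finished
            break
        # 1. Real-time feedback: retrieve passages relevant to the sentence.
        passages = retriever.search(sentence)
        # 2. Fact-checking outcomes: verify the sentence's claims.
        verdict = fact_checker.verify(sentence, passages)
        # 3 + 4. Refresh working memory with new knowledge and corrections.
        memory.refresh(passages + verdict.corrections)
        if not verdict.supported:
            continue  # delete the incorrect sentence and regenerate
        answer.append(sentence)
    return " ".join(answer)
```

The key design point is that rejected sentences are dropped before they enter the answer, while their fact-checking outcomes still enter memory, so the regeneration attempt is conditioned on the correction.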

KEY CONTRIBUTIONS

Key Contributions

  • Explicit Working Memory for factuality

    Ewe introduces an explicit working memory of KV-cache units that store fact-checking outcomes and relevant knowledge, enabling mid-generation corrections without reprocessing all context.

  • Real-time feedback integration

    Ewe integrates real-time feedback from VeriScore-based fact-checkers and Contriever retrieval, refreshing memories at Tr = 1 and Tv = 8 timesteps during decoding.

  • Improved VeriScore on four datasets

    Ewe improves VeriScore F1 by 2 to 6 points absolute over strong baselines, reaching 75.9 on LongFact and 49.7 on Biography with Llama-3.1 70B.
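The refresh intervals mentioned above (Tr = 1 for retrieval, Tv = 8 for verification) can be read as a simple periodic schedule over decoding timesteps. The function below is an illustrative interface, not Ewe's actual API.

```python
def due_for_feedback(step: int, t_r: int = 1, t_v: int = 8):
    """Which feedback modules fire at a decoding timestep (1-indexed).

    Defaults mirror the paper's reported Tr = 1 (retrieval refresh every
    step) and Tv = 8 (verification every 8 steps).
    """
    retrieve = step % t_r == 0
    verify = step % t_v == 0
    return retrieve, verify
```

With these defaults, retrieval-based memory refresh happens at every step, while the costlier fact-checking pass fires only on every eighth step.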

RESULTS

By the Numbers

VeriScore F1 LongFact

75.9

+3.2 over Llama-3.1 70B +RA

VeriScore F1 Biography

49.7

+12.6 over Llama-3.1 70B

AlpacaEval WR LongFact

50.1

+8.9 over Llama-3.1 70B +RA

Cohen's Kappa (VeriScore vs. human judgments)

0.65

+0.04 over Retrieval Augmentation

On LongFact, Fava, AlpacaFact, and Biography, which test fact-seeking long-form generation, Ewe raises VeriScore F1 by 2–6 points absolute and improves agreement with human factuality judgments while keeping AlpacaEval win rates roughly on par with Llama-3.1 70B.


BENCHMARK

Table 1: VeriScore F1 on Biography with Llama-3.1 70B

VeriScore F1 on Biography comparing Ewe against Llama-3.1 70B and three retrieval-based baselines.

KEY INSIGHT

The Counterintuitive Finding

Increasing the number of working memory units actually reduces VeriScore precision, even though more memory should intuitively help factuality.

This is surprising because developers often assume larger memories are always better, but Ewe shows that stale information retained across many units can hurt factual precision even though recall stays unchanged.

WHY IT MATTERS

What this unlocks for the field

Ewe unlocks dynamic, sentence-level factual correction in long-form answers by combining explicit working memory with real-time retrieval and fact-checking.

Builders can now design agents that stream answers, pause to verify specific claims, surgically regenerate only incorrect spans, and reuse KV caches for efficiency instead of re-running full RAG pipelines.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
