Large Language Models with Controllable Working Memory

Authors: Daliang Li, Ankit Singh Rawat, Manzil Zaheer et al.

2022

TL;DR

Knowledge Aware Finetuning (KAFT) injects counterfactual and irrelevant contexts during finetuning so that an LLM's working memory follows the priority context > parametric knowledge > noise, yielding up to 24× controllability and 6× robustness gains.


THE PROBLEM

LLMs Ignore Context And Follow Wrong World Knowledge

KAFT targets the failure mode where large T5 and PaLM models "tend to ignore a context when it contradicts with the model’s world knowledge", a problem that "becomes more severe" as model size grows.

When QA systems built on models like T5 XXL and PaLM 540B rely on outdated parametric facts, downstream applications cannot reliably correct answers via context, and retrieval noise can further mislead predictions.

HOW IT WORKS

Knowledge Aware Finetuning (KAFT) — Controllable Working Memory via Context Priority

KAFT’s core mechanism is to mix four kinds of training context, relevant, counterfactual, irrelevant, and empty, and to label the irrelevant and empty slices with the pretrained model’s answer, enforcing the priority relevant context > model’s pretrained world knowledge > irrelevant context.

You can think of KAFT as training the LLM’s activations like a working RAM buffer that must obey a priority schedule, while the frozen weights behave like long-term disk storage.

This priority training lets KAFT override entrenched parametric facts when context disagrees, yet ignore distracting passages, something a plain context window without tailored finetuning cannot guarantee.
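To make the labeling rule concrete, here is a minimal sketch of KAFT-style target selection; the `Example` structure and function names are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of KAFT-style label assignment (assumed structure, not the paper's code).
# Enforced priority: relevant context > pretrained world knowledge > irrelevant context.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    question: str
    context: str                                  # may be empty
    context_type: str                             # "relevant" | "counterfactual" | "irrelevant" | "empty"
    gold_answer: str                              # answer entailed by the relevant context
    counterfactual_answer: Optional[str] = None   # substituted answer, if any

def kaft_target(example: Example, pretrained_answer: str) -> str:
    """Choose the finetuning target so the model learns the context priority."""
    if example.context_type == "relevant":
        return example.gold_answer            # context wins outright
    if example.context_type == "counterfactual":
        return example.counterfactual_answer  # edited context still wins over memory
    # Irrelevant or empty context: train toward the pretrained answer M(q),
    # so the model learns to ignore noise instead of unlearning its facts.
    return pretrained_answer
```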

DIAGRAM

KAFT Training Data Flow Across Context Types

This diagram shows how KAFT constructs and labels training examples for relevant, counterfactual, irrelevant, and empty contexts using multiple QA datasets.

DIAGRAM

Evaluation Pipeline for Controllability and Robustness

This diagram shows how KAFT evaluates controllability on counterfactual TriviaQA heads and robustness on SQuAD 2.0 impossible questions against multiple baselines.

PROCESS

How KAFT Handles a Question Answering Task

  1. Probing pretrained knowledge with bulk inference

    KAFT first runs bulk few-shot prompts to obtain the pretrained model’s answer M(q), which seeds the labels for irrelevant and empty context examples.

  2. Relevant context construction

    KAFT builds logically entailing relevant contexts from SQuAD 2.0, T-REx, QASC, and filtered TriviaQA, sometimes mixing gold and sampled statements.

  3. Counterfactual context construction

    KAFT uses a T5 XXL prompt to generate plausible counterfactual answers and replaces the answer entities in context to create counterfactual slices.

  4. Irrelevant context construction

    KAFT samples easy and hard irrelevant contexts, labels them with the pretrained model’s answer, and trains the model to ignore these passages when predicting. (The full pipeline is sketched in code after this list.)
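Putting the four steps together, the sketch below generates one training slice per context type; `few_shot_answer`, `sample_counterfactual`, and `sample_irrelevant_passage` are placeholder names standing in for the bulk few-shot probing, the T5 XXL counterfactual prompt, and the negative-passage sampling the paper describes, so treat the signatures as assumptions.

```python
# Illustrative end-to-end construction of KAFT training slices for one question.
# The three helpers are stand-ins for the paper's machinery:
#   few_shot_answer(q)            -> pretrained model's answer M(q) via bulk few-shot prompts
#   sample_counterfactual(q, a)   -> plausible wrong answer from a T5 XXL prompt
#   sample_irrelevant_passage(q)  -> easy or hard unrelated passage
def build_kaft_slices(question, relevant_context, gold_answer,
                      few_shot_answer, sample_counterfactual,
                      sample_irrelevant_passage):
    m_q = few_shot_answer(question)  # step 1: probe pretrained knowledge once

    # Step 2: relevant slice -- context logically entails the gold answer.
    yield {"context": relevant_context, "target": gold_answer}

    # Step 3: counterfactual slice -- swap the answer entity inside the context
    # and train toward the swapped answer, so context overrides parametric memory.
    cf_answer = sample_counterfactual(question, gold_answer)
    cf_context = relevant_context.replace(gold_answer, cf_answer)
    yield {"context": cf_context, "target": cf_answer}

    # Step 4: irrelevant slice -- unrelated passage, labeled with M(q) so the
    # model learns to hold its answer rather than chase the distractor.
    yield {"context": sample_irrelevant_passage(question), "target": m_q}

    # Empty-context slice, also labeled with M(q).
    yield {"context": "", "target": m_q}
```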

KEY CONTRIBUTIONS

Key Contributions

  • Joint study of controllability and robustness

    KAFT defines working memory controllability and robustness for LLMs, and shows that standard T5 and PaLM finetuning can worsen both as model size increases.

  • Knowledge aware finetuning (KAFT)

    KAFT enforces the order relevant context > model’s pretrained world knowledge > irrelevant context by combining counterfactual context with irrelevant context labeled with the pretrained model’s answer.

  • Comprehensive empirical evaluation

    KAFT demonstrates up to 24× controllability gains and up to 6× robustness gains over Noisy Finetuning on PaLM 540B, while matching performance on the regular TriviaQA validation set.

RESULTS

By the Numbers

  • Controllability rate: 24× over Noisy Finetuning (+23× on PaLM 540B controllability)

  • Robustness rate: 6× over Noisy Finetuning (+5× on PaLM 540B robustness)

  • Closed book match: 0.7% counterfactual matches for KAFT PaLM 540B vs. 0.6% for the pretrained model, showing little extra memorization of counterfactuals

  • TriviaQA controllability: 69% on head counterfactual questions, +32 percentage points over Noisy Finetuning T5 XXL (37%)

These numbers come from KAFT’s TriviaQA counterfactual controllability and SQuAD 2.0 robustness evaluations, showing that KAFT can strongly prioritize context without sacrificing standard QA accuracy.
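As a rough guide to how these rates could be computed, here is one plausible operationalization; the exact-match normalization and the reading of robustness as "keeps its closed-book answer under noise" are my assumptions about the evaluation, not a transcription of it.

```python
# One plausible scoring sketch for the two headline metrics (details assumed).
def _norm(s: str) -> str:
    return s.strip().lower()

def controllability(predictions, counterfactual_answers):
    """Fraction of counterfactual-context questions where the model follows
    the context, i.e. outputs the substituted (counterfactual) answer."""
    hits = sum(_norm(p) == _norm(a)
               for p, a in zip(predictions, counterfactual_answers))
    return hits / len(predictions)

def robustness(noisy_predictions, closed_book_predictions):
    """Fraction of irrelevant-context questions where the model keeps the
    answer it gives with no context at all, i.e. ignores the distractor."""
    kept = sum(_norm(p) == _norm(c)
               for p, c in zip(noisy_predictions, closed_book_predictions))
    return kept / len(noisy_predictions)
```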


BENCHMARK

Effect of Context Noise on Controllability (Table 7)

Controllability on head counterfactual TriviaQA questions for T5 XXL under different finetuning strategies.

KEY INSIGHT

The Counterintuitive Finding

KAFT shows that larger pretrained and QA finetuned LLMs become less controllable, increasingly ignoring context that contradicts their world knowledge.

This is surprising because practitioners often assume bigger models will naturally use context better, but KAFT reveals that scaling alone can entrench parametric knowledge against contextual corrections.

WHY IT MATTERS

What this unlocks for the field

KAFT unlocks LLMs whose working memory can be reliably steered by context while remaining robust to irrelevant or noisy passages.

Builders can now update or correct factual behavior via contextual statements instead of expensive weight editing or retraining, enabling more maintainable, retrieval-grounded systems.
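As a toy illustration of that workflow, the snippet below assembles a corrective context at inference time; the `question: ... context: ...` prompt format and the off-the-shelf `t5-small` checkpoint are stand-in assumptions, and an actual KAFT-finetuned checkpoint would be needed for the context to reliably win.

```python
# Hypothetical usage: correcting a stale fact via context instead of weight editing.
# Requires: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")   # stand-in; not KAFT-finetuned
model = T5ForConditionalGeneration.from_pretrained("t5-small")

correction = "As of this year, the capital of Examplestan is Newtown."  # made-up fact
prompt = f"question: What is the capital of Examplestan? context: {correction}"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8)
# A KAFT-tuned model should answer from the context ("Newtown") and, given an
# irrelevant passage instead, fall back to its pretrained answer M(q).
print(tok.decode(out[0], skip_special_tokens=True))
```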


