Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Authors: Chupei Wang, Jiaqiu Vince Sun

2025

TL;DR

PI-LLM shows that retrieval in long contexts is limited by a unified interference bottleneck, with accuracy dropping log-linearly toward zero as the number of updates per key, the number of tracked keys, or the value length increases.



THE PROBLEM

LLMs Confuse Overwritten Facts Even When the Correct Value Is Last

PI-LLM shows retrieval accuracy declines log-linearly toward zero as the update count per key increases from 3 to 400, even though the target values appear immediately before the query.

On this synthetic key–value tracking task, many state-of-the-art LLMs return outdated values, revealing proactive interference and a working-memory bottleneck beyond context length.

HOW IT WORKS

PI-LLM — Proactive Interference as a Working Memory Probe

PI-LLM centers on a synthetic key–value retrieval task, the Interference Endurance Score (IES), and explicit per-key forget prompts plus a mock QA reset intervention.

You can think of PI-LLM as stress-testing RAM: repeatedly overwriting the same variables and measuring how often the system still reads stale values.

This design lets PI-LLM expose a unified interference bottleneck that a plain context window cannot reveal, dissociating storage capacity from flexible, selective retrieval.
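As a minimal sketch, the update stream for one trial might be built like this; the key names, value strings, and query wording are illustrative assumptions, not the paper's exact materials:

```python
import random

# Minimal sketch of a PI-LLM-style key-value tracking prompt.
# Key names, value strings, and the query wording are illustrative
# assumptions, not the paper's exact materials.

def build_update_stream(keys, updates_per_key, rng):
    """Interleave 'key: value' updates; only the last value per key is correct."""
    events = [(key, f"v{k}_{i}") for k, key in enumerate(keys)
              for i in range(updates_per_key)]
    rng.shuffle(events)  # interleave updates across keys
    ground_truth, lines = {}, []
    for key, value in events:
        ground_truth[key] = value  # each later update overwrites the earlier one
        lines.append(f"{key}: {value}")
    prompt = "\n".join(lines) + "\n\nReport the CURRENT value of each key."
    return prompt, ground_truth

prompt, truth = build_update_stream(["blood_pressure", "visitor_count"],
                                    updates_per_key=4, rng=random.Random(0))
```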

DIAGRAM

Key–Value Update and Query Flow in PI-LLM

This diagram shows how PI-LLM streams key–value updates, then queries only the final values to isolate proactive interference from search difficulty.

DIAGRAM

Evaluation Pipeline for Interference Manipulations

This diagram shows how PI-LLM varies update count, number of keys, and value length, then computes Interference Endurance Score across models.
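A rough Python sketch of this pipeline, reusing build_update_stream from the sketch above. Here query_model is a hypothetical stand-in for an actual LLM call, and ies_stand_in is an illustrative summary of the interference curve, not the paper's exact Interference Endurance Score formula:

```python
import math
import random

# Sketch of the evaluation loop. `query_model` is a hypothetical stand-in
# for a real LLM call, and `ies_stand_in` is an illustrative summary of
# the interference curve, not the paper's exact IES formula.

def accuracy_at(model, keys, n_updates, trials=20):
    correct = total = 0
    for seed in range(trials):
        prompt, truth = build_update_stream(keys, n_updates, random.Random(seed))
        answers = query_model(model, prompt)  # hypothetical API: returns {key: value}
        correct += sum(answers.get(k) == v for k, v in truth.items())
        total += len(truth)
    return correct / total

def interference_curve(model, keys, update_counts=(3, 10, 30, 100, 400)):
    return {n: accuracy_at(model, keys, n) for n in update_counts}

def ies_stand_in(curve):
    # Area under accuracy vs log(update count): higher = more interference-resistant.
    ns = sorted(curve)
    return sum(0.5 * (curve[a] + curve[b]) * (math.log(b) - math.log(a))
               for a, b in zip(ns, ns[1:]))
```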

PROCESS

How PI-LLM Handles a Proactive Interference Retrieval Task

  1. Experimental Design

    PI-LLM defines the synthetic key–value retrieval task and the Interference Endurance Score (IES) to probe working-memory-like behavior under interference.

  2. Interference Dominates Retrieval Despite Recency and Instructions

    PI-LLM varies the update count per key across 46 unique keys, showing a log-linear accuracy decline and errors driven by proactive interference.

  3. Interference Is Independent of Input Length

    PI-LLM manipulates Updated Keys and Tracked Keys at fixed input length, demonstrating that interference, not context size, drives failures.

  4. Mitigating Interference: Empirical Insights from LLM–Human Comparison

    PI-LLM tests per-key forget prompts and the mock QA reset, revealing limited gains from natural-language instructions and only partial relief from the reset intervention (see the sketch after this list).
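As a rough illustration of step 4, the two interventions can be expressed as prompt transformations like the ones below; the exact instruction wording and reset format used in the paper may differ:

```python
# Illustrative sketches of the two interventions from step 4; the exact
# instruction wording and reset format used in the paper may differ.

def with_forget_prompts(lines):
    """Insert an explicit forget instruction before every overwrite of a key."""
    seen, out = set(), []
    for line in lines:
        key = line.split(":", 1)[0]
        if key in seen:
            out.append(f"(Disregard any earlier value of {key}.)")
        seen.add(key)
        out.append(line)
    return out

def with_mock_qa_reset(lines, every=50):
    """Interleave a short unrelated QA exchange to mimic a context reset."""
    out = []
    for i, line in enumerate(lines, 1):
        out.append(line)
        if i % every == 0:
            out += ["Q: What color is the sky?", "A: Blue."]  # filler QA block
    return out
```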

KEY CONTRIBUTIONS

Key Contributions

  • Interference Dominates Retrieval Despite Recency and Instructions

    PI-LLM shows retrieval accuracy declines approximately log-linearly as update count per key increases from 3 to 400 with 46 keys, and errors mainly retrieve overwritten values (see the fitting sketch after this list).

  • Interference Is Independent of Input Length

    PI-LLM demonstrates similar log-linear declines when varying Updated Keys and Tracked Keys under fixed-length inputs, revealing a unified interference limit beyond context window size.

  • Retrieval Capacity Is Limited by a Single Interference Bottleneck Across Dimensions

    PI-LLM increases value length from 1 to 40 items and finds accuracy falls below 5%, indicating a shared anti-interference resource across updates, keys, and value length.
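To make the log-linear claim concrete: a log-linear decline means accuracy falls roughly linearly in the logarithm of the update count. A sketch of checking such a trend, using made-up accuracy values rather than the paper's data:

```python
import numpy as np

# A log-linear decline means accuracy falls roughly linearly in the log of
# the update count. The accuracy values here are made up for illustration,
# not taken from the paper.

update_counts = np.array([3, 10, 30, 100, 400])
accuracy = np.array([0.95, 0.80, 0.62, 0.41, 0.18])  # hypothetical measurements

slope, intercept = np.polyfit(np.log(update_counts), accuracy, 1)
print(f"accuracy ≈ {intercept:.2f} + {slope:.2f} · ln(updates)")
# A tight linear fit with a negative slope is the signature of log-linear decay.
```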

RESULTS

By the Numbers

Accuracy vs update count

0% at 400 updates per key for most M-class models

L-class models retain substantially higher accuracy at 400 updates per key

IES vs size class

R² = 0.261 for the regression of IES on size class and context length

parameter size class t = 3.03, p = 0.005; context length t = −0.144, p = 0.886

Size IES correlation

ρ = 0.673, p = 0.00158

computed for 19 models with 128k–131k context length

Value length effect

< 5% accuracy at 40 items

all models drop below 40% accuracy by a value length of 10 items

PI-LLM evaluates 0.6B–637B-parameter LLMs on a synthetic key–value retrieval benchmark with proactive interference manipulations, showing universal log-linear decay and that Interference Endurance Score tracks parameter size, not context window.
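For readers who want to reproduce this kind of analysis, here is a sketch of the two statistics above, run on placeholder data (the IES values, size classes, and context lengths are not the paper's measurements):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# Sketch of the two analyses behind the numbers above, on made-up data;
# the IES values, size classes, and context lengths are placeholders.

size_class     = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # XS=0, S=1, M=2, L=3
context_length = np.array([128, 131, 128, 131, 128, 131, 128, 131])  # k tokens
ies            = np.array([0.10, 0.22, 0.30, 0.27, 0.48, 0.45, 0.70, 0.81])

# Spearman rank correlation between size class and IES.
rho, p = stats.spearmanr(size_class, ies)
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")

# OLS regression of IES on size class and context length; the reported
# t-statistics and R² would come from a fit like this.
X = sm.add_constant(np.column_stack([size_class, context_length]))
fit = sm.OLS(ies, X).fit()
print(fit.tvalues, fit.pvalues, fit.rsquared)
```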


BENCHMARK

Correlation Between Model Size Class and Interference Endurance Score

Interference Endurance Score (IES) across XS, S, M, and L size classes for models with 128k–131k context length.

KEY INSIGHT

The Counterintuitive Finding

PI-LLM finds that context window length has no significant effect on Interference Endurance Score (t = −0.144, p = 0.886), while parameter size strongly correlates.

This breaks the common assumption that simply extending context windows solves long-context retrieval, showing that interference resistance is a separate, capacity-like resource.

WHY IT MATTERS

What this unlocks for the field

PI-LLM provides a cognitively grounded way to quantify working-memory-like limits in LLMs via interference curves and the Interference Endurance Score.

Armed with PI-LLM, builders can design and benchmark architectures or memory mechanisms that explicitly target interference suppression, rather than just increasing context length.


