Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Authors: Ying Xie

2026

TL;DR

SleepGate uses a sleep-inspired forgetting gate over the KV cache to cut proactive interference, reaching 99.5% retrieval accuracy at PI depth 5 vs ≤18% for all baselines.


THE PROBLEM

Proactive interference makes KV caches unusable at depth

Wang and Sun show that retrieval accuracy declines log-linearly toward zero as superseded associations accumulate, even when the target sits right next to the query.

On the PI-LLM task, stale values in the KV cache dominate attention, so LLMs repeatedly output overwritten answers, breaking long-horizon working memory.

HOW IT WORKS

SleepGate — sleep-inspired KV cache consolidation

SleepGate’s core mechanism combines a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger to rewrite the KV cache during sleep micro-cycles.

Conceptually, SleepGate treats the KV cache like a brain during sleep, replaying and downscaling synapses so important memories survive while superseded traces fade.

This sleep-inspired gating lets SleepGate selectively suppress stale entries and compress related ones, something a plain context window and uniform attention cannot achieve.

DIAGRAM

Sleep micro-cycle inference flow

This diagram shows how SleepGate runs wake and sleep passes with soft attention biasing during inference.
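The soft attention biasing can be pictured in a few lines (illustrative only: the bias form is an assumption, the retention scores are learned in the paper, and the numbers here are made up). A per-entry retention score enters the attention logits as a log-bias, so low-retention (stale) entries contribute little without being evicted:

```python
import math

def biased_attention(q, keys, values, retention):
    """Soft attention biasing: add log(retention) to each entry's logit,
    so superseded KV entries are suppressed without being deleted."""
    d = len(q)
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + math.log(r)
        for k, r in zip(keys, retention)
    ]
    m = max(logits)  # stable softmax
    w = [math.exp(x - m) for x in logits]
    s = sum(w)
    w = [x / s for x in w]
    return sum(wi * v for wi, v in zip(w, values))

# Two conflicting writes under the same key direction (pure proactive
# interference): the stale entry keeps retention 0.05 after gating.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [0.0, 1.0]  # stale value 0.0, current value 1.0
out = biased_attention(q, keys, values, [0.05, 1.0])
```

With identical keys, unbiased attention would split weight 50/50 and blend both values; the log-bias pushes almost all weight onto the current entry.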

DIAGRAM

Training loop and PI curriculum

This diagram shows how SleepGate is trained in stages with a PI-depth curriculum and dual wake-sleep losses.
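A toy version of such a PI-depth curriculum (the stage schedule, key names, and episode format here are assumptions, not the paper's spec): each episode writes `depth` superseded values to a key before the final current one, and later stages sample deeper interference.

```python
import random

def make_pi_episode(key, depth, rng):
    """Build one PI episode: `depth` superseded updates, then the current
    value. Returns the update stream and the value a model should retrieve."""
    values = [f"v{rng.randrange(1000)}" for _ in range(depth + 1)]
    stream = [(key, v) for v in values]
    return stream, values[-1]

def curriculum(stages=(1, 2, 5, 10), episodes_per_stage=2, seed=0):
    """Stage-wise curriculum: later stages use deeper interference."""
    rng = random.Random(seed)
    for depth in stages:
        for _ in range(episodes_per_stage):
            yield make_pi_episode("k", depth, rng)

eps = list(curriculum())  # 4 stages x 2 episodes = 8 episodes
```

The retrieval target is always the last write, which is exactly what proactive interference makes hard as `depth` grows.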

PROCESS

How SleepGate Handles a PI-LLM Episode

  1. Conflict-Aware Temporal Tagger

    SleepGate uses the Conflict-Aware Temporal Tagger to attach timestamps, semantic signatures, superseded flags, and attention counts to each KV entry.

  2. Sleep Trigger (adaptive scheduling)

    SleepGate monitors attention entropy and conflict density, and the Sleep Trigger's adaptive scheduler decides when to start a sleep micro-cycle over the tagged cache.

  3. Forgetting Gate

    During the sleep micro-cycle, the Forgetting Gate scores each tagged entry, learning retention scores that distinguish current from superseded associations.

  4. Consolidation Module

    The Consolidation Module clusters compressible entries using semantic signatures and merges them into compact summaries that preserve the most recent values.
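The four steps above can be strung together in a toy cache (a heavily simplified sketch: the conflict threshold, decay factor, retrieval rule, and all names are invented for illustration, not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    key: str
    value: str
    t: int
    superseded: bool = False
    retention: float = 1.0

@dataclass
class SleepGateCache:
    """Toy KV cache with tagging, a conflict-density sleep trigger,
    a forgetting gate, and per-key consolidation (illustrative only)."""
    conflict_threshold: float = 0.5
    entries: list = field(default_factory=list)
    clock: int = 0

    def write(self, key, value):
        # 1. Tagger: flag older entries for this key as superseded.
        for e in self.entries:
            if e.key == key:
                e.superseded = True
        self.entries.append(Entry(key, value, self.clock))
        self.clock += 1
        # 2. Sleep Trigger: run a micro-cycle when conflicts dominate.
        if self.conflict_density() > self.conflict_threshold:
            self.sleep_cycle()

    def conflict_density(self):
        if not self.entries:
            return 0.0
        return sum(e.superseded for e in self.entries) / len(self.entries)

    def sleep_cycle(self):
        # 3. Forgetting Gate: downscale superseded entries.
        for e in self.entries:
            if e.superseded:
                e.retention *= 0.1
        # 4. Consolidation: merge each key down to its most recent entry.
        latest = {}
        for e in self.entries:
            latest[e.key] = e
        self.entries = sorted(latest.values(), key=lambda e: e.t)

    def read(self, key):
        # Retrieval favors the highest-retention, most recent entry.
        cands = [e for e in self.entries if e.key == key]
        return max(cands, key=lambda e: (e.retention, e.t)).value

cache = SleepGateCache()
for v in ["a", "b", "c", "d"]:
    cache.write("capital", v)  # three superseding updates, then "d"
```

After four conflicting writes, a read returns the current value because superseded traces were gated down and consolidated away rather than left to compete for attention.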

KEY CONTRIBUTIONS

Key Contributions

  • SleepGate framework for active KV cache management

    SleepGate maps synaptic downscaling, selective replay, and active forgetting into a Conflict-Aware Temporal Tagger, Forgetting Gate, and Consolidation Module operating on the KV cache.

  • Dual-phase wake-sleep training objective

    SleepGate introduces a dual-phase objective combining a wake language-modeling loss with a post-consolidation sleep retrieval loss, a compression loss, and a gate-alignment loss weighted by λg = 0.3.

  • Theoretical and empirical PI reduction

    SleepGate theoretically reduces the effective interference horizon from O(n) to O(log n) and empirically reaches 99.5% retrieval accuracy at PI depth 5 on the PI-LLM benchmark.
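The dual-phase objective can be written as a weighted sum. The weight λg = 0.3 comes from the paper; the other weights and argument names are placeholders:

```python
def dual_phase_loss(lm_loss, retrieval_loss, compression_loss,
                    gate_alignment_loss, lam_g=0.3, lam_r=1.0, lam_c=1.0):
    """Combine wake and sleep losses. lam_g = 0.3 follows the paper;
    lam_r and lam_c are illustrative placeholders."""
    wake = lm_loss                                        # wake phase
    sleep = lam_r * retrieval_loss + lam_c * compression_loss  # sleep phase
    return wake + sleep + lam_g * gate_alignment_loss

# Example with made-up loss values:
total = dual_phase_loss(2.0, 0.5, 0.25, 1.0)  # 2.0 + 0.5 + 0.25 + 0.3
```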

RESULTS

By the Numbers

Retrieval accuracy (n=5): 99.5% (+89.5 pts over StreamingLLM)

Retrieval accuracy (n=10): 97.0% (+91.0 pts over StreamingLLM)

Retrieval accuracy (n=2): 99.0% (+81.0 pts over StreamingLLM)

Base transformer parameters: 793,344 (sleep modules add 15.6% overhead)

On the synthetic PI-LLM benchmark with depths 1–30, SleepGate is evaluated against Full KV Cache, Sliding Window, H2O, StreamingLLM, and Decay Only. The main result shows SleepGate maintains near-perfect retrieval through PI depth 10 while all baselines remain below 18% accuracy at every depth.

BENCHMARK

Retrieval accuracy at PI depth n=5

Retrieval accuracy (%) on PI-LLM episodes with 5 prior superseding updates.

KEY INSIGHT

The Counterintuitive Finding

SleepGate reaches 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while all five baselines stay below 18% everywhere.

This is surprising because simply changing KV cache management, without enlarging the context window, reverses the log-linear accuracy collapse that had seemed inherent to transformer attention.

WHY IT MATTERS

What this unlocks for the field

SleepGate unlocks content-aware, sleep-like forgetting inside the KV cache, letting transformers keep current facts accessible even after many conflicting updates.

Builders can now design long-horizon streaming systems in which stale context is actively suppressed rather than merely truncated, enabling reliable retrieval in settings where prompt engineering previously failed.


Related papers

Memory Architecture

Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Yasong Fan · 2026

Fan Duality Model (FDM) uses the Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to separate wave-like compression from particle-like associative recall. On WikiText-103, FDM reaches 64.9 perplexity with Freeze-Scan and 62.79 with holographic decoding, while achieving 0.966 MQAR accuracy versus 0.606 for a Transformer baseline.
