Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Authors: Ying Xie

2026

TL;DR

SleepGate uses a sleep-inspired forgetting gate over the KV cache to cut proactive interference, reaching 99.5% retrieval accuracy at PI depth 5 vs ≤18% for all baselines.


THE PROBLEM

Proactive interference makes KV caches unusable at depth

Wang and Sun show that retrieval accuracy declines log-linearly toward zero as superseded associations accumulate, even when the target sits right next to the query.

On the PI-LLM task, stale values in the KV cache dominate attention, so LLMs repeatedly output overwritten answers, breaking long-horizon working memory.

HOW IT WORKS

SleepGate — sleep-inspired KV cache consolidation

SleepGate’s core mechanism combines a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger to rewrite the KV cache during sleep micro-cycles.

Conceptually, SleepGate treats the KV cache like a brain during sleep, replaying and downscaling synapses so important memories survive while superseded traces fade.

This sleep-inspired gating lets SleepGate selectively suppress stale entries and compress related ones, something a plain context window and uniform attention cannot achieve.

DIAGRAM

Sleep micro-cycle inference flow

This diagram shows how SleepGate runs wake and sleep passes with soft attention biasing during inference.
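The soft attention biasing can be pictured in a few lines (illustrative only: the bias form is an assumption, the retention scores are learned in the paper, and the numbers here are made up). A per-entry retention score enters the attention logits as a log-bias, so low-retention (stale) entries contribute little without being evicted:

```python
import math

def biased_attention(q, keys, values, retention):
    """Soft attention biasing: add log(retention) to each entry's logit,
    so superseded KV entries are suppressed without being deleted."""
    d = len(q)
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + math.log(r)
        for k, r in zip(keys, retention)
    ]
    m = max(logits)  # stable softmax
    w = [math.exp(x - m) for x in logits]
    s = sum(w)
    w = [x / s for x in w]
    return sum(wi * v for wi, v in zip(w, values))

# Two conflicting writes under the same key direction (pure proactive
# interference): the stale entry keeps retention 0.05 after gating.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [0.0, 1.0]  # stale value 0.0, current value 1.0
out = biased_attention(q, keys, values, [0.05, 1.0])
```

With identical keys, unbiased attention would split weight 50/50 and blend both values; the log-bias pushes almost all weight onto the current entry.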

DIAGRAM

Training loop and PI curriculum

This diagram shows how SleepGate is trained in stages with a PI-depth curriculum and dual wake-sleep losses.
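A toy version of such a PI-depth curriculum (the stage schedule, key names, and episode format here are assumptions, not the paper's spec): each episode writes `depth` superseded values to a key before the final current one, and later stages sample deeper interference.

```python
import random

def make_pi_episode(key, depth, rng):
    """Build one PI episode: `depth` superseded updates, then the current
    value. Returns the update stream and the value a model should retrieve."""
    values = [f"v{rng.randrange(1000)}" for _ in range(depth + 1)]
    stream = [(key, v) for v in values]
    return stream, values[-1]

def curriculum(stages=(1, 2, 5, 10), episodes_per_stage=2, seed=0):
    """Stage-wise curriculum: later stages use deeper interference."""
    rng = random.Random(seed)
    for depth in stages:
        for _ in range(episodes_per_stage):
            yield make_pi_episode("k", depth, rng)

eps = list(curriculum())  # 4 stages x 2 episodes = 8 episodes
```

The retrieval target is always the last write, which is exactly what proactive interference makes hard as `depth` grows.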

PROCESS

How SleepGate Handles a PI-LLM Episode

  1. Conflict-Aware Temporal Tagger

    SleepGate uses the Conflict-Aware Temporal Tagger to attach timestamps, semantic signatures, superseded flags, and attention counts to each KV entry.

  2. Sleep Trigger (adaptive scheduling)

    SleepGate monitors attention entropy and conflict density, and the Sleep Trigger's adaptive scheduler decides when to start a sleep micro-cycle over the tagged cache.

  3. Forgetting Gate

    During the sleep micro-cycle, the Forgetting Gate scores each tagged entry, learning retention scores that distinguish current from superseded associations.

  4. Consolidation Module

    The Consolidation Module clusters compressible entries using semantic signatures and merges them into compact summaries that preserve the most recent values.
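The four steps above can be strung together in a toy cache (a heavily simplified sketch: the conflict threshold, decay factor, retrieval rule, and all names are invented for illustration, not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    key: str
    value: str
    t: int
    superseded: bool = False
    retention: float = 1.0

@dataclass
class SleepGateCache:
    """Toy KV cache with tagging, a conflict-density sleep trigger,
    a forgetting gate, and per-key consolidation (illustrative only)."""
    conflict_threshold: float = 0.5
    entries: list = field(default_factory=list)
    clock: int = 0

    def write(self, key, value):
        # 1. Tagger: flag older entries for this key as superseded.
        for e in self.entries:
            if e.key == key:
                e.superseded = True
        self.entries.append(Entry(key, value, self.clock))
        self.clock += 1
        # 2. Sleep Trigger: run a micro-cycle when conflicts dominate.
        if self.conflict_density() > self.conflict_threshold:
            self.sleep_cycle()

    def conflict_density(self):
        if not self.entries:
            return 0.0
        return sum(e.superseded for e in self.entries) / len(self.entries)

    def sleep_cycle(self):
        # 3. Forgetting Gate: downscale superseded entries.
        for e in self.entries:
            if e.superseded:
                e.retention *= 0.1
        # 4. Consolidation: merge each key down to its most recent entry.
        latest = {}
        for e in self.entries:
            latest[e.key] = e
        self.entries = sorted(latest.values(), key=lambda e: e.t)

    def read(self, key):
        # Retrieval favors the highest-retention, most recent entry.
        cands = [e for e in self.entries if e.key == key]
        return max(cands, key=lambda e: (e.retention, e.t)).value

cache = SleepGateCache()
for v in ["a", "b", "c", "d"]:
    cache.write("capital", v)  # three superseding updates, then "d"
```

After four conflicting writes, a read returns the current value because superseded traces were gated down and consolidated away rather than left to compete for attention.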

KEY CONTRIBUTIONS

Key Contributions

  • SleepGate framework for active KV cache management

    SleepGate maps synaptic downscaling, selective replay, and active forgetting into a Conflict-Aware Temporal Tagger, Forgetting Gate, and Consolidation Module operating on the KV cache.

  • Dual-phase wake-sleep training objective

    SleepGate introduces a dual-phase objective combining a wake language-modeling loss with a post-consolidation sleep retrieval loss, a compression loss, and a gate-alignment loss weighted by λg = 0.3.

  • Theoretical and empirical PI reduction

    SleepGate theoretically reduces the effective interference horizon from O(n) to O(log n) and empirically reaches 99.5% retrieval accuracy at PI depth 5 on the PI-LLM benchmark.
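The dual-phase objective can be written as a weighted sum. The weight λg = 0.3 comes from the paper; the other weights and argument names are placeholders:

```python
def dual_phase_loss(lm_loss, retrieval_loss, compression_loss,
                    gate_alignment_loss, lam_g=0.3, lam_r=1.0, lam_c=1.0):
    """Combine wake and sleep losses. lam_g = 0.3 follows the paper;
    lam_r and lam_c are illustrative placeholders."""
    wake = lm_loss                                        # wake phase
    sleep = lam_r * retrieval_loss + lam_c * compression_loss  # sleep phase
    return wake + sleep + lam_g * gate_alignment_loss

# Example with made-up loss values:
total = dual_phase_loss(2.0, 0.5, 0.25, 1.0)  # 2.0 + 0.5 + 0.25 + 0.3
```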

RESULTS

By the Numbers

Retrieval accuracy (n=5): 99.5% (+89.5 pts over StreamingLLM)

Retrieval accuracy (n=10): 97.0% (+91.0 pts over StreamingLLM)

Retrieval accuracy (n=2): 99.0% (+81.0 pts over StreamingLLM)

Base transformer parameters: 793,344 (sleep modules add 15.6% overhead)

On the synthetic PI-LLM benchmark with depths 1–30, SleepGate is evaluated against Full KV Cache, Sliding Window, H2O, StreamingLLM, and Decay Only. The main result shows SleepGate maintains near-perfect retrieval through PI depth 10 while all baselines remain below 18% accuracy at every depth.

BENCHMARK

Retrieval accuracy at PI depth n=5

Retrieval accuracy (%) on PI-LLM episodes with 5 prior superseding updates.

KEY INSIGHT

The Counterintuitive Finding

SleepGate reaches 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while all five baselines stay below 18% everywhere.

This is surprising because simply changing KV cache management, without enlarging the context window, reverses the log-linear accuracy collapse that had seemed inherent to transformer attention.

WHY IT MATTERS

What this unlocks for the field

SleepGate unlocks content-aware, sleep-like forgetting inside the KV cache, letting transformers keep current facts accessible even after many conflicting updates.

Builders can now design long-horizon streaming systems in which stale context is actively suppressed rather than merely truncated, enabling reliable retrieval in settings where prompt engineering previously failed.


Related papers

Memory Architecture

Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Yasong Fan · 2026

Fan Duality Model (FDM) uses the Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to separate wave-like compression from particle-like associative recall. On WikiText-103, FDM reaches 64.9 perplexity with Freeze-Scan and 62.79 with holographic decoding, while achieving 0.966 MQAR accuracy versus 0.606 for a Transformer baseline.
