Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Authors: Yasong Fan

2026

TL;DR

Fan Duality Model (FDM) combines a phase-preserving Fan Operator with a local-global cache and Freeze-Scan training to reach 0.966 MQAR accuracy with a fixed 867 MB, O(1) decode memory footprint.



THE PROBLEM

KV cache memory explodes to 4,247 MB at 8k tokens

Transformers need 4,247 MB KV cache at N=8,192 for a 137M model, while Fan Duality Model (FDM) stays at 867 MB.

This KV cache growth makes long-context decoding impractical: decode memory balloons with sequence length, Transformer throughput degrades by 83%, and associative recall tasks become limited.
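For intuition, here is a back-of-the-envelope sketch of why a Transformer's KV cache grows linearly with sequence length while an O(1) design stays flat. The layer, head, and precision settings below are illustrative assumptions, not the paper's 137M configuration, so the absolute numbers will not match the 4,247 MB figure; only the O(N) shape is the point.

```python
def kv_cache_mb(seq_len, n_layers=12, n_heads=12, head_dim=64,
                bytes_per_el=2, batch=1):
    """Per-sequence KV cache size in MB for a vanilla Transformer.

    Each token stores one key and one value vector (the factor 2) in
    every layer and head, so the total grows linearly with seq_len.
    """
    total = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_el * batch
    return total / 2**20

for n in (128, 1024, 8192):
    print(f"N={n:>5}: {kv_cache_mb(n):7.1f} MB")  # linear in N
```

Doubling the prompt doubles this cache, whereas FDM's W=256 local plus K=16 global slots give a footprint that is independent of sequence length.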

HOW IT WORKS

Fan Duality Model — wave scan plus particle cache

Fan Duality Model (FDM) combines a Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to decouple wave and particle behavior.

You can think of Fan Duality Model (FDM) like RAM plus an indexed card catalog: the wave state is compact RAM, while the cache is a small, addressable index.

This dual design lets Fan Duality Model (FDM) keep O(1) memory while still doing precise associative recall that a plain context window or pure scan cannot.
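The card-catalog analogy can be sketched as a toy cache with a sliding local window plus a few salience-selected global slots. W and K match the paper's reported sizes; the key dimension, the scalar score (a stand-in for s_eff), and the promotion policy are assumptions made for illustration, not the paper's actual addressing scheme.

```python
import numpy as np

W, K, D = 256, 16, 64  # local window and global slots per the paper; D assumed


class LocalGlobalCache:
    """Toy particle cache: a sliding local window plus K global slots."""

    def __init__(self):
        self.local = []                           # most recent W (key, value) pairs
        self.global_keys = np.zeros((K, D))
        self.global_vals = np.zeros((K, D))
        self.global_scores = np.full(K, -np.inf)  # salience per global slot

    def write(self, key, value, score):
        self.local.append((key, value))
        if len(self.local) > W:
            self.local.pop(0)                     # evict the oldest local entry
        # Promote to a global slot if more salient than the weakest occupant.
        j = int(np.argmin(self.global_scores))
        if score > self.global_scores[j]:
            self.global_keys[j] = key
            self.global_vals[j] = value
            self.global_scores[j] = score

    def read(self, query):
        keys = np.array([k for k, _ in self.local] + list(self.global_keys))
        vals = np.array([v for _, v in self.local] + list(self.global_vals))
        w = np.exp(keys @ query)
        w /= w.sum()                              # softmax associative addressing
        return w @ vals
```

However long the stream, storage is bounded by W + K slots, which is the source of the O(1) decode-memory behavior.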

DIAGRAM

Freeze Scan training pipeline for Fan Duality Model

This diagram shows how Fan Duality Model (FDM) alternates between full training and cache-only optimization in the Freeze-Scan strategy.

DIAGRAM

Evaluation pipeline for language modeling and MQAR

This diagram shows how Fan Duality Model (FDM) is evaluated on WikiText-103, MQAR, and downstream benchmarks against Transformer baselines.

PROCESS

How Fan Duality Model handles a sequence modeling task

  1. The Fan Operator

    Fan Duality Model (FDM) first applies the Fan Operator recurrent scan, using phase-preserving Givens rotations to update the complex hidden state h_t from the token embeddings.

  2. Local-Global Cache

    FDM populates the Local-Global Cache with W=256 local and K=16 global slots, selected by s_eff-based associative addressing.

  3. Freeze-Scan Training

    FDM runs Freeze-Scan Training: all parameters are trained first, then Φ_wave is frozen while Φ_cache and the embeddings are specialized for induction-style recall.

  4. Holographic Reference Beam Decoding

    FDM finally uses Holographic Reference Beam Decoding to modulate h_t with x_t, with 4-head orthogonal beams improving PPL by 2.13 points.
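The pipeline above hinges on the Fan Operator's phase-preserving transition. A minimal sketch, using real 2x2 Givens rotation blocks as a stand-in for the paper's complex phase update; the gating, input projections, and cache interaction are omitted:

```python
import numpy as np


def givens_rotation(d, thetas):
    """Block-diagonal rotation acting on d/2 independent 2-D planes."""
    R = np.zeros((d, d))
    for i, t in enumerate(thetas):
        c, s = np.cos(t), np.sin(t)
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c, -s
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s, c
    return R


def fan_scan(xs, thetas):
    """Recurrent scan h_t = R h_{t-1} + x_t with a rotation transition.

    Because R is orthogonal (norm-preserving), past information is
    phase-shifted rather than exponentially decayed.
    """
    d = xs.shape[1]
    R = givens_rotation(d, thetas)
    h = np.zeros(d)
    hs = []
    for x in xs:
        h = R @ h + x
        hs.append(h.copy())
    return np.array(hs)
```

The design choice to keep the transition a pure rotation is what "phase-preserving" buys: no eigenvalue of R has magnitude below 1, so old tokens are never attenuated by the recurrence itself, only re-oriented in phase.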

KEY CONTRIBUTIONS

Key Contributions

  • FDM architecture

    Fan Duality Model (FDM) introduces a Fan Operator wave scan plus a Local-Global Cache particle component, achieving O(1) decode memory of 867 MB with W=256 local and K=16 global slots.

  • Freeze-Scan Training

    FDM uses Freeze-Scan Training to avoid gradient sinks, improving WikiText-103 perplexity from 487 to 64.9 in 44K steps and crossing PPL 100 at 17K steps.

  • Holographic Reference Beam Decoding

    FDM adds Holographic Reference Beam Decoding, where a 4-head orthogonal reference beam reduces PPL by 2.13 points to 62.79 with only 1.3M extra parameters.
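The Freeze-Scan contribution boils down to a two-phase gradient schedule. A minimal sketch, assuming a flat name-to-parameter mapping with illustrative prefixes (`wave.`, `cache.`, `embed.`) standing in for Φ_wave, Φ_cache, and the embeddings; the phase boundary is a free hyperparameter here, not a value taken from the paper:

```python
def freeze_scan_trainable(params, step, phase1_steps):
    """Select which parameter groups receive gradients at a given step.

    Phase 1 (step < phase1_steps): train everything jointly.
    Phase 2: freeze the wave-scan parameters and keep specializing the
    cache path and embeddings for induction-style recall.
    """
    if step < phase1_steps:
        return dict(params)
    return {name: p for name, p in params.items()
            if name.startswith(("cache.", "embed."))}
```

In a real training loop (e.g. PyTorch) the same split would be applied by toggling `requires_grad` on the wave parameters at the phase boundary, so the optimizer only updates the cache path afterwards.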

RESULTS

By the Numbers

Val PPL

64.9

-423.1 vs FDM — Full fine-tuning (487)

Val PPL

62.79

-2.13 vs FDM — Freeze-Scan (64.9)

MQAR accuracy

0.966

+0.360 over Transformer

Decode Memory (MB)

867

-3,380 MB vs Transformer at N=8,192

On WikiText-103 and MQAR, Fan Duality Model (FDM) trades some language modeling PPL for dramatically better associative recall and O(1) decode memory. These results show Fan Duality Model (FDM) can reach 0.966 MQAR accuracy while keeping decode memory fixed at 867 MB across 128–8,192 token prompts.


BENCHMARK

Table 3: MQAR accuracy (Easy: seq=64, 8 KV pairs)

Accuracy on the Multi-Query Associative Recall (MQAR) Easy setting.

BENCHMARK

Table 1: WikiText-103 validation perplexity

Validation perplexity on WikiText-103 for Fan Duality Model variants and Transformer.

KEY INSIGHT

The Counterintuitive Finding

Fan Duality Model (FDM) with pure scan (K=0) scores only 0.011 MQAR accuracy, while adding just K=16 cache slots jumps to 0.966.

This is surprising because linear recurrent models are often assumed to store long-range information in their hidden state, yet Fan Duality Model (FDM) shows that without a tiny particle cache, associative recall nearly collapses.

WHY IT MATTERS

What this unlocks for the field

Fan Duality Model (FDM) unlocks fixed 867 MB decode memory and stable decode speed across 128–8,192 tokens while still achieving strong associative recall.

Builders can now design long context systems where memory cost is O(1) in sequence length, yet Fan Duality Model (FDM) still retrieves specific tokens via a tiny learned cache and holographic decoding.


