Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Authors: Yasong Fan

2026

TL;DR

Fan Duality Model (FDM) combines a phase-preserving Fan Operator with a local-global cache and Freeze-Scan training to reach 0.966 MQAR accuracy with a fixed 867 MB, O(1) decode memory footprint.



THE PROBLEM

KV cache memory explodes to 4,247 MB at 8k tokens

Transformers need 4,247 MB KV cache at N=8,192 for a 137M model, while Fan Duality Model (FDM) stays at 867 MB.

This KV cache growth makes long-context decoding impractical: decode memory balloons with sequence length, Transformer throughput degrades by 83%, and associative recall tasks become limited.
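For intuition, here is a back-of-the-envelope sketch of why a Transformer's KV cache grows linearly with sequence length while an O(1) design stays flat. The layer, head, and precision settings below are illustrative assumptions, not the paper's 137M configuration, so the absolute numbers will not match the 4,247 MB figure; only the O(N) shape is the point.

```python
def kv_cache_mb(seq_len, n_layers=12, n_heads=12, head_dim=64,
                bytes_per_el=2, batch=1):
    """Per-sequence KV cache size in MB for a vanilla Transformer.

    Each token stores one key and one value vector (the factor 2) in
    every layer and head, so the total grows linearly with seq_len.
    """
    total = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_el * batch
    return total / 2**20

for n in (128, 1024, 8192):
    print(f"N={n:>5}: {kv_cache_mb(n):7.1f} MB")  # linear in N
```

Doubling the prompt doubles this cache, whereas FDM's W=256 local plus K=16 global slots give a footprint that is independent of sequence length.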

HOW IT WORKS

Fan Duality Model — wave scan plus particle cache

Fan Duality Model (FDM) combines a Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to decouple wave and particle behavior.

You can think of Fan Duality Model (FDM) like RAM plus an indexed card catalog: the wave state is compact RAM, while the cache is a small, addressable index.

This dual design lets Fan Duality Model (FDM) keep O(1) memory while still doing precise associative recall that a plain context window or pure scan cannot.
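The card-catalog analogy can be sketched as a toy cache with a sliding local window plus a few salience-selected global slots. W and K match the paper's reported sizes; the key dimension, the scalar score (a stand-in for s_eff), and the promotion policy are assumptions made for illustration, not the paper's actual addressing scheme.

```python
import numpy as np

W, K, D = 256, 16, 64  # local window and global slots per the paper; D assumed


class LocalGlobalCache:
    """Toy particle cache: a sliding local window plus K global slots."""

    def __init__(self):
        self.local = []                           # most recent W (key, value) pairs
        self.global_keys = np.zeros((K, D))
        self.global_vals = np.zeros((K, D))
        self.global_scores = np.full(K, -np.inf)  # salience per global slot

    def write(self, key, value, score):
        self.local.append((key, value))
        if len(self.local) > W:
            self.local.pop(0)                     # evict the oldest local entry
        # Promote to a global slot if more salient than the weakest occupant.
        j = int(np.argmin(self.global_scores))
        if score > self.global_scores[j]:
            self.global_keys[j] = key
            self.global_vals[j] = value
            self.global_scores[j] = score

    def read(self, query):
        keys = np.array([k for k, _ in self.local] + list(self.global_keys))
        vals = np.array([v for _, v in self.local] + list(self.global_vals))
        w = np.exp(keys @ query)
        w /= w.sum()                              # softmax associative addressing
        return w @ vals
```

However long the stream, storage is bounded by W + K slots, which is the source of the O(1) decode-memory behavior.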

DIAGRAM

Freeze Scan training pipeline for Fan Duality Model

This diagram shows how Fan Duality Model (FDM) alternates between full training and cache-only optimization in the Freeze-Scan strategy.

DIAGRAM

Evaluation pipeline for language modeling and MQAR

This diagram shows how Fan Duality Model (FDM) is evaluated on WikiText-103, MQAR, and downstream benchmarks against Transformer baselines.

PROCESS

How Fan Duality Model handles a sequence modeling task

  1. The Fan Operator

    Fan Duality Model (FDM) first applies the Fan Operator recurrent scan, using phase-preserving Givens rotations to update the complex hidden state h_t from the token embeddings.

  2. Local-Global Cache

    FDM populates the Local-Global Cache with W=256 local and K=16 global slots, selected by s_eff-based associative addressing.

  3. Freeze-Scan Training

    FDM runs Freeze-Scan Training: all parameters are trained first, then Φ_wave is frozen while Φ_cache and the embeddings are specialized for induction-style recall.

  4. Holographic Reference Beam Decoding

    FDM finally uses Holographic Reference Beam Decoding to modulate h_t with x_t, with 4-head orthogonal beams improving PPL by 2.13 points.
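The pipeline above hinges on the Fan Operator's phase-preserving transition. A minimal sketch, using real 2x2 Givens rotation blocks as a stand-in for the paper's complex phase update; the gating, input projections, and cache interaction are omitted:

```python
import numpy as np


def givens_rotation(d, thetas):
    """Block-diagonal rotation acting on d/2 independent 2-D planes."""
    R = np.zeros((d, d))
    for i, t in enumerate(thetas):
        c, s = np.cos(t), np.sin(t)
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c, -s
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s, c
    return R


def fan_scan(xs, thetas):
    """Recurrent scan h_t = R h_{t-1} + x_t with a rotation transition.

    Because R is orthogonal (norm-preserving), past information is
    phase-shifted rather than exponentially decayed.
    """
    d = xs.shape[1]
    R = givens_rotation(d, thetas)
    h = np.zeros(d)
    hs = []
    for x in xs:
        h = R @ h + x
        hs.append(h.copy())
    return np.array(hs)
```

The design choice to keep the transition a pure rotation is what "phase-preserving" buys: no eigenvalue of R has magnitude below 1, so old tokens are never attenuated by the recurrence itself, only re-oriented in phase.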

KEY CONTRIBUTIONS

Key Contributions

  • FDM architecture

    Fan Duality Model (FDM) introduces a Fan Operator wave scan plus a Local-Global Cache particle component, achieving O(1) decode memory of 867 MB with W=256 local and K=16 global slots.

  • Freeze-Scan Training

    FDM uses Freeze-Scan Training to avoid gradient sinks, improving WikiText-103 perplexity from 487 to 64.9 in 44K steps and crossing PPL 100 at 17K steps.

  • Holographic Reference Beam Decoding

    FDM adds Holographic Reference Beam Decoding, where a 4-head orthogonal reference beam reduces PPL by 2.13 points to 62.79 with only 1.3M extra parameters.
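The Freeze-Scan contribution boils down to a two-phase gradient schedule. A minimal sketch, assuming a flat name-to-parameter mapping with illustrative prefixes (`wave.`, `cache.`, `embed.`) standing in for Φ_wave, Φ_cache, and the embeddings; the phase boundary is a free hyperparameter here, not a value taken from the paper:

```python
def freeze_scan_trainable(params, step, phase1_steps):
    """Select which parameter groups receive gradients at a given step.

    Phase 1 (step < phase1_steps): train everything jointly.
    Phase 2: freeze the wave-scan parameters and keep specializing the
    cache path and embeddings for induction-style recall.
    """
    if step < phase1_steps:
        return dict(params)
    return {name: p for name, p in params.items()
            if name.startswith(("cache.", "embed."))}
```

In a real training loop (e.g. PyTorch) the same split would be applied by toggling `requires_grad` on the wave parameters at the phase boundary, so the optimizer only updates the cache path afterwards.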

RESULTS

By the Numbers

Val PPL

64.9

-423.1 vs FDM — Full fine-tuning (487)

Val PPL

62.79

-2.13 vs FDM — Freeze-Scan (64.9)

MQAR accuracy

0.966

+0.360 over Transformer

Decode Memory (MB)

867

-3,380 MB vs Transformer at N=8,192

On WikiText-103 and MQAR, Fan Duality Model (FDM) trades some language modeling PPL for dramatically better associative recall and O(1) decode memory. These results show Fan Duality Model (FDM) can reach 0.966 MQAR accuracy while keeping decode memory fixed at 867 MB across 128–8,192 token prompts.


BENCHMARK

Table 3: MQAR accuracy (Easy: seq=64, 8 KV pairs)

Accuracy on the Multi-Query Associative Recall (MQAR) Easy setting.

BENCHMARK

Table 1: WikiText-103 validation perplexity

Validation perplexity on WikiText-103 for Fan Duality Model variants and Transformer.

KEY INSIGHT

The Counterintuitive Finding

Fan Duality Model (FDM) with pure scan (K=0) scores only 0.011 MQAR accuracy, while adding just K=16 cache slots jumps to 0.966.

This is surprising because linear recurrent models are often assumed to store long-range information in their hidden state, yet Fan Duality Model (FDM) shows that without a tiny particle cache, associative recall nearly collapses.

WHY IT MATTERS

What this unlocks for the field

Fan Duality Model (FDM) unlocks fixed 867 MB decode memory and stable decode speed across 128–8,192 tokens while still achieving strong associative recall.

Builders can now design long context systems where memory cost is O(1) in sequence length, yet Fan Duality Model (FDM) still retrieves specific tokens via a tiny learned cache and holographic decoding.


