TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Authors: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

2025

TL;DR

TurboQuant uses random rotations plus optimal scalar Lloyd–Max codebooks and a QJL residual stage to reach MSE within ≈2.7× of Shannon’s lower bound while keeping inner-product estimates unbiased.



THE PROBLEM

Vector quantizers miss Shannon-optimal distortion rates at low bit-widths

Existing vector quantization methods fail to achieve optimal distortion rates across bit-widths and dimensions, especially for MSE and inner product distortion.

This hurts online tasks like KV cache quantization and nearest neighbor search, where suboptimal distortion translates directly into higher latency, larger memory footprints, and degraded quality.

HOW IT WORKS

TurboQuant: Random rotations plus scalar Lloyd-Max and QJL residuals

TurboQuant combines a random rotation matrix Π with per-coordinate Lloyd-Max scalar quantizers to build two data-oblivious schemes: MSE Optimized TurboQuant and Inner Product TurboQuant.

You can think of TurboQuant as shuffling data into an isotropic frame, compressing each coordinate with a finely tuned ruler, and then appending a 1-bit sketch, like a checksum, for inner products.

This design lets TurboQuant reach near-Shannon distortion for MSE, while the QJL residual stage keeps inner-product estimates unbiased, something a naive scalar quantizer cannot do.
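The 1-bit stage can be made concrete. Below is a minimal, illustrative rendition of a QJL-style sign sketch, not the paper's implementation; the variable names and the parameter m are our choices. A Gaussian matrix S is applied to x, only the signs (1 bit per row) plus the scalar norm ‖x‖ are stored, and rescaling by √(π/2) makes the inner-product estimate unbiased, since E[sign(⟨g,x⟩)·⟨g,y⟩] = √(2/π)·⟨x,y⟩/‖x‖ for Gaussian g.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000          # dimension, number of 1-bit measurements

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Sketch of x: store only sign(Sx) (1 bit per row) plus the scalar ||x||.
S = rng.standard_normal((m, d))
bits = np.sign(S @ x)

# Unbiased estimator: E[sign(<g,x>) * <g,y>] = sqrt(2/pi) * <x,y>/||x||,
# so rescaling by ||x|| * sqrt(pi/2) recovers <x,y> in expectation.
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) * np.mean(bits * (S @ y))

rel_err = abs(est - x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(rel_err)
```

With m this large the relative error is small; in the paper the sketch is applied only to the low-energy residual, so far fewer bits suffice.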

DIAGRAM

Online quantization and dequantization flow in TurboQuant

This diagram shows how TurboQuant processes a vector through online quantization and dequantization for both MSE and inner-product objectives.

DIAGRAM

Evaluation pipeline for KV cache and nearest neighbor experiments

This diagram shows how TurboQuant is evaluated on empirical distortion, KV cache compression, and nearest neighbor search.

PROCESS

How TurboQuant Handles a Quantization Session

  1. MSE Optimal TurboQuant

    TurboQuant applies a random rotation matrix Π to map vectors onto the unit sphere S^(d−1), then quantizes each coordinate with a Lloyd-Max quantizer to minimize MSE.

  2. Inner Product TurboQuant

    TurboQuant runs MSE Optimal TurboQuant at bit-width b−1, computes the quantization residual, and hands that residual to a 1-bit stage for unbiased inner-product estimation.

  3. QJL: 1-bit inner product quantization

    TurboQuant feeds the residual into QJL, using a Gaussian matrix S and sign sketch to get a 1-bit unbiased inner-product estimator.

  4. Lower Bound and Empirical Validation

    TurboQuant compares its distortion against Shannon Lower Bound and Yao-based lower bounds, then validates on KV cache and nearest neighbor tasks.
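The per-coordinate Lloyd-Max codebook in steps 1–2 can be approximated empirically by running Lloyd's algorithm on standard-normal samples. This is a minimal sketch under the assumption that post-rotation coordinates are near-Gaussian; it is not the paper's code, and the sample count and iteration budget are our choices:

```python
import numpy as np

def lloyd_max_gaussian(b, n=200_000, iters=50, seed=0):
    """Fit a 2^b-level scalar quantizer to N(0,1) samples with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(n))
    k = 2 ** b
    # Initialize centroids at evenly spaced empirical quantiles.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Optimal cell boundaries are midpoints between adjacent centroids.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(edges, x)
        # Optimal centroids are the conditional means of their cells.
        centroids = np.array([x[cells == j].mean() for j in range(k)])
    edges = (centroids[:-1] + centroids[1:]) / 2
    mse = np.mean((x - centroids[np.searchsorted(edges, x)]) ** 2)
    return centroids, mse

for b in (1, 2, 3, 4):
    _, mse = lloyd_max_gaussian(b)
    print(b, round(mse, 4))   # approaches the classical values 0.363, 0.118, 0.035, 0.0095
```

The converged distortions match the concrete per-bit-width numbers quoted in the contributions below.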

KEY CONTRIBUTIONS

Key Contributions

  • MSE Optimized TurboQuant

    TurboQuant designs an MSE Optimal TurboQuant that achieves D_mse ≤ (√3·π/2)·4^(−b), with concrete distortions 0.36, 0.117, 0.03, and 0.009 at b = 1, 2, 3, 4, using a random rotation matrix Π and per-coordinate Lloyd-Max quantizers.

  • Inner Product TurboQuant

    TurboQuant introduces Inner Product TurboQuant, a two-stage scheme with QJL that is unbiased and satisfies D_prod ≤ (√3·π/2)·∥y∥₂²/(d·4^b), with distortion 1.57/d at b = 1.

  • Information-theoretic Lower Bounds

    TurboQuant proves lower bounds D_mse ≥ 4^(−b) and D_prod ≥ ∥y∥₂²/(d·4^b), showing its distortion sits within ≈2.7× of Shannon's limit, with an even smaller gap at low bit-widths.
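The quoted ≈2.7× gap can be sanity-checked arithmetically against the 4^(−b) lower bound; the limiting constant π·√3/2 ≈ 2.72 is the classical Panter-Dite high-rate distortion constant for a Gaussian source.

```python
import math

# Quoted MSE distortions per bit-width vs. the 4**-b lower bound.
distortions = {1: 0.36, 2: 0.117, 3: 0.03, 4: 0.009}
ratios = {b: d / 4 ** -b for b, d in distortions.items()}
for b, r in ratios.items():
    print(b, round(r, 2))

# The gap grows with b toward the high-rate constant pi*sqrt(3)/2 ~ 2.72.
print(round(math.pi * math.sqrt(3) / 2, 2))
```

At b = 1 the gap is only ≈1.4×, which is what the claim about tighter behavior at low bit-widths refers to.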

RESULTS

By the Numbers

Mean squared error Dmse

0.009

at b=4 vs Shannon lower bound 4^(−4) ≈ 0.0039

Inner product error Dprod

1.57/d

at b=1 vs lower bound 1/(4d)

KV cache compression

3.5 bits per channel

absolute quality neutrality vs full cache at 16 bits

Nearest neighbor recall

1@k near 1.0

with indexing time essentially zero vs PQ and RabitQ

TurboQuant is evaluated on DBpedia OpenAI3 embeddings, LongBench-E, and needle-in-a-haystack tests, showing near-lower-bound distortion and 4.5×+ KV compression without quality loss.


BENCHMARK

LongBench-V1 average scores for Llama-3.1-8B-Instruct with KV compression

Average LongBench-V1 score across SingleQA, MultiQA, Summarization, Few shot, Synthetic, and Code for different KV cache schemes.

KEY INSIGHT

The Counterintuitive Finding

TurboQuant achieves absolute quality neutrality on LongBench-V1 at just 3.5 bits per channel, matching a full 16-bit KV cache average score of 50.06.

This is surprising because conventional wisdom suggests aggressive KV cache quantization should noticeably degrade long-context reasoning, yet TurboQuant preserves performance even at over 4.5× compression.

WHY IT MATTERS

What this unlocks for the field

TurboQuant makes it practical to deploy long-context LLMs and high-dimensional vector databases with near-Shannon-optimal compression and unbiased inner-product queries.

Builders can now quantize KV caches and embedding indices online, at 2.5–3.5 bits per channel, while keeping recall and generation quality indistinguishable from full precision.
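For scale, here is a back-of-envelope estimate of what 3.5 bits per channel buys, assuming a Llama-3.1-8B-style cache layout (32 layers, 8 grouped-query KV heads, head dimension 128, a 128K-token context); these config numbers are our assumption for illustration, not from the paper.

```python
# Back-of-envelope KV cache size under an assumed Llama-3.1-8B-style config:
# 32 layers, 8 grouped-query KV heads, head dimension 128 (illustrative).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 128_000

values_per_token = 2 * layers * kv_heads * head_dim   # keys + values
fp16_gb = values_per_token * tokens * 16 / 8 / 1e9    # 16-bit baseline
q35_gb = values_per_token * tokens * 3.5 / 8 / 1e9    # 3.5-bit TurboQuant setting

print(round(fp16_gb, 2), round(q35_gb, 2), round(fp16_gb / q35_gb, 2))
```

The ratio 16/3.5 ≈ 4.57 is the "over 4.5×" compression quoted above; the absolute savings (roughly 17 GB down to under 4 GB under these assumptions) is what makes long contexts fit on a single accelerator.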


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

· 2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026 · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
