TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Authors: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

2025

TL;DR

TurboQuant uses random rotations plus optimal scalar Lloyd–Max codebooks and a QJL residual stage to reach MSE within ≈2.7× of Shannon’s lower bound while keeping inner-product estimates unbiased.



THE PROBLEM

Vector quantizers miss Shannon-optimal distortion rates at low bit-widths

Existing vector quantization methods fail to achieve optimal distortion rates across bit-widths and dimensions, especially for MSE and inner product distortion.

This hurts online tasks like KV cache quantization and nearest neighbor search, where suboptimal distortion translates directly into higher latency, larger memory footprints, and degraded quality.

HOW IT WORKS

TurboQuant: Random rotations plus scalar Lloyd-Max and QJL residuals

TurboQuant combines a random rotation matrix Π with per-coordinate Lloyd-Max scalar quantizers to build two data-oblivious schemes: MSE Optimized TurboQuant and Inner Product TurboQuant.

You can think of TurboQuant as shuffling data into an isotropic frame, compressing each coordinate with a finely tuned ruler, and then appending a 1-bit sketch, like a checksum, for inner products.

This design lets TurboQuant reach near-Shannon distortion for MSE, while the QJL residual stage keeps inner-product estimates unbiased, something a naive scalar quantizer cannot do.
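The 1-bit stage can be made concrete. Below is a minimal, illustrative rendition of a QJL-style sign sketch, not the paper's implementation; the variable names and the parameter m are our choices. A Gaussian matrix S is applied to x, only the signs (1 bit per row) plus the scalar norm ‖x‖ are stored, and rescaling by √(π/2) makes the inner-product estimate unbiased, since E[sign(⟨g,x⟩)·⟨g,y⟩] = √(2/π)·⟨x,y⟩/‖x‖ for Gaussian g.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000          # dimension, number of 1-bit measurements

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Sketch of x: store only sign(Sx) (1 bit per row) plus the scalar ||x||.
S = rng.standard_normal((m, d))
bits = np.sign(S @ x)

# Unbiased estimator: E[sign(<g,x>) * <g,y>] = sqrt(2/pi) * <x,y>/||x||,
# so rescaling by ||x|| * sqrt(pi/2) recovers <x,y> in expectation.
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) * np.mean(bits * (S @ y))

rel_err = abs(est - x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(rel_err)
```

With m this large the relative error is small; in the paper the sketch is applied only to the low-energy residual, so far fewer bits suffice.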

DIAGRAM

Online quantization and dequantization flow in TurboQuant

This diagram shows how TurboQuant processes a vector through online quantization and dequantization for both MSE and inner-product objectives.

DIAGRAM

Evaluation pipeline for KV cache and nearest neighbor experiments

This diagram shows how TurboQuant is evaluated on empirical distortion, KV cache compression, and nearest neighbor search.

PROCESS

How TurboQuant Handles a Quantization Session

  1. MSE Optimal TurboQuant

    TurboQuant applies a random rotation matrix Π to map vectors onto the unit sphere S^(d−1), then quantizes each coordinate with a Lloyd-Max quantizer to minimize MSE.

  2. Inner Product TurboQuant

    TurboQuant runs MSE Optimal TurboQuant at bit-width b−1, computes the quantization residual, and hands that residual to a 1-bit stage for unbiased inner-product estimation.

  3. QJL: 1-bit inner product quantization

    TurboQuant feeds the residual into QJL, using a Gaussian matrix S and sign sketch to get a 1-bit unbiased inner-product estimator.

  4. Lower Bound and Empirical Validation

    TurboQuant compares its distortion against Shannon Lower Bound and Yao-based lower bounds, then validates on KV cache and nearest neighbor tasks.
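The per-coordinate Lloyd-Max codebook in steps 1–2 can be approximated empirically by running Lloyd's algorithm on standard-normal samples. This is a minimal sketch under the assumption that post-rotation coordinates are near-Gaussian; it is not the paper's code, and the sample count and iteration budget are our choices:

```python
import numpy as np

def lloyd_max_gaussian(b, n=200_000, iters=50, seed=0):
    """Fit a 2^b-level scalar quantizer to N(0,1) samples with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(n))
    k = 2 ** b
    # Initialize centroids at evenly spaced empirical quantiles.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Optimal cell boundaries are midpoints between adjacent centroids.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(edges, x)
        # Optimal centroids are the conditional means of their cells.
        centroids = np.array([x[cells == j].mean() for j in range(k)])
    edges = (centroids[:-1] + centroids[1:]) / 2
    mse = np.mean((x - centroids[np.searchsorted(edges, x)]) ** 2)
    return centroids, mse

for b in (1, 2, 3, 4):
    _, mse = lloyd_max_gaussian(b)
    print(b, round(mse, 4))   # approaches the classical values 0.363, 0.118, 0.035, 0.0095
```

The converged distortions match the concrete per-bit-width numbers quoted in the contributions below.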

KEY CONTRIBUTIONS

Key Contributions

  • MSE Optimized TurboQuant

    TurboQuant designs an MSE Optimal TurboQuant that achieves D_mse ≤ (√3·π/2)·4^(−b), with concrete distortions 0.36, 0.117, 0.03, and 0.009 at b = 1, 2, 3, 4, using a random rotation matrix Π and per-coordinate Lloyd-Max quantizers.

  • Inner Product TurboQuant

    TurboQuant introduces Inner Product TurboQuant, a two-stage scheme with QJL that is unbiased and satisfies D_prod ≤ (√3·π/2)·∥y∥₂²/(d·4^b), with distortion 1.57/d at b = 1.

  • Information-theoretic Lower Bounds

    TurboQuant proves lower bounds D_mse ≥ 4^(−b) and D_prod ≥ ∥y∥₂²/(d·4^b), showing its distortion sits within ≈2.7× of Shannon's limit, with an even smaller gap at low bit-widths.
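The quoted ≈2.7× gap can be sanity-checked arithmetically against the 4^(−b) lower bound; the limiting constant π·√3/2 ≈ 2.72 is the classical Panter-Dite high-rate distortion constant for a Gaussian source.

```python
import math

# Quoted MSE distortions per bit-width vs. the 4**-b lower bound.
distortions = {1: 0.36, 2: 0.117, 3: 0.03, 4: 0.009}
ratios = {b: d / 4 ** -b for b, d in distortions.items()}
for b, r in ratios.items():
    print(b, round(r, 2))

# The gap grows with b toward the high-rate constant pi*sqrt(3)/2 ~ 2.72.
print(round(math.pi * math.sqrt(3) / 2, 2))
```

At b = 1 the gap is only ≈1.4×, which is what the claim about tighter behavior at low bit-widths refers to.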

RESULTS

By the Numbers

Mean squared error Dmse

0.009

at b=4 vs Shannon lower bound 4^(−4) ≈ 0.0039

Inner product error Dprod

1.57/d

at b=1 vs lower bound 1/(4d)

KV cache compression

3.5 bits per channel

absolute quality neutrality vs full cache at 16 bits

Nearest neighbor recall

1@k near 1.0

with indexing time essentially zero vs PQ and RabitQ

TurboQuant is evaluated on DBpedia OpenAI3 embeddings, LongBench-E, and needle-in-a-haystack tests, showing near-lower-bound distortion and 4.5×+ KV compression without quality loss.


BENCHMARK

LongBench-V1 average scores for Llama-3.1-8B-Instruct with KV compression

Average LongBench-V1 score across SingleQA, MultiQA, Summarization, Few shot, Synthetic, and Code for different KV cache schemes.

KEY INSIGHT

The Counterintuitive Finding

TurboQuant achieves absolute quality neutrality on LongBench-V1 at just 3.5 bits per channel, matching a full 16-bit KV cache average score of 50.06.

This is surprising because conventional wisdom suggests aggressive KV cache quantization should noticeably degrade long-context reasoning, yet TurboQuant preserves performance even at over 4.5× compression.

WHY IT MATTERS

What this unlocks for the field

TurboQuant makes it practical to deploy long-context LLMs and high-dimensional vector databases with near-Shannon-optimal compression and unbiased inner-product queries.

Builders can now quantize KV caches and embedding indices online, at 2.5–3.5 bits per channel, while keeping recall and generation quality indistinguishable from full precision.
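For scale, here is a back-of-envelope estimate of what 3.5 bits per channel buys, assuming a Llama-3.1-8B-style cache layout (32 layers, 8 grouped-query KV heads, head dimension 128, a 128K-token context); these config numbers are our assumption for illustration, not from the paper.

```python
# Back-of-envelope KV cache size under an assumed Llama-3.1-8B-style config:
# 32 layers, 8 grouped-query KV heads, head dimension 128 (illustrative).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 128_000

values_per_token = 2 * layers * kv_heads * head_dim   # keys + values
fp16_gb = values_per_token * tokens * 16 / 8 / 1e9    # 16-bit baseline
q35_gb = values_per_token * tokens * 3.5 / 8 / 1e9    # 3.5-bit TurboQuant setting

print(round(fp16_gb, 2), round(q35_gb, 2), round(fp16_gb / q35_gb, 2))
```

The ratio 16/3.5 ≈ 4.57 is the "over 4.5×" compression quoted above; the absolute savings (roughly 17 GB down to under 4 GB under these assumptions) is what makes long contexts fit on a single accelerator.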


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

· 2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026 · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
