Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

Author: Andrey Pustovit

2026

TL;DR

Knowledge Packs uses KV–Prefix Equivalence to swap RAG text for pre-computed KV caches, matching HotpotQA accuracy while saving up to 95% of retrieval tokens.

THE PROBLEM

RAG agents pay a linear token tax for repeated search (700+ tokens for 5 lookups)

RAG inserts retrieved passages directly into prompts, so 5 searches can consume 700+ tokens on facts alone, exhausting context and budget.

Agentic workflows that repeatedly search knowledge bases hit context ceilings and rack up API costs, degrading multi-hop reasoning and limiting long-horizon tool-using agents.
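To see the linear tax concretely, here is a toy cost model (our illustration, not the paper's accounting), assuming roughly 140 prompt tokens per retrieved passage, consistent with the 700+ figure above:

```python
# Toy cost model: the ~140 tokens-per-passage figure is our assumption.
TOKENS_PER_PASSAGE = 140

def rag_prompt_tokens(n_searches: int) -> int:
    # RAG re-inserts every retrieved passage as text: cost grows linearly.
    return n_searches * TOKENS_PER_PASSAGE

def knowledge_pack_prompt_tokens(n_searches: int) -> int:
    # Facts arrive as pre-computed KV states, so they add no prompt tokens.
    return 0

for n in (1, 5, 10):
    print(f"{n} searches: RAG={rag_prompt_tokens(n)} tokens, "
          f"KV={knowledge_pack_prompt_tokens(n)} tokens")
```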

HOW IT WORKS

Knowledge Packs via KV Cache Injection and KV–Prefix Equivalence

Knowledge Packs center on KV Cache Injection, KV–Prefix Equivalence, Banked Routing, and KV Composition to pre-compute factual KV prefixes and reuse them across queries.

You can think of Knowledge Packs as RAM snapshots: instead of retyping documents every time, you load a saved memory state before answering.

This KV-first design lets Knowledge Packs deliver knowledge and value-space steering that no plain context window can express, all at effectively zero token cost.
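Here is a minimal sketch of the core mechanism with Hugging Face Transformers, as we read it (chat-template handling, routing, and steering are omitted; the model name follows the paper's Qwen3-8B, but any causal LM exposes the same cache interface). It is an illustration of the idea, not the paper's code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # any Hugging Face causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").eval()

facts = "The Eiffel Tower is located in Paris, France.\n"
query = "Question: In which city is the Eiffel Tower?\nAnswer:"

# Write phase (offline, once): run the facts alone and keep the KV cache.
fact_ids = tok(facts, return_tensors="pt").input_ids
with torch.no_grad():
    pack = model(fact_ids, use_cache=True).past_key_values  # the "Knowledge Pack"

# Read phase (per query): feed only the query tokens on top of the cached prefix.
query_ids = tok(query, return_tensors="pt").input_ids
with torch.no_grad():
    kv_logits = model(query_ids, past_key_values=pack, use_cache=True).logits

# Reference: one joint forward pass over facts + query, i.e. what RAG computes
# after pasting the retrieved text into the prompt.
joint_ids = torch.cat([fact_ids, query_ids], dim=1)
with torch.no_grad():
    joint_logits = model(joint_ids).logits[:, -query_ids.shape[1]:]

# KV–Prefix Equivalence: both paths pick the same next tokens.
print((kv_logits.argmax(-1) == joint_logits.argmax(-1)).all().item())
```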

DIAGRAM

Inference Flow for Knowledge Packs: From Query to Zero-Token Knowledge Use

This diagram shows how Knowledge Packs process a query using pre-computed KV caches, chat templates, and value steering during inference.

DIAGRAM

Evaluation Pipeline for Knowledge Packs on HotpotQA and Accumulation Scaling

This diagram shows how Knowledge Packs are evaluated across HotpotQA, accumulation scaling, and value steering experiments.

PROCESS

How Knowledge Packs Handles a Query Session

  1. Write phase (offline, once)

    Knowledge Packs runs fact sentences through KV Cache Injection using the chat template, storing per-layer keys and values for later reuse.

  2. Banked Routing at query time

    Knowledge Packs embeds the query with BGE-large, selects a fact bank via Banked Routing, and locates the most relevant cached knowledge (see the routing sketch after this list).

  3. KV Cache Injection as prefix

    Knowledge Packs loads the selected KV cache as a prefix, so generation matches a hypothetical joint forward pass over the facts followed by the query (F ◦ q): this is KV–Prefix Equivalence.

  4. Dual-channel generation

    Knowledge Packs optionally applies mid-layer value steering deltas and then generates the response, combining knowledge delivery and behavioral control.
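The routing sketch below assumes a two-stage nearest-centroid scheme over BGE-large embeddings; the paper's exact routing rule may differ, and the bank contents here are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical bank layout: in the full system each fact maps to a stored
# KV cache on disk; here we route to the fact text only.
banks = {
    "landmarks": ["The Eiffel Tower is located in Paris, France.",
                  "The Colosseum is located in Rome, Italy."],
    "science":   ["Water boils at 100 degrees Celsius at sea level."],
}

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # BGE-large, per the paper

# Offline: embed every fact once and keep a mean-vector centroid per bank.
fact_embs = {b: encoder.encode(fs, normalize_embeddings=True)
             for b, fs in banks.items()}
centroids = {b: e.mean(axis=0) for b, e in fact_embs.items()}

def route(query: str) -> tuple[str, str]:
    q = encoder.encode(query, normalize_embeddings=True)
    # Stage 1: nearest bank centroid (cosine similarity on normalized vectors).
    bank = max(centroids, key=lambda b: float(np.dot(centroids[b], q)))
    # Stage 2: nearest fact within the selected bank.
    best = int((fact_embs[bank] @ q).argmax())
    return bank, banks[bank][best]

print(route("Which city is the Eiffel Tower in?"))
# -> ('landmarks', 'The Eiffel Tower is located in Paris, France.')
```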

KEY CONTRIBUTIONS

Key Contributions

  • KV–Prefix Equivalence proof and verification

    Knowledge Packs proves KV–Prefix Equivalence for causal transformers and verifies it with 0/700 divergences between KV-chat and RAG on HotpotQA for Qwen3-8B and Llama-3.1-8B.

  • Zero-token knowledge delivery with banked routing

    Knowledge Packs uses KV Cache Injection and Banked Routing to save up to 95% of tokens at 5 retrieval steps while scaling to 5,000 facts with 100% routing accuracy.

  • Value-space steering and dual-channel KV

    Knowledge Packs introduces mid-layer value steering via contrastive V-deltas, composing multiple directions and coexisting with knowledge delivery at α ≤ 0.7 without EM loss (a minimal sketch follows this list).
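Below is a sketch of how contrastive value steering could look at a single mid layer, assuming a Llama/Qwen-style module layout (model.model.layers[i].self_attn.v_proj). The layer index, pooling, and α here are our assumptions, and we steer via a forward hook at generation time, whereas the paper applies deltas to cached V states:

```python
import torch

def contrastive_v_delta(model, tok, pos: str, neg: str, layer: int) -> torch.Tensor:
    """Difference of mean value-projection activations for two contrastive prompts."""
    grabbed = {}
    def capture(_module, _inputs, output):
        grabbed["v"] = output.mean(dim=1)            # average over token positions
    handle = model.model.layers[layer].self_attn.v_proj.register_forward_hook(capture)
    pooled = []
    for text in (pos, neg):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
        pooled.append(grabbed["v"])
    handle.remove()
    return pooled[0] - pooled[1]                     # the V-delta direction

def install_v_steering(model, delta: torch.Tensor, layer: int, alpha: float = 0.5):
    """Add alpha * delta to the value stream at one mid layer during generation."""
    def steer(_module, _inputs, output):
        return output + alpha * delta                # broadcasts over positions
    return model.model.layers[layer].self_attn.v_proj.register_forward_hook(steer)

# Usage (hypothetical prompts and layer):
#   delta = contrastive_v_delta(model, tok, "Answer formally.", "Answer casually.", 16)
#   handle = install_v_steering(model, delta, layer=16, alpha=0.5)
#   ... generate ...; handle.remove() to turn the style knob off.
```

Because the hook simply adds a vector, several direction deltas can be summed before installation, mirroring the paper's composition of multiple steering directions.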

RESULTS

By the Numbers

Overall EM, Qwen3-8B: 65.2% (+36.8pp over Baseline)

Overall EM, Llama-3.1-8B: 61.5% (+32.0pp over Baseline)

Token savings at 5 searches: 704 tokens (95% fewer than RAG on Qwen3-8B)

Routing accuracy at 5,000 facts: 100% (4.2 MB storage with Banked Routing)

On HotpotQA, Knowledge Packs’ KV-chat matches RAG exactly at 65.2% EM on Qwen3-8B and 61.5% EM on Llama-3.1-8B while using zero retrieval tokens. Accumulation experiments show Knowledge Packs saving 693–704 tokens per query at 5 searches, demonstrating constant-cost knowledge reuse for agentic systems.

BENCHMARK

HotpotQA results for Qwen3-8B (Overall EM, N=500)

Exact match accuracy on HotpotQA for Qwen3-8B across Baseline, RAG, KV-chat, and KV-raw.

BENCHMARK

Accumulation scaling token cost on Qwen3-8B

Average tokens per query vs number of searches for RAG and Knowledge Packs KV on Qwen3-8B.

KEY INSIGHT

The Counterintuitive Finding

Knowledge Packs shows that KV-chat and RAG diverge in 0 of 700 paired HotpotQA generations, yielding byte-identical outputs despite completely different delivery mechanisms.

This is surprising because prior work claimed KV caches outperform RAG; Knowledge Packs reveals that such 5–10pp gaps often come solely from chat-template mis-formatting.
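A minimal illustration of the pitfall, assuming the facts are cached inside a user turn (the paper's exact prompt layout may differ): a cache built from bare text (KV-raw) tokenizes differently from one built through the model's chat template (KV-chat), and only the latter is prefix-identical to a RAG prompt.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
facts = "The Eiffel Tower is located in Paris, France."

# KV-raw: cache built from the bare fact string, outside the chat format.
raw_ids = tok(facts, return_tensors="pt").input_ids

# KV-chat: render the fact through the chat template first, so the cached
# prefix matches token-for-token what a RAG prompt would have produced.
chat_ids = tok.apply_chat_template(
    [{"role": "user", "content": facts}],
    add_generation_prompt=False,
    return_tensors="pt",
)
print(raw_ids.shape, chat_ids.shape)  # the token sequences differ, so the caches do too
```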

WHY IT MATTERS

What this unlocks for the field

Knowledge Packs makes it practical to treat knowledge as reusable KV snapshots, enabling long-horizon agents to accumulate thousands of facts without token blowup.

With value-space steering layered on top, builders can now jointly control knowledge and style through KV states that no text prompt could ever generate.

Related papers

RAG

Memory as Metabolism: A Design for Companion Knowledge Systems

Stefan Miteski · 2026

Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Instead of benchmark gains, Memory as Metabolism’s main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.

RAG

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du · 2026

Memory for Autonomous LLM Agents decomposes agent memory into a POMDP-grounded write–manage–read loop, a three-dimensional taxonomy, and five mechanism families spanning context compression, retrieval stores, reflection, hierarchical virtual context, and policy-learned management. Memory for Autonomous LLM Agents synthesizes results like Voyager’s 15.3× tech-tree speedup and MemoryArena’s 80%→45% drop to show that memory architecture often matters more than backbone choice.
