Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

Author: Andrey Pustovit

2026

TL;DR

Knowledge Packs uses KV–Prefix Equivalence to swap RAG text for pre-computed KV caches, matching HotpotQA accuracy while saving up to 95% of retrieval tokens.

THE PROBLEM

RAG agents pay a linear token tax for repeated search (700+ tokens for 5 lookups)

RAG inserts retrieved passages directly into prompts, so 5 searches can consume 700+ tokens on facts alone, exhausting context and budget.

Agentic workflows that repeatedly search knowledge bases hit context ceilings and rack up API costs, degrading multi-hop reasoning and limiting long-horizon tool-using agents.
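To see the linear tax concretely, here is a toy cost model (our illustration, not the paper's accounting), assuming roughly 140 prompt tokens per retrieved passage, consistent with the 700+ figure above:

```python
# Toy cost model: the ~140 tokens-per-passage figure is our assumption.
TOKENS_PER_PASSAGE = 140

def rag_prompt_tokens(n_searches: int) -> int:
    # RAG re-inserts every retrieved passage as text: cost grows linearly.
    return n_searches * TOKENS_PER_PASSAGE

def knowledge_pack_prompt_tokens(n_searches: int) -> int:
    # Facts arrive as pre-computed KV states, so they add no prompt tokens.
    return 0

for n in (1, 5, 10):
    print(f"{n} searches: RAG={rag_prompt_tokens(n)} tokens, "
          f"KV={knowledge_pack_prompt_tokens(n)} tokens")
```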

HOW IT WORKS

Knowledge Packs via KV Cache Injection and KV–Prefix Equivalence

Knowledge Packs center on KV Cache Injection, KV–Prefix Equivalence, Banked Routing, and KV Composition to pre-compute factual KV prefixes and reuse them across queries.

You can think of Knowledge Packs as RAM snapshots: instead of retyping documents every time, you load a saved memory state before answering.

This KV-first design lets Knowledge Packs deliver knowledge and value-space steering that no plain context window can express, all at effectively zero token cost.
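Here is a minimal sketch of the core mechanism with Hugging Face Transformers, as we read it (chat-template handling, routing, and steering are omitted; the model name follows the paper's Qwen3-8B, but any causal LM exposes the same cache interface). It is an illustration of the idea, not the paper's code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # any Hugging Face causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").eval()

facts = "The Eiffel Tower is located in Paris, France.\n"
query = "Question: In which city is the Eiffel Tower?\nAnswer:"

# Write phase (offline, once): run the facts alone and keep the KV cache.
fact_ids = tok(facts, return_tensors="pt").input_ids
with torch.no_grad():
    pack = model(fact_ids, use_cache=True).past_key_values  # the "Knowledge Pack"

# Read phase (per query): feed only the query tokens on top of the cached prefix.
query_ids = tok(query, return_tensors="pt").input_ids
with torch.no_grad():
    kv_logits = model(query_ids, past_key_values=pack, use_cache=True).logits

# Reference: one joint forward pass over facts + query, i.e. what RAG computes
# after pasting the retrieved text into the prompt.
joint_ids = torch.cat([fact_ids, query_ids], dim=1)
with torch.no_grad():
    joint_logits = model(joint_ids).logits[:, -query_ids.shape[1]:]

# KV–Prefix Equivalence: both paths pick the same next tokens.
print((kv_logits.argmax(-1) == joint_logits.argmax(-1)).all().item())
```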

DIAGRAM

Inference Flow for Knowledge Packs: From Query to Zero-Token Knowledge Use

This diagram shows how Knowledge Packs process a query using pre-computed KV caches, chat templates, and value steering during inference.

DIAGRAM

Evaluation Pipeline for Knowledge Packs on HotpotQA and Accumulation Scaling

This diagram shows how Knowledge Packs are evaluated across HotpotQA, accumulation scaling, and value steering experiments.

PROCESS

How Knowledge Packs Handles a Query Session

  1. Write phase (offline, once)

    Knowledge Packs runs fact sentences through KV Cache Injection using the chat template, storing per-layer keys and values for later reuse.

  2. Banked Routing at query time

    Knowledge Packs embeds the query with BGE-large, selects a fact bank via Banked Routing, and locates the most relevant cached knowledge (see the routing sketch after this list).

  3. KV Cache Injection as prefix

    Knowledge Packs loads the selected KV cache as a prefix, so generation matches a hypothetical joint forward pass over the facts followed by the query (F ◦ q): this is KV–Prefix Equivalence.

  4. Dual-channel generation

    Knowledge Packs optionally applies mid-layer value steering deltas and then generates the response, combining knowledge delivery and behavioral control.
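The routing sketch below assumes a two-stage nearest-centroid scheme over BGE-large embeddings; the paper's exact routing rule may differ, and the bank contents here are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical bank layout: in the full system each fact maps to a stored
# KV cache on disk; here we route to the fact text only.
banks = {
    "landmarks": ["The Eiffel Tower is located in Paris, France.",
                  "The Colosseum is located in Rome, Italy."],
    "science":   ["Water boils at 100 degrees Celsius at sea level."],
}

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # BGE-large, per the paper

# Offline: embed every fact once and keep a mean-vector centroid per bank.
fact_embs = {b: encoder.encode(fs, normalize_embeddings=True)
             for b, fs in banks.items()}
centroids = {b: e.mean(axis=0) for b, e in fact_embs.items()}

def route(query: str) -> tuple[str, str]:
    q = encoder.encode(query, normalize_embeddings=True)
    # Stage 1: nearest bank centroid (cosine similarity on normalized vectors).
    bank = max(centroids, key=lambda b: float(np.dot(centroids[b], q)))
    # Stage 2: nearest fact within the selected bank.
    best = int((fact_embs[bank] @ q).argmax())
    return bank, banks[bank][best]

print(route("Which city is the Eiffel Tower in?"))
# -> ('landmarks', 'The Eiffel Tower is located in Paris, France.')
```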

KEY CONTRIBUTIONS

Key Contributions

  • KV–Prefix Equivalence proof and verification

    Knowledge Packs proves KV–Prefix Equivalence for causal transformers and verifies it with 0/700 divergences between KV-chat and RAG on HotpotQA for Qwen3-8B and Llama-3.1-8B.

  • Zero-token knowledge delivery with banked routing

    Knowledge Packs uses KV Cache Injection and Banked Routing to save up to 95% of tokens at 5 retrieval steps while scaling to 5,000 facts with 100% routing accuracy.

  • Value-space steering and dual-channel KV

    Knowledge Packs introduces mid-layer value steering via contrastive V-deltas, composing multiple directions and coexisting with knowledge delivery at α ≤ 0.7 without EM loss (a minimal sketch follows this list).
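Below is a sketch of how contrastive value steering could look at a single mid layer, assuming a Llama/Qwen-style module layout (model.model.layers[i].self_attn.v_proj). The layer index, pooling, and α here are our assumptions, and we steer via a forward hook at generation time, whereas the paper applies deltas to cached V states:

```python
import torch

def contrastive_v_delta(model, tok, pos: str, neg: str, layer: int) -> torch.Tensor:
    """Difference of mean value-projection activations for two contrastive prompts."""
    grabbed = {}
    def capture(_module, _inputs, output):
        grabbed["v"] = output.mean(dim=1)            # average over token positions
    handle = model.model.layers[layer].self_attn.v_proj.register_forward_hook(capture)
    pooled = []
    for text in (pos, neg):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
        pooled.append(grabbed["v"])
    handle.remove()
    return pooled[0] - pooled[1]                     # the V-delta direction

def install_v_steering(model, delta: torch.Tensor, layer: int, alpha: float = 0.5):
    """Add alpha * delta to the value stream at one mid layer during generation."""
    def steer(_module, _inputs, output):
        return output + alpha * delta                # broadcasts over positions
    return model.model.layers[layer].self_attn.v_proj.register_forward_hook(steer)

# Usage (hypothetical prompts and layer):
#   delta = contrastive_v_delta(model, tok, "Answer formally.", "Answer casually.", 16)
#   handle = install_v_steering(model, delta, layer=16, alpha=0.5)
#   ... generate ...; handle.remove() to turn the style knob off.
```

Because the hook simply adds a vector, several direction deltas can be summed before installation, mirroring the paper's composition of multiple steering directions.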

RESULTS

By the Numbers

Overall EM, Qwen3-8B: 65.2% (+36.8pp over Baseline)

Overall EM, Llama-3.1-8B: 61.5% (+32.0pp over Baseline)

Token savings at 5 searches: 704 tokens (95% fewer than RAG on Qwen3-8B)

Routing accuracy at 5,000 facts: 100% (4.2 MB storage with Banked Routing)

On HotpotQA, Knowledge Packs’ KV-chat matches RAG exactly at 65.2% EM on Qwen3-8B and 61.5% EM on Llama-3.1-8B while using zero retrieval tokens. Accumulation experiments show Knowledge Packs saving 693–704 tokens per query at 5 searches, demonstrating constant-cost knowledge reuse for agentic systems.

BENCHMARK

HotpotQA results for Qwen3-8B (Overall EM, N=500)

Exact match accuracy on HotpotQA for Qwen3-8B across Baseline, RAG, KV-chat, and KV-raw.

BENCHMARK

Accumulation scaling token cost on Qwen3-8B

Average tokens per query vs number of searches for RAG and Knowledge Packs KV on Qwen3-8B.

KEY INSIGHT

The Counterintuitive Finding

Knowledge Packs shows that KV-chat and RAG diverge in 0 of 700 paired HotpotQA generations, yielding byte-identical outputs despite completely different delivery mechanisms.

This is surprising because prior work claimed KV caches outperform RAG; Knowledge Packs reveals that such 5–10pp gaps often come solely from chat-template mis-formatting.
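A minimal illustration of the pitfall, assuming the facts are cached inside a user turn (the paper's exact prompt layout may differ): a cache built from bare text (KV-raw) tokenizes differently from one built through the model's chat template (KV-chat), and only the latter is prefix-identical to a RAG prompt.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
facts = "The Eiffel Tower is located in Paris, France."

# KV-raw: cache built from the bare fact string, outside the chat format.
raw_ids = tok(facts, return_tensors="pt").input_ids

# KV-chat: render the fact through the chat template first, so the cached
# prefix matches token-for-token what a RAG prompt would have produced.
chat_ids = tok.apply_chat_template(
    [{"role": "user", "content": facts}],
    add_generation_prompt=False,
    return_tensors="pt",
)
print(raw_ids.shape, chat_ids.shape)  # the token sequences differ, so the caches do too
```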

WHY IT MATTERS

What this unlocks for the field

Knowledge Packs makes it practical to treat knowledge as reusable KV snapshots, enabling long-horizon agents to accumulate thousands of facts without token blowup.

With value-space steering layered on top, builders can now jointly control knowledge and style through KV states that no text prompt could ever generate.

Related papers

RAG

Memory as Metabolism: A Design for Companion Knowledge Systems

Stefan Miteski · 2026

Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Instead of benchmark gains, Memory as Metabolism’s main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.

RAG

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du · 2026

Memory for Autonomous LLM Agents decomposes agent memory into a POMDP-grounded write–manage–read loop, a three-dimensional taxonomy, and five mechanism families spanning context compression, retrieval stores, reflection, hierarchical virtual context, and policy-learned management. Memory for Autonomous LLM Agents synthesizes results like Voyager’s 15.3× tech-tree speedup and MemoryArena’s 80%→45% drop to show that memory architecture often matters more than backbone choice.
