A Control Architecture for Training-Free Memory Use

Authors: Yanzhen Lu, Muchen Jiang, Zhicheng Qian, Xingyu Zhou

2026

TL;DR

TAG uses uncertainty-gated routing, guarded acceptance, and evidence-based retirement to control training-free prompt memory, adding +7.0 and +7.7 accuracy points on SVAMP and ASDiv.



THE PROBLEM

Prompt memory helps only in the right state: applicability control as the bottleneck

TAG is motivated by the observation that arithmetic accuracy on SVAMP and ASDiv rises only when prompt memory is applied in the right states, not merely exposed in the prompt. Under a locked protocol, a compute-matched retry produces no lift on these strict reasoning benchmarks, showing that naive second passes waste compute.

Without applicability control, prompt-injected memory can be irrelevant or harmful in many states, even when retrieval is strong. This breaks training-free context optimization pipelines and leads to both quality degradation and unnecessary calls per query.

HOW IT WORKS

TAG — uncertainty-gated routing, guarded acceptance, and memory governance

TAG’s core mechanism combines uncertainty-based routing, confidence-based selective acceptance, bank selection across rule and exemplar memory, and evidence-based retirement inside a unified control loop. TAG treats applicability as a latent intervention value, using confidence and structural guards to decide when memory-conditioned actions should override the baseline.

You can think of TAG like a CPU with a speculative execution unit and a cache: low-confidence steps trigger a speculative memory-assisted pass, and only beneficial speculations are committed. The memory banks act like a card catalog of rules and exemplars, with governance retiring entries that repeatedly hurt performance.

This control mechanism lets TAG exploit prompt memory far beyond what a plain context window or an always-retrieve RAG pipeline can, because TAG explicitly controls when to query memory, when to trust it, and which bank to expose, instead of blindly injecting retrieved content.
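As a concrete illustration, here is a minimal Python sketch of one TAG-style step. The threshold value, acceptance margin, decoder stand-ins, and guard check are all illustrative assumptions, not the paper's actual implementation:

```python
TAU = 0.6      # routing threshold τ (illustrative value)
MARGIN = 0.05  # acceptance margin m (illustrative value)

def baseline_decode(x):
    """Stand-in first pass: returns (action, confidence).
    Toy rule: even inputs are 'easy' (high confidence)."""
    return ("even", 0.9) if x % 2 == 0 else ("odd-guess", 0.3)

def memory_pass(x, entry):
    """Stand-in memory-conditioned second pass using a retrieved entry."""
    return (entry["action"], entry["conf"])

def guards_ok(action):
    """Structural guards g_t: here, just a non-empty well-formed action."""
    return isinstance(action, str) and len(action) > 0

def tag_step(x, entry):
    a0, c0 = baseline_decode(x)
    if c0 >= TAU:                       # confident → no memory query at all
        return a0, "no-query"
    a1, c1 = memory_pass(x, entry)
    # Guarded acceptance: commit the memory action only if it clears the
    # confidence margin AND passes the structural guards; else roll back.
    if c1 >= c0 + MARGIN and guards_ok(a1):
        return a1, "accepted"
    return a0, "rollback"
```

The key design point mirrored here is that rollback makes the memory pass strictly opt-in: the baseline answer is never lost, so a bad retrieval can cost compute but not accuracy.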

DIAGRAM

TAG Inference Flow for a Single Step

This diagram shows how TAG routes a single step through baseline decoding, uncertainty-based routing, memory-conditioned second pass, and guarded acceptance with rollback.

DIAGRAM

TAG Evaluation and Governance Pipeline

This diagram shows how TAG runs fit/dev governance, freezes thresholds and banks, and then evaluates on test benchmarks under a locked training-free protocol.
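The freeze-then-evaluate discipline can be sketched in a few lines. Here `toy_score`, the threshold grid, and the split format are hypothetical illustrations, not the paper's governance code:

```python
def toy_score(split, tau):
    """Illustrative accuracy: each item is
    (confidence, correct_without_memory, correct_with_memory)."""
    hits = 0
    for c, base_ok, mem_ok in split:
        hits += mem_ok if c < tau else base_ok  # route on c < tau
    return hits / len(split)

def grid_fit(dev, score, taus=(0.3, 0.5, 0.7)):
    """Fit phase: select the routing threshold tau on the dev split only."""
    return max(taus, key=lambda t: score(dev, t))

def locked_eval(test, score, frozen_tau):
    """Locked evaluation: the frozen tau is never re-tuned on test."""
    return score(test, frozen_tau)
```

The separation matters: all threshold and bank decisions happen before test data is touched, which is what makes the compute-matched comparison in the paper a locked protocol rather than post-hoc tuning.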

PROCESS

How TAG Handles a Reasoning Example

  1. Unified Control Loop

    TAG starts with the unified control loop, decoding a baseline action and confidence before any uncertainty-based routing or bank selection across rule and exemplar memory happens.

  2. Uncertainty-Based Routing

    TAG applies uncertainty-based routing by comparing the baseline confidence c_t to a threshold τ and querying memory only when c_t < τ, which bounds the number of calls per query.

  3. Guarded Acceptance with Rollback

    TAG runs a memory-conditioned second pass and uses confidence-based selective acceptance with structural guards g_t, rolling back to the baseline when the margin m or guards fail.

  4. Evidence-Based Retirement

    During fit/dev, TAG logs utility evidence for each memory entry and uses evidence-based retirement to remove entries whose upper confidence bound on mean utility falls below zero.
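Step 4 above can be sketched as follows. The UCB formula, z value, and minimum-evidence cutoff are assumptions for illustration, not the paper's exact statistics:

```python
import math

def ucb(utilities, z=1.96):
    """Upper confidence bound on the mean utility of one memory entry."""
    n = len(utilities)
    mean = sum(utilities) / n
    var = sum((u - mean) ** 2 for u in utilities) / max(n - 1, 1)
    return mean + z * math.sqrt(var / n)

def retire(bank, min_obs=5):
    """Split a bank into kept vs retired entries.
    Retire only when there is enough evidence AND the UCB on mean
    utility falls below zero; sparse entries stay pending more data."""
    kept, retired = {}, {}
    for name, utils in bank.items():
        if len(utils) >= min_obs and ucb(utils) < 0:
            retired[name] = utils
        else:
            kept[name] = utils
    return kept, retired
```

Using the upper bound rather than the raw mean is the conservative choice: an entry is only retired once even its optimistic utility estimate is negative.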

KEY CONTRIBUTIONS

Key Contributions

  • Applicability-Control Formulation

    TAG formulates prompt-memory applicability as a control problem under locked, compute-matched comparison, separating retrieval exposure from uncertainty-based routing and confidence-based selective acceptance decisions.

  • Training-Free Control Architecture

    TAG introduces a training-free control architecture that combines uncertainty-based routing, guarded acceptance with rollback, bank selection across rule and exemplar memory, and evidence-based retirement into a single intervention policy.

  • Mechanism Evidence and Benchmarks

    TAG shows +7.0 accuracy on SVAMP and +7.7 on ASDiv over baseline and provides mechanism evidence that confidence separates helpful from harmful rule-bank interventions under fixed retrieval.

RESULTS

By the Numbers

SVAMP accuracy

81.0 %

+7.0 over Baseline

ASDiv accuracy

85.2 %

+7.7 over Baseline

MultiArith accuracy

89.1 %

+1.3 over Baseline

WebShop ∆acc

+0.0144

Help−Hurt = +13 with 0.55 fewer calls per query

On SVAMP, ASDiv, and MultiArith, which test arithmetic reasoning, TAG improves accuracy from 74.0 to 81.0, from 77.5 to 85.2, and from 87.8 to 89.1 respectively. These headline results show that TAG’s applicability control, not just memory exposure or retry, drives the gains under a locked training-free protocol.


BENCHMARK

Reasoning results (multi-seed means, %)

Accuracy on SVAMP with compute-matched baselines and TAG.

KEY INSIGHT

The Counterintuitive Finding

TAG shows that a compute-matched Retry baseline remains flat at 74.0 on SVAMP and 77.5 on ASDiv, while TAG gains +7.0 and +7.7 points. This means that simply adding a second pass without memory or control does not improve strict arithmetic benchmarks.

This is counterintuitive because many practitioners assume more decoding compute or retries always help. TAG breaks that assumption by showing that applicability control, not raw compute or retrieval volume, is what actually unlocks memory gains.

WHY IT MATTERS

What this unlocks for the field

TAG unlocks a concrete way to use training-free prompt memory safely and effectively by deciding when to query, trust, and retire memory entries. Developers can now deploy uncertainty-based routing and guarded acceptance with rollback on top of existing LLMs without weight updates.

With TAG, builders can turn experience libraries and rule banks into controlled interventions rather than uncontrolled context dumps. This makes it practical to add training-free experience libraries to arithmetic solvers, QA systems, and agents while keeping compute budgets and error rates under control.
