GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov et al.

2026

TL;DR

GradMem uses test-time gradient descent on prefix memory tokens to write context, storing up to 96 key–value pairs with 88.4% exact match using only 8 memory vectors.


THE PROBLEM

Long-context LMs Need Compact Reusable Memory, Not Huge KV-caches

Many large language model applications require conditioning on long contexts, but Transformers typically rely on a large per-layer KV-cache of past activations, which incurs substantial memory overhead.

This KV-cache strategy makes it hard to reuse information across queries and does not naturally produce a portable, compact memory representation, limiting efficient context removal and multi-query reuse.
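The footprint gap is easy to quantify with back-of-envelope arithmetic. A rough sketch, assuming GPT-2-small-like dimensions (12 layers, d_model = 768) and a 4096-token context; these numbers are illustrative, not taken from the paper:

```python
def kv_cache_floats(num_layers: int, seq_len: int, d_model: int) -> int:
    # Each layer caches one key and one value vector per token.
    return 2 * num_layers * seq_len * d_model

def prefix_memory_floats(num_memory_vectors: int, d_model: int) -> int:
    # A GradMem-style memory is just a handful of d_model-sized vectors.
    return num_memory_vectors * d_model

layers, d = 12, 768  # GPT-2-small-like dims (assumption, not from the paper)
cache = kv_cache_floats(layers, seq_len=4096, d_model=d)   # 75,497,472 floats
memory = prefix_memory_floats(8, d)                        # 6,144 floats
print(f"KV-cache: {cache:,} floats; memory: {memory:,} floats "
      f"({cache // memory:,}x larger)")
```

Under these assumptions the per-query KV-cache is four orders of magnitude larger than an 8-vector memory, which is the asymmetry GradMem exploits.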

HOW IT WORKS

GradMem: Gradient-based Context Memorization

GradMem combines a WRITE phase, a READ phase, a context encoder Eθ, a self-supervised WRITE objective Lwrite, and a meta-learned initialization M0 to optimize memory tokens with test-time gradient descent.

You can think of GradMem like learning to write onto a tiny RAM module: gradients iteratively refine a small memory block instead of repeatedly streaming the whole disk-like context.

This gradient-based WRITE mechanism lets GradMem trade extra compute for better compression, enabling few-step context encoding into fixed-size memory that a plain context window or forward-only encoder cannot match.
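The gradient-based WRITE loop can be sketched in a few lines. This is a minimal toy, not the paper's implementation: a frozen orthogonal matrix D stands in for the frozen LM fθ, the WRITE objective Lwrite is plain squared reconstruction error, and the learning rate and step count are illustrative.

```python
import numpy as np

# Toy sketch of the gradient-based WRITE phase. Assumptions (not from
# the paper): a frozen orthogonal matrix D stands in for the frozen LM
# f_theta, and the WRITE objective L_write is squared reconstruction
# error; learning rate and step count are illustrative.
rng = np.random.default_rng(0)

d = 8                                             # toy memory size
D, _ = np.linalg.qr(rng.standard_normal((d, d)))  # frozen "model" weights
C = rng.standard_normal(d)                        # context to write into memory
M = np.zeros(d)                                   # meta-learned init M0 (zeros here)

lr, K = 0.1, 50                                   # the paper meta-trains for K <= 5
for _ in range(K):                                # WRITE: gradient steps on M only
    grad = 2 * D.T @ (D @ M - C)                  # d/dM of ||D @ M - C||^2
    M -= lr * grad                                # D (the "model") stays frozen

# READ stand-in: C is discarded; reconstruct it from memory alone.
err = float(np.linalg.norm(D @ M - C))
print(f"reconstruction error after {K} WRITE steps: {err:.2e}")
```

The key property the sketch preserves is that only the memory M is updated while the model weights stay frozen, which is exactly the trade of extra WRITE compute for a smaller reusable state.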

DIAGRAM

GradMem WRITE and READ Flow in Context Removal Setting

This diagram shows how GradMem encodes a context into memory during WRITE and then answers a query using only memory and query during READ under context removal.

DIAGRAM

GradMem Evaluation Pipeline on Associative KV-retrieval

This diagram shows how GradMem is trained and evaluated on associative KV-retrieval, including baselines and the context removal constraint.

PROCESS

How GradMem Handles a Context Removal Task Instance

  1. WRITE phase

     GradMem receives context C and initializes memory from the meta-learned initialization M0, treating memory tokens as writable parameters for this example.

  2. Context encoder Eθ

     GradMem runs the context encoder Eθ by prepending the current memory to C, computing the token predictions needed for the WRITE objective Lwrite.

  3. Test-time gradient descent on memory

     GradMem minimizes Lwrite(M; C) with K gradient steps on memory tokens only, producing an example-specific MK that encodes unpredictable context information.

  4. READ phase

     GradMem discards C, concatenates MK with query Q, and predicts target Y using fθ(Y | MK, Q) under the context removal constraint.
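The four steps above can be run end-to-end on a toy associative KV-retrieval instance. Everything here is a simplifying assumption rather than the paper's setup: keys are one-hot vectors, the frozen reader is a single matrix product, and Lwrite is squared error on the stored values.

```python
import numpy as np

# Toy WRITE/READ cycle on associative KV-retrieval. Assumptions (not
# from the paper): one-hot keys, a frozen linear reader, and squared
# error on stored values as the WRITE objective.
keys = np.eye(4)                          # 4 one-hot keys (the "context" C)
values = np.array([[1., 0.], [0., 1.], [1., 1.], [2., -1.]])

M = np.zeros((4, 2))                      # memory, init M0 = 0
lr, K_steps = 0.25, 5                     # few-step WRITE, K <= 5 as in the paper

for _ in range(K_steps):                  # WRITE: gradients w.r.t. M only
    grad = 2 * keys.T @ (keys @ M - values)
    M -= lr * grad

# READ: the context is discarded; answer queries from memory alone.
def read(query: np.ndarray) -> np.ndarray:
    return query @ M

print(read(keys[3]))                      # close to [2., -1.] after 5 steps
```

With orthogonal keys each gradient step shrinks the residual by a constant factor, so even five steps suffice here; the paper's meta-learned initialization M0 plays the analogous role of making K ≤ 5 steps enough on real contexts.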

KEY CONTRIBUTIONS

Key Contributions

  • GradMem: gradient-based context memorization

    GradMem introduces a WRITE phase that optimizes prefix memory tokens via test-time gradient descent on a self-supervised WRITE objective Lwrite, while keeping θ frozen and using a meta-learned initialization M0.

  • Few-step gradient writing

    GradMem meta-trains memory so that K ≤ 5 gradient descent steps reliably encode task-relevant information, enabling prediction from [MK; Q] with the original context removed.

  • Gradient-based memory updates outperform forward-only writing

    On associative KV-retrieval with 8 memory vectors, GradMem with 5 WRITE steps reaches 88.4% exact match on 96 pairs, while forward-only RMT with the same memory reaches only 12.9% on 64 pairs.

RESULTS

By the Numbers

Exact match 96 pairs

88.4%

+75.5 over RMT x1 on 64 pairs (12.9%)

Exact match 64 pairs

99.1%

+79.8 over RMT x1 on 16 pairs (19.3%)

Exact match 32 pairs

99.9%

+55.6 over RMT x1 on 16 pairs (44.3%)

Short SQuAD EM

54.9%

+12.3 over RMT (42.6%) and +15.9 over ARMT (39.0%)

On associative KV-retrieval, which directly measures memory capacity under context removal, GradMem with 8 memory vectors and 5 gradient WRITE steps stores up to 96 key–value pairs with 88.4% exact match. On Short SQuAD, GradMem with increased K reaches 54.9% exact match, surpassing RMT and ARMT while using the same GPT-2 backbone.

BENCHMARK

KV-retrieval: Gradient-based WRITE vs Forward-only WRITE with 8 Memory Vectors

Exact match accuracy on associative KV-retrieval with 96 key–value pairs and 8 memory vectors.

KEY INSIGHT

The Counterintuitive Finding

GradMem with only 8 memory vectors and 5 gradient WRITE steps achieves 88.4% exact match on 96 key–value pairs, while forward-only RMT collapses to 12.9% on 64 pairs.

This is surprising because both use the same architecture and memory size, yet simply changing the WRITE rule from forward-only to gradient-based yields a 75.5 percentage point gap, even with GradMem evaluated on the harder 96-pair setting.

WHY IT MATTERS

What this unlocks for the field

GradMem shows that a small, gradient-optimized prefix memory can replace large KV-caches, enabling reusable compressed context states across many queries.

Builders can now design systems that read a long document once, run a few WRITE gradient steps, and then answer many questions from a tiny learned memory instead of repeatedly reprocessing the full context.


Related papers

Benchmark

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam, Witchayut Kornsuwannawit

· 2026

Beyond the Context Window compares Conversation Segmentation, Fact Extraction, Embedding and Storage, and Retrieval Mechanism in a Mem0-based memory system against long-context GPT-5-mini. On LongMemEval, Beyond the Context Window finds LC GPT-5-mini reaches 82.40% accuracy, 33.4 percentage points above the memory system baseline.

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with Retrieval from the Conversation, Scratchpad Formation and Utilization, and a Working Memory buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.
