GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov et al.

2026

TL;DR

GradMem uses test-time gradient descent on prefix memory tokens to write context, storing up to 96 key–value pairs with 88.4% exact match using only 8 memory vectors.


THE PROBLEM

Long-context LMs Need Compact Reusable Memory, Not Huge KV-caches

Many large language model applications require conditioning on long contexts, but Transformers typically rely on a large per-layer KV-cache of past activations, which incurs substantial memory overhead.

This KV-cache strategy makes it hard to reuse information across queries and does not naturally produce a portable, compact memory representation, limiting efficient context removal and multi-query reuse.
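The footprint gap is easy to quantify with back-of-envelope arithmetic. A rough sketch, assuming GPT-2-small-like dimensions (12 layers, d_model = 768) and a 4096-token context; these numbers are illustrative, not taken from the paper:

```python
def kv_cache_floats(num_layers: int, seq_len: int, d_model: int) -> int:
    # Each layer caches one key and one value vector per token.
    return 2 * num_layers * seq_len * d_model

def prefix_memory_floats(num_memory_vectors: int, d_model: int) -> int:
    # A GradMem-style memory is just a handful of d_model-sized vectors.
    return num_memory_vectors * d_model

layers, d = 12, 768  # GPT-2-small-like dims (assumption, not from the paper)
cache = kv_cache_floats(layers, seq_len=4096, d_model=d)   # 75,497,472 floats
memory = prefix_memory_floats(8, d)                        # 6,144 floats
print(f"KV-cache: {cache:,} floats; memory: {memory:,} floats "
      f"({cache // memory:,}x larger)")
```

Under these assumptions the per-query KV-cache is four orders of magnitude larger than an 8-vector memory, which is the asymmetry GradMem exploits.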

HOW IT WORKS

GradMem: Gradient-based Context Memorization

GradMem combines a WRITE phase, a READ phase, a context encoder Eθ, a self-supervised WRITE objective Lwrite, and a meta-learned initialization M0 to optimize memory tokens with test-time gradient descent.

You can think of GradMem like learning to write onto a tiny RAM module: gradients iteratively refine a small memory block instead of repeatedly streaming the whole disk-like context.

This gradient-based WRITE mechanism lets GradMem trade extra compute for better compression, enabling few-step context encoding into fixed-size memory that a plain context window or forward-only encoder cannot match.
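The gradient-based WRITE loop can be sketched in a few lines. This is a minimal toy, not the paper's implementation: a frozen orthogonal matrix D stands in for the frozen LM fθ, the WRITE objective Lwrite is plain squared reconstruction error, and the learning rate and step count are illustrative.

```python
import numpy as np

# Toy sketch of the gradient-based WRITE phase. Assumptions (not from
# the paper): a frozen orthogonal matrix D stands in for the frozen LM
# f_theta, and the WRITE objective L_write is squared reconstruction
# error; learning rate and step count are illustrative.
rng = np.random.default_rng(0)

d = 8                                             # toy memory size
D, _ = np.linalg.qr(rng.standard_normal((d, d)))  # frozen "model" weights
C = rng.standard_normal(d)                        # context to write into memory
M = np.zeros(d)                                   # meta-learned init M0 (zeros here)

lr, K = 0.1, 50                                   # the paper meta-trains for K <= 5
for _ in range(K):                                # WRITE: gradient steps on M only
    grad = 2 * D.T @ (D @ M - C)                  # d/dM of ||D @ M - C||^2
    M -= lr * grad                                # D (the "model") stays frozen

# READ stand-in: C is discarded; reconstruct it from memory alone.
err = float(np.linalg.norm(D @ M - C))
print(f"reconstruction error after {K} WRITE steps: {err:.2e}")
```

The key property the sketch preserves is that only the memory M is updated while the model weights stay frozen, which is exactly the trade of extra WRITE compute for a smaller reusable state.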

DIAGRAM

GradMem WRITE and READ Flow in Context Removal Setting

This diagram shows how GradMem encodes a context into memory during WRITE and then answers a query using only memory and query during READ under context removal.

DIAGRAM

GradMem Evaluation Pipeline on Associative KV-retrieval

This diagram shows how GradMem is trained and evaluated on associative KV-retrieval, including baselines and the context removal constraint.

PROCESS

How GradMem Handles a Context Removal Task Instance

  1. WRITE phase

     GradMem receives context C and initializes memory from the meta-learned initialization M0, treating memory tokens as writable parameters for this example.

  2. Context encoder Eθ

     GradMem runs the context encoder Eθ by prepending the current memory to C, computing the token predictions needed for the WRITE objective Lwrite.

  3. Test-time gradient descent on memory

     GradMem minimizes Lwrite(M; C) with K gradient steps on memory tokens only, producing an example-specific MK that encodes unpredictable context information.

  4. READ phase

     GradMem discards C, concatenates MK with query Q, and predicts target Y using fθ(Y | MK, Q) under the context removal constraint.
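The four steps above can be run end-to-end on a toy associative KV-retrieval instance. Everything here is a simplifying assumption rather than the paper's setup: keys are one-hot vectors, the frozen reader is a single matrix product, and Lwrite is squared error on the stored values.

```python
import numpy as np

# Toy WRITE/READ cycle on associative KV-retrieval. Assumptions (not
# from the paper): one-hot keys, a frozen linear reader, and squared
# error on stored values as the WRITE objective.
keys = np.eye(4)                          # 4 one-hot keys (the "context" C)
values = np.array([[1., 0.], [0., 1.], [1., 1.], [2., -1.]])

M = np.zeros((4, 2))                      # memory, init M0 = 0
lr, K_steps = 0.25, 5                     # few-step WRITE, K <= 5 as in the paper

for _ in range(K_steps):                  # WRITE: gradients w.r.t. M only
    grad = 2 * keys.T @ (keys @ M - values)
    M -= lr * grad

# READ: the context is discarded; answer queries from memory alone.
def read(query: np.ndarray) -> np.ndarray:
    return query @ M

print(read(keys[3]))                      # close to [2., -1.] after 5 steps
```

With orthogonal keys each gradient step shrinks the residual by a constant factor, so even five steps suffice here; the paper's meta-learned initialization M0 plays the analogous role of making K ≤ 5 steps enough on real contexts.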

KEY CONTRIBUTIONS

Key Contributions

  • GradMem: gradient-based context memorization

    GradMem introduces a WRITE phase that optimizes prefix memory tokens via test-time gradient descent on a self-supervised WRITE objective Lwrite, while keeping θ frozen and using a meta-learned initialization M0.

  • Few-step gradient writing

    GradMem meta-trains memory so that K ≤ 5 gradient descent steps reliably encode task-relevant information, enabling prediction from [MK; Q] with the original context removed.

  • Gradient-based memory updates outperform forward-only writing

    On associative KV-retrieval with 8 memory vectors, GradMem with 5 WRITE steps reaches 88.4% exact match on 96 pairs, while forward-only RMT with the same memory reaches only 12.9% on 64 pairs.

RESULTS

By the Numbers

Exact match 96 pairs

88.4%

+75.5 over RMT x1 on 64 pairs (12.9%)

Exact match 64 pairs

99.1%

+79.8 over RMT x1 on 16 pairs (19.3%)

Exact match 32 pairs

99.9%

+55.6 over RMT x1 on 16 pairs (44.3%)

Short SQuAD EM

54.9%

+12.3 over RMT (42.6%) and +15.9 over ARMT (39.0%)

On associative KV-retrieval, which directly measures memory capacity under context removal, GradMem with 8 memory vectors and 5 gradient WRITE steps stores up to 96 key–value pairs with 88.4% exact match. On Short SQuAD, GradMem with increased K reaches 54.9% exact match, surpassing RMT and ARMT while using the same GPT-2 backbone.

BENCHMARK

KV-retrieval: Gradient-based WRITE vs Forward-only WRITE with 8 Memory Vectors

Exact match accuracy on associative KV-retrieval with 96 key–value pairs and 8 memory vectors.

KEY INSIGHT

The Counterintuitive Finding

GradMem with only 8 memory vectors and 5 gradient WRITE steps achieves 88.4% exact match on 96 key–value pairs, while forward-only RMT collapses to 12.9% on 64 pairs.

This is surprising because both use the same architecture and memory size, yet simply changing the WRITE rule from forward-only to gradient-based yields a 75.5 percentage point gap, even with GradMem evaluated on the harder 96-pair setting.

WHY IT MATTERS

What this unlocks for the field

GradMem shows that a small, gradient-optimized prefix memory can replace large KV-caches, enabling reusable compressed context states across many queries.

Builders can now design systems that read a long document once, run a few WRITE gradient steps, and then answer many questions from a tiny learned memory instead of repeatedly reprocessing the full context.


Related papers

Benchmark

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam, Witchayut Kornsuwannawit

· 2026

Beyond the Context Window compares Conversation Segmentation, Fact Extraction, Embedding and Storage, and Retrieval Mechanism in a Mem0-based memory system against long-context GPT-5-mini. On LongMemEval, Beyond the Context Window finds LC GPT-5-mini reaches 82.40% accuracy, 33.4 percentage points above the memory system baseline.

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with Retrieval from the Conversation, Scratchpad Formation and Utilization, and a Working Memory buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.
