Titans: Learning to Memorize at Test Time

Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni

arXiv, 2025

TL;DR

Titans adds a deep neural long-term memory that is updated at test time by surprise-driven gradient steps with momentum and forgetting. At 760M parameters, Titans (MAC) reaches 52.51 average accuracy on commonsense reasoning benchmarks, versus 51.49 for Gated DeltaNet-H2.



THE PROBLEM

Transformers Break on Million-Token Contexts and Needle Tasks

Transformers model dependencies accurately, but attention cost grows quadratically with sequence length, which limits context windows and makes million-token sequences impractical for many tasks.

Linear Transformers avoid that cost by compressing the context into a fixed-size state, but such a state cannot faithfully store a long history, which hurts language modeling, reasoning, and recall-intensive tasks.

HOW IT WORKS

Titans: Deep Neural Memory with Surprise-Driven Updates

Titans combines a Core short-term attention module, a deep Long-term Memory MLP, and Persistent Memory tokens. The three parts can be integrated in three ways: Memory as a Context (MAC), Memory as a Gate (MAG), and Memory as a Layer (MAL).

You can think of Titans like a brain with a focused working memory window plus a deep, writable long-term store that keeps surprising events, guided by a meta-learning rule.

This surprise-driven, gated update lets Titans remember and retrieve information from far beyond the attention window, enabling BABILong and 2M-token needle-in-a-haystack tasks that plain attention cannot handle.
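The update can be written compactly. The symbols below are not defined in this summary; they follow our reading of the paper's notation and should be treated as an assumption: $\mathcal{M}_t$ is the memory after token $t$, $S_t$ the accumulated (momentum) surprise, $\theta_t$ the step size, $\eta_t$ the momentum decay, and $\alpha_t$ the forgetting gate.

```latex
\begin{aligned}
\ell(\mathcal{M}_{t-1}; x_t) &= \lVert \mathcal{M}_{t-1}(k_t) - v_t \rVert_2^2,
  \qquad k_t = x_t W_K,\; v_t = x_t W_V \\
S_t &= \eta_t\, S_{t-1} - \theta_t\, \nabla \ell(\mathcal{M}_{t-1}; x_t) \\
\mathcal{M}_t &= (1 - \alpha_t)\, \mathcal{M}_{t-1} + S_t
\end{aligned}
```

A large gradient means the memory predicted the value for this key poorly, so the token is "surprising" and changes the memory more. Setting $\alpha_t$ close to 1 erases old content, while $\alpha_t$ close to 0 preserves it.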

DIAGRAM

Titans MAC Inference Flow Across Segments

This diagram shows how Titans (Memory as a Context) retrieves long-term memories per segment and feeds them into attention with persistent tokens.

DIAGRAM

Training and Ablation Pipeline for Titans

This diagram shows how Titans variants and ablations are trained and evaluated on language modeling, reasoning, and long-context benchmarks.

PROCESS

How Titans Handles a Long Sequence

01
Long-term Memory
Titans uses the Long-term Memory MLP to encode past keys and values via surprise-driven gradient updates with momentum and adaptive forgetting.
02
Memory as a Context
In the Memory as a Context variant, Titans retrieves segment-specific history from Long-term Memory and concatenates it with Persistent Memory tokens and the current inputs before attention.
03
Memory as a Gate
In Memory as a Gate, Titans runs sliding-window attention in the Core and combines it with Long-term Memory outputs through a learned nonlinear gate.
04
Memory as a Layer
In Memory as a Layer, Titans first applies the Long-term Memory over the sequence and then feeds the compressed representations into sliding-window attention for final predictions.
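Step 01 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the memory is a single matrix rather than a deep MLP, the hyperparameters `theta`, `eta`, and `alpha` are fixed scalars (we understand them to be input-dependent in the paper), and the class name `LinearMemory` is ours.

```python
import numpy as np


class LinearMemory:
    """Toy long-term memory: one matrix M trained online, token by token.

    Assumption: a linear map stands in for the deep MLP so the gradient
    can be written by hand.
    """

    def __init__(self, dim):
        self.M = np.zeros((dim, dim))  # memory weights
        self.S = np.zeros((dim, dim))  # momentum, i.e. accumulated past surprise

    def loss(self, k, v):
        # Associative-memory loss: how badly does M map key k to value v?
        return float(np.sum((self.M @ k - v) ** 2))

    def update(self, k, v, theta=0.1, eta=0.5, alpha=0.01):
        # Momentary surprise: gradient of the loss with respect to M.
        grad = 2.0 * np.outer(self.M @ k - v, k)
        # Momentum carries surprise forward across tokens.
        self.S = eta * self.S - theta * grad
        # Forgetting: decay the old memory, then add the new update.
        self.M = (1.0 - alpha) * self.M + self.S

    def retrieve(self, q):
        # Retrieval is a plain forward pass; the weights are not changed.
        return self.M @ q
```

Calling `update(k, v)` repeatedly on the same pair drives `loss(k, v)` down, and `retrieve(k)` then returns a vector close to `v`. Because `alpha` is nonzero, the memory never fits the pair exactly: the forgetting term keeps pulling the weights toward zero, which is the price paid for being able to discard stale content.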

KEY CONTRIBUTIONS

Key Contributions

01
Neural long-term memory
Titans introduces a deep Long-term Memory module that keeps learning at test time using surprise gradients, momentum, and weight decay, a procedure equivalent to meta-learning with mini-batch gradient descent.
02
Titans architectures
Titans defines three integration schemes, Memory as a Context, Memory as a Gate, and Memory as a Layer, combining Core attention, Long-term Memory, and Persistent Memory.
03
Long context performance
Titans (MAC) achieves 52.51 average accuracy vs 51.49 for Gated DeltaNet-H2 at 760M parameters, and outperforms GPT-4 on BABILong tasks.
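The three integration schemes differ only in where the memory output enters the attention path. The sketch below shows that wiring with toy components. Everything in it is an illustrative assumption: `attention` uses identity projections, `memory` is any callable that maps token vectors to retrieved vectors, the sigmoid gate in `mag` is a stand-in for the learned nonlinear gate, and the memory write-back step is omitted.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, window=None):
    """Toy single-head causal self-attention with identity projections."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    idx = np.arange(n)
    mask = idx[None, :] > idx[:, None]  # block attention to future tokens
    if window is not None:
        # Sliding window: also block keys that are `window` or more steps back.
        mask |= (idx[:, None] - idx[None, :]) >= window
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ x


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mac(segment, persistent, memory):
    """Memory as a Context: retrieved history joins the attention input."""
    history = memory(segment)  # read long-term memory using the segment
    ctx = np.concatenate([persistent, history, segment], axis=0)
    return attention(ctx)[-len(segment):]  # keep the segment positions only


def mag(x, memory, gate_w, window=4):
    """Memory as a Gate: blend sliding-window attention with memory output."""
    y = attention(x, window=window)
    m = memory(x)
    g = sigmoid(x @ gate_w)  # hypothetical gate form
    return g * y + (1.0 - g) * m


def mal(x, memory, window=4):
    """Memory as a Layer: memory first, sliding-window attention second."""
    return attention(memory(x), window=window)
```

With `memory = lambda z: z @ W` for some matrix `W`, all three functions return one output vector per input token. In MAC the attention module sees the retrieved history and can choose whether to use it. In MAG and MAL the memory output is mixed in regardless, either in parallel with attention or ahead of it.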

RESULTS

By the Numbers

Avg. accuracy, 760M: 52.51 (+1.02 over Gated DeltaNet-H2)

Wiki perplexity, 760M: 19.93 (vs 19.88 for Gated DeltaNet-H2)

LMB perplexity, 760M: 20.12 (vs 20.83 for Gated DeltaNet-H2)

S-NIAH-W, 16K: 95.20 (+95.20 over TTT)

On language modeling and commonsense reasoning benchmarks, Titans (MAC) at 760M parameters reaches 52.51 average accuracy across PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, SIQA, and BoolQ. On the RULER S-NIAH-W task at 16K tokens, Titans (MAC) scores 95.2, showing that Titans maintains effective context well beyond typical Transformer windows.

BENCHMARK

Language Modeling and Reasoning at 400M Parameters

Average accuracy across PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, SIQA, and BoolQ at 400M parameters.

KEY INSIGHT

The Counterintuitive Finding

Titans (LMM) without any attention reaches 46.17 average accuracy at 340M parameters, beating Transformer++ at 42.92 with full attention.

This is surprising because a pure recurrent memory module, trained via meta gradient updates, surpasses a standard Transformer that can directly attend over the whole 4K training context.

WHY IT MATTERS

What this unlocks for the field

Titans unlocks practical long-term memory that can learn and update at test time while scaling to contexts of over 2M tokens and hard needle-in-a-haystack tasks.

Builders can now design sequence models in which attention handles local reasoning and the Titans Long-term Memory persistently stores surprising events, enabling robust long-document reasoning and streaming applications.


