Titans: Learning to Memorize at Test Time

Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni

arXiv 2025

TL;DR

Titans uses a deep neural long-term memory with surprise-based gradient updates and achieves 39.62% LAMBADA accuracy vs. 37.06% for DeltaNet at 760M parameters.

THE PROBLEM

Transformers break on million-token contexts due to quadratic attention

Titans is motivated by the fact that exact attention has quadratic cost, limiting Transformers to a fixed-length context window and hampering long-context tasks.

This limitation hurts language modeling, video understanding, and long-term time-series forecasting, where models must reason over sequences larger than 2M tokens without losing recall, generalization, or reasoning ability.
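To make the quadratic cost concrete, here is a back-of-the-envelope sketch (illustrative only, not from the paper): the attention score matrix alone grows as the square of the sequence length.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_entry: int = 2) -> int:
    """Memory for one full attention score matrix (seq_len x seq_len),
    assuming fp16 entries; ignores heads, batching, and kernel tricks."""
    return seq_len * seq_len * bytes_per_entry

# At a 2M-token context, a single score matrix alone needs 8 TB:
print(attention_matrix_bytes(2_000_000))  # 8000000000000 bytes
```

Even with memory-efficient kernels that avoid materializing this matrix, the compute still scales quadratically, which is the bottleneck Titans targets.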

HOW IT WORKS

Titans — neural long-term memory with surprise-driven updates

Titans centers on a deep Neural Memory Module, a Core short-term attention branch, and a Persistent Memory branch, combined into three architectural variants: Memory as a Context (MAC), Memory as a Gate (MAG), and Memory as a Layer (MAL).

You can think of the Core as working-memory RAM, the Neural Memory Module as long-term disk storage updated by surprise, and Persistent Memory as firmware encoding task-specific priors.

This surprise-driven, gated update rule lets Titans store abstractions of long-past sequences in its parameters, enabling reasoning beyond any fixed context window, something plain attention and linear recurrent models cannot match.
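The update rule can be sketched as follows, as a minimal reading of the mechanism described above: a single linear memory matrix stands in for the paper's deep MLP, and the fixed scalars `alpha`, `eta`, and `theta` stand in for its learned, data-dependent gates.

```python
import numpy as np

def titans_memory_step(M, S_prev, k, v, alpha=0.1, eta=0.9, theta=0.01):
    """One surprise-driven memory update:
    M_t = (1 - alpha) * M_{t-1} + S_t,  S_t = eta * S_{t-1} - theta * grad.
    Here grad is the gradient of the associative loss ||M k - v||^2,
    the 'momentary surprise' of seeing value v for key k."""
    grad = 2.0 * np.outer(M @ k - v, k)   # momentary surprise
    S = eta * S_prev - theta * grad       # past surprise carried by momentum
    M_new = (1.0 - alpha) * M + S         # adaptive forgetting, then write
    return M_new, S

# Toy usage: repeatedly present one key-value pair; recall improves over steps.
d = 8
rng = np.random.default_rng(0)
k, v = rng.standard_normal(d), rng.standard_normal(d)
M, S = np.zeros((d, d)), np.zeros((d, d))
for _ in range(200):
    M, S = titans_memory_step(M, S, k, v)
recalled = M @ k  # approximately v once the surprise has been consumed
```

The forgetting factor `alpha` bounds how much old content survives each step, so even this toy memory trades retention against plasticity rather than growing without limit.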

DIAGRAM

Titans memory taxonomy and roles

This diagram shows how Titans organizes short term, long term, and persistent memory components inspired by human memory systems.

DIAGRAM

Titans training and evaluation pipeline

This diagram shows how Titans is trained on language data and then evaluated on language modeling, reasoning, and needle in haystack benchmarks.

PROCESS

How Titans Handles a Long Sequence

  1. Learning Process and Surprise Metric

     Titans computes gradients of the associative loss in the Neural Memory Module to measure surprise and updates memory using momentum and data-dependent learning rates.

  2. Forgetting Mechanism

     Titans applies an adaptive forgetting gate with weight decay, scaling previous memory states in the Neural Memory Module before adding new surprise updates.

  3. Memory as a Context

     Titans (MAC) segments sequences, queries the Neural Memory Module with the current segment's queries, and concatenates the retrieved context with Persistent Memory before attention.

  4. Gated Memory

     Titans (MAG) runs sliding-window attention in the Core and combines its output with Neural Memory Module outputs through a learned non-linear gating.
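Steps 3 and 4 can be sketched with simple numpy stand-ins; the function names, shapes, and the fixed sigmoid gate below are illustrative assumptions, not the paper's implementation (which uses a learned non-linear gate).

```python
import numpy as np

def mac_context(persistent, memory_read, segment):
    """Memory as a Context: prepend persistent-memory tokens and tokens
    retrieved from long-term memory to the current segment, so ordinary
    attention over the result can read both. All shapes: (n_tokens, d)."""
    return np.concatenate([persistent, memory_read, segment], axis=0)

def mag_combine(attn_out, memory_out):
    """Memory as a Gate: blend sliding-window attention output with the
    neural-memory output elementwise. A fixed sigmoid of the attention
    output stands in here for the learned non-linear gate."""
    gate = 1.0 / (1.0 + np.exp(-attn_out))
    return gate * attn_out + (1.0 - gate) * memory_out

# Toy shapes: 4 persistent tokens, 6 retrieved tokens, 32-token segment, d=16.
p, m, s = np.zeros((4, 16)), np.zeros((6, 16)), np.zeros((32, 16))
print(mac_context(p, m, s).shape)  # (42, 16)
```

The key structural difference is visible here: MAC lengthens the sequence that attention sees, while MAG keeps the sequence length fixed and fuses two parallel outputs.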

KEY CONTRIBUTIONS

Key Contributions

  • Neural Memory Module

    Titans introduces a deep Neural Memory Module trained online with surprise-based gradients, momentum, and weight decay to store key-value associations at test time.

  • Titans Architectures

    Titans defines three variants, Memory as a Context, Memory as a Gate, and Memory as a Layer, combining long-term memory with short-term attention and Persistent Memory.

  • Long Context Benchmarks

    Titans (MAC) achieves 99.0 on S-NIAH-PK at 8K tokens and 95.2 on S-NIAH-W at 16K tokens, surpassing linear recurrent baselines on RULER and BABILong.

RESULTS

By the Numbers

LAMBADA accuracy

39.62%

+2.56 over DeltaNet at 760M parameters

WikiText perplexity

19.93

vs 24.37 for DeltaNet at 760M parameters

S-NIAH-W 16K

95.2

+95.2 over Mamba2, which scores 0.0 at 16K

BABILong (fine-tuned)

near 100.0

Titans (MAC) surpasses GPT-4 and Llama 3.1 70B on BABILong tasks

These numbers come from the language modeling table and the S-NIAH and BABILong evaluations, which test perplexity, reasoning accuracy, and effective context length. The headline result is that Titans (MAC) improves LAMBADA accuracy to 39.62% while maintaining strong long-context retrieval compared to DeltaNet and Mamba2.

BENCHMARK

Benchmark: Language modeling and commonsense reasoning (Table 1, 760M params)

Average accuracy across PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, SIQA, and BoolQ.

KEY INSIGHT

The Counterintuitive Finding

Titans with no attention, using just the Neural Memory Module, reaches an average reasoning accuracy of 47.83, beating several attention-based baselines at a similar scale.

This is surprising because conventional wisdom assumes attention is required for strong reasoning, yet Titans shows that deep long-term memory alone can rival hybrid architectures.

WHY IT MATTERS

What this unlocks for the field

Titans unlocks test-time learning of long-term associations through a deep Neural Memory Module that can keep updating beyond 2M-token contexts.

Builders can now design sequence models where memory capacity grows with parameters, not context window, enabling robust reasoning in genomics, time series, and ultra-long document tasks.


Related papers

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026

Multi-Agent Memory Architecture organizes an **Agent IO Layer**, **Agent Cache Layer**, and **Agent Memory Layer**, plus **Agent Cache Sharing** and **Agent Memory Access** protocols, into a unified architectural framing for multi-agent systems. As a position paper, it reports no benchmark results or numeric comparisons against baselines.

RAG · Memory Architecture · Long-Term Memory

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025

HippoRAG 2 combines **Offline Indexing**, a schema-less **Knowledge Graph**, **Dense-Sparse Integration**, **Deeper Contextualization**, and **Recognition Memory** into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.

Agent Memory · Memory Architecture

General Agentic Memory Via Deep Research

B.Y. Yan, Chaofan Li et al.

arXiv 2025

General Agentic Memory (GAM) combines a **Memorizer**, **Researcher**, **page-store**, and **memory** to keep full trajectories while constructing lightweight guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.