Test-time regression: a unifying framework for designing sequence models with associative memory

Authors: Ke Alexander Wang, Jiaxin Shi, Emily B. Fox

2025

TL;DR

Test-time regression reframes sequence layers as associative memories that solve a regression problem at inference, unifying linear attention, SSMs, fast weights, online learners, and softmax attention.



THE PROBLEM

Sequence architectures are fragmented and poorly understood

Sequence models like Transformers, linear attention, and SSMs have emerged from separate lines of investigation with idiosyncratic notations and motivations.

This fragmented, empirically driven development obscures shared structure, making it hard to explain why some architectures work better and how to systematically design new associative memory layers for sequence modeling.

HOW IT WORKS

Test-time regression layers as associative memory

Test-time regression defines memorization as regression and retrieval as function application: a test-time regression layer first fits a regressor to the key-value pairs in its context, then applies that regressor to each query.

You can think of this like a CPU that, at each forward pass, reprograms a tiny hardware accelerator from the current sequence, then uses it as a content-addressable associative memory.

This mechanism lets test-time regression implement softmax attention, linear attention, and SSMs as instances of a single regression view, going beyond what any one fixed design can express.
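As a minimal sketch of this memorize-then-retrieve loop (illustrative shapes and random data; the paper's layers use learned projections and richer regression choices, not raw arrays), an ordinary least-squares fit over key-value pairs can serve as the associative memory:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                      # sequence length, model dimension (toy values)

K = rng.standard_normal((T, d))   # keys observed so far
V = rng.standard_normal((T, d))   # values to memorize
q = rng.standard_normal(d)        # query

# Step 1 (memorization): fit a linear map W minimizing ||K @ W - V||^2.
W, *_ = np.linalg.lstsq(K, V, rcond=None)

# Step 2 (retrieval): apply the fitted regressor to the query.
out = q @ W

# The memory reproduces each stored value from its key (the keys here are
# linearly independent, so the regression fit is exact).
assert np.allclose(K @ W, V, atol=1e-6)
```

Different regressors and fitting procedures in step 1 yield different sequence layers; the retrieval step stays the same.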

DIAGRAM

Forward pass as memorization then retrieval

This diagram shows how test-time regression performs a two-step memorize-then-retrieve associative recall process over a sequence during the forward pass.

DIAGRAM

Design space of test-time regression layers

This diagram shows how different sequence architectures arise from specific regression choices in the test-time regression framework.

PROCESS

How test-time regression handles a sequence modeling task

  1. Memorization as regression

     Test-time regression takes key-value pairs and solves a weighted least-squares or kernel regression problem to build an associative memory map over the sequence.

  2. Memory retrieval as function application

     Given a query, test-time regression applies the learned regressor to the query, retrieving a value that matches or interpolates the memorized keys.

  3. A recipe for designing your own sequence layer

     A layer is specified by three choices: the regression weights, the regressor's function class, and the optimization algorithm. Concrete choices yield concrete architectures like linear attention or softmax attention.

  4. Deriving existing architectures from regression

     Linear attention, state space models, fast weight programmers, online learning layers, and softmax attention all emerge as special cases of this associative regression view.
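The simplest instance of this recipe can be checked directly. Under the framework, linear attention corresponds to an unregularized linear memory built by rank-one writes; the numpy sketch below (random data, single head, no feature map — all illustrative assumptions, not the paper's exact parameterization) confirms that the recurrent memory view coincides with the familiar attention sum:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4                       # toy sequence length and dimension
K = rng.standard_normal((T, d))   # keys
V = rng.standard_normal((T, d))   # values
q = rng.standard_normal(d)        # query

# Recurrent view: a memory matrix M = sum_t v_t k_t^T, one rank-1 write per step.
M = np.zeros((d, d))
for k, v in zip(K, V):
    M += np.outer(v, k)

recurrent_out = M @ q             # retrieval: apply the memory to the query
attention_out = V.T @ (K @ q)     # parallel form: sum_t v_t (k_t . q)
assert np.allclose(recurrent_out, attention_out)
```

The same skeleton with per-step decay on `M` gives gated variants, and replacing the rank-one write with a gradient step gives online-learning layers.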

KEY CONTRIBUTIONS

Key Contributions

  • Test-time regression as a framework for designing sequence layers

    Test-time regression formalizes memorization as regression and retrieval as function application, turning sequence layers into associative memories parameterized by regression choices.

  • Deriving existing architectures from regression

    The framework unifies linear attention, feature-mapped variants, gated linear attention, state space models, fast weight layers, online learning layers, and softmax attention as regression instances.

  • Higher-order generalizations of softmax attention

    Interpreting softmax attention with query-key normalization as local constant regression yields higher-order, locally linear generalizations with richer interactions.
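The local constant interpretation can be verified numerically: Nadaraya-Watson kernel regression with an exponential dot-product kernel reproduces dot-product softmax attention. The sketch below uses random data and omits the query-key normalization that the paper's exact statement relies on:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
T, d = 5, 3                       # toy sequence length and dimension
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
q = rng.standard_normal(d)

# Local constant (Nadaraya-Watson) regression with kernel exp(k . q):
# a kernel-weighted average of the memorized values.
weights = np.exp(K @ q)
nw_out = (weights[:, None] * V).sum(axis=0) / weights.sum()

# Softmax attention over the same keys and values:
attn_out = softmax(K @ q) @ V
assert np.allclose(nw_out, attn_out)
```

Local constant regression fits a single constant around the query; the higher-order generalization fits a local linear model instead, which is what produces the richer interactions mentioned above.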

RESULTS

By the Numbers

Associative recall link

Strong correlation

prior work connects associative recall ability to language modeling performance

Unified layer types

5+ classes

covers linear attention, SSMs, fast weights, online learners, softmax attention

MQAR capacity insight

Up to 64 pairs

perfect MQAR recall when d_model equals the number of cue-response pairs

Regression choices

3 axes

weights, function class, optimizer define the full design space

On synthetic online regression and multi-query associative recall (MQAR) setups, test-time regression demonstrates that regression-derived layers implicitly perform regression in a single forward pass, and that memory capacity, not sequence length, limits MQAR performance.
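The capacity point has a simple linear-algebra illustration. With a linear-attention-style memory and orthonormal cues, a model dimension equal to the number of cue-response pairs suffices for exact recall, no matter how many steps the writes are spread over (a toy sketch with one-hot cues; the paper's MQAR results come from trained models):

```python
import numpy as np

d = 64                                   # model dim = number of cue-response pairs
K = np.eye(d)                            # orthonormal cues (one-hot for simplicity)
V = np.random.default_rng(3).standard_normal((d, d))  # responses to memorize

# Linear-attention memory: M = sum_i v_i k_i^T, written one pair at a time.
M = V.T @ K

# Every cue retrieves its response exactly: recall is limited by capacity
# (d_model vs. number of pairs), not by sequence length.
recalled = (M @ K.T).T                   # query with each cue in turn
assert np.allclose(recalled, V)
```

Pack more than `d` pairs into the same memory and the cues can no longer be orthogonal, so retrieval errors appear; spread the same `d` pairs over a longer sequence and nothing changes.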


BENCHMARK

Conceptual comparison of unified architectures

Relative coverage of major sequence layer families under the test-time regression framework.

KEY INSIGHT

The Counterintuitive Finding

Test-time regression shows that, for multi-query associative recall, difficulty depends on memory capacity, not sequence length, once capacity is sufficient.

This challenges the common assumption that longer sequences inherently make associative recall harder, shifting focus toward designing better associative memories instead.

WHY IT MATTERS

What this unlocks for the field

Test-time regression gives a principled way to design new sequence layers by choosing regression weights, function classes, and optimizers instead of ad hoc architectural tweaks.

Builders can now systematically explore regression inspired memory mechanisms, higher order attention, and test time optimization to create more powerful and interpretable sequence models.


Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al. · 2026

TAG flags low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes via evidence-based retirement inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines while a compute-matched Retry baseline stays flat.
