TransformerFAM: Feedback attention is working memory

Authors: Dongseong Hwang, Weiran Wang, Zhuoyuan Huo et al.

2024

TL;DR

TransformerFAM adds a feedback attention memory inside each Transformer block, enabling indefinite working memory and perfect PassKey retrieval up to 260k tokens.


THE PROBLEM

Transformers Forget Beyond an Effective Receptive Field of Depth × Window Size

Sliding Window and Block Sliding Window Attention give Transformers linear complexity but limit the effective receptive field to approximately model depth × window size.

For long-context tasks like NarrativeQA and ScrollsQasper, this means later tokens can no longer access information from early in the prompt, breaking reasoning over documents of 5k to 500k tokens.
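As a rough illustration of that bound, here is a minimal sketch; the layer and window counts below are assumptions chosen for the example, not values from the paper:

```python
# Hypothetical depth and window size, chosen only to illustrate the bound.
num_layers = 24        # Transformer depth
window_size = 1024     # sliding-window / block size in tokens

# Information can propagate at most one window per layer, so the effective
# receptive field is roughly depth * window_size.
effective_receptive_field = num_layers * window_size
print(effective_receptive_field)  # 24576 tokens -- far short of a 500k-token document
```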

HOW IT WORKS

TransformerFAM — Feedback Attention as Working Memory

TransformerFAM augments Block Sliding Window Attention (BSWA) with Feedback Attention Memory (FAM), modifying the self-attention pattern so the existing QKV projections jointly process block tokens, memory segments, and FAM.

You can think of BSWA as local cache lines and FAM as a tiny, persistent working RAM that is repeatedly updated and read, similar to a cortical-thalamic loop.

This feedback attention lets TransformerFAM compress each block into FAM and propagate it indefinitely, so later tokens can still use information far beyond any fixed context window.
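A minimal NumPy sketch of that joint attention for one block, using toy sizes, a single head, and random weights. Causal masking, the feed-forward update of FAM, and multi-head details are omitted; every name and dimension here is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 32                                           # toy model width
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

block = rng.standard_normal((1024, d))           # current block tokens
mem_segment = rng.standard_normal((1024, d))     # cached past keys/values (BSWA)
fam_prev = rng.standard_normal((64, d))          # feedback attention memory, FAM length 64

# Block tokens attend over the memory segment, the current block, and FAM.
ctx = np.concatenate([mem_segment, block, fam_prev], axis=0)
q, k, v = block @ Wq, ctx @ Wk, ctx @ Wv
block_out = softmax(q @ k.T / np.sqrt(d)) @ v

# The previous FAM is copied as the query and attends over the current block
# plus itself, compressing the block into the next working-memory state.
fam_ctx = np.concatenate([block, fam_prev], axis=0)
fq, fk, fv = fam_prev @ Wq, fam_ctx @ Wk, fam_ctx @ Wv
fam_next = softmax(fq @ fk.T / np.sqrt(d)) @ fv

print(block_out.shape, fam_next.shape)           # (1024, 32) (64, 32)
```

Note that no new weight matrices appear: the FAM queries, keys, and values reuse the same projections as the block tokens, which is what allows existing checkpoints to be adapted without adding parameters.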

DIAGRAM

Blockwise Inference Flow with Feedback Attention Memory

This diagram shows how TransformerFAM processes each block during inference, updating Feedback Attention Memory while attending over BSWA memory segments.

DIAGRAM

Training and Evaluation Pipeline for Long Context Tasks

This diagram shows how TransformerFAM is fine-tuned with Flan instruction data and then evaluated on long context benchmarks.

PROCESS

How TransformerFAM Handles Long Context Sequence Processing

  1. Block Sliding Window Attention

    TransformerFAM first applies Block Sliding Window Attention, using block size 1024 and memory segments to cache past keys and values as in TransformerBSWA.

  2. Feedback Attention Memory Initialization

    TransformerFAM initializes Feedback Attention Memory with a short sequence, for example FAM length 64, representing global context before processing the first block.

  3. Feedback Attention Memory Update

    Within each block, TransformerFAM copies the previous FAM as queries, attends over the current block and the previous FAM keys and values, and passes the result through the feed-forward layers to produce the updated FAM state F_τ.

  4. Length Extrapolation and Inference

    TransformerFAM repeatedly applies BSWA plus FAM updates across blocks, enabling O(L) compute and O(1) memory while propagating information over an indefinite horizon; a minimal end-to-end sketch of this loop follows the list.
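The sketch below strings steps 1 to 4 together with toy sizes and a simplified attention helper (keys and values are the raw context vectors; projections, masking, and per-layer stacking are skipped). All names and numbers are assumptions for illustration, not the paper's code:

```python
from collections import deque
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context, d):
    # Toy attention step: keys and values are the raw context vectors.
    return softmax(queries @ context.T / np.sqrt(d)) @ context

d, block_size, fam_len, num_mem_segments = 32, 1024, 64, 2
rng = np.random.default_rng(0)

tokens = rng.standard_normal((10 * block_size, d))   # a long input sequence
fam = rng.standard_normal((fam_len, d))              # step 2: initial working memory
mem = deque(maxlen=num_mem_segments)                 # step 1: bounded BSWA cache

for start in range(0, len(tokens), block_size):
    block = tokens[start:start + block_size]
    # Step 1: block tokens attend over cached segments, the block, and FAM.
    ctx = np.concatenate(list(mem) + [block, fam], axis=0)
    block_out = attend(block, ctx, d)
    # Step 3: FAM attends over the block and its previous state, compressing it.
    fam = attend(fam, np.concatenate([block, fam], axis=0), d)
    # Step 4: the cache stays bounded, so activation memory is O(1) in sequence length.
    mem.append(block)

print(block_out.shape, fam.shape)                    # (1024, 32) (64, 32)
```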

KEY CONTRIBUTIONS

Key Contributions

  • Feedback Attention Memory Architecture

    TransformerFAM introduces Feedback Attention Memory inside each Transformer layer, combining BSWA memory segments and FAM to implement distributed working memory without adding new weights.

  • Indefinite Horizon Working Memory

    TransformerFAM achieves O(L) compute and O(1) memory during inference, and perfectly solves PassKey retrieval up to 260k filler tokens with FAM length 64.

  • Scalable Long Context Improvements

    TransformerFAM improves long context tasks such as ScrollsQasper from 12.4 to 18.5 at 8B parameters and from 28.0 to 29.4 at 24B, while also slightly improving GPT3 benchmarks.

RESULTS

By the Numbers

ScrollsQasper score: 18.5 (+6.1 over TransformerBSWA 8B)

ScrollsQasper score: 29.4 (+1.4 over TransformerBSWA 24B)

GPT3 Rank: 74.0 (+1.2 over TransformerBSWA 8B)

PassKey accuracy: 1.0 (perfect up to 260k filler tokens)

On long context tasks like ScrollsQasper, which require reasoning over 5k to 500k tokens, TransformerFAM raises the 8B score from 12.4 to 18.5 and the 24B score from 28.0 to 29.4. Combined with perfect PassKey retrieval up to 260k tokens, this shows TransformerFAM maintains and uses information far beyond the effective receptive field of TransformerBSWA.
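For context, PassKey retrieval hides a short key inside long filler text and asks the model to repeat it. Here is a sketch of how such a prompt can be built; the filler sentence and wording are illustrative, not the paper's exact template:

```python
import random

def make_passkey_prompt(num_filler_blocks: int) -> tuple[str, str]:
    """Bury a random 5-digit pass key inside repeated filler text."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    insert_at = random.randint(0, num_filler_blocks)
    parts = (
        [filler] * insert_at
        + [f"The pass key is {passkey}. Remember it. "]
        + [filler] * (num_filler_blocks - insert_at)
        + ["What is the pass key? The pass key is"]
    )
    return "".join(parts), passkey

prompt, answer = make_passkey_prompt(num_filler_blocks=5000)
print(len(prompt.split()), answer)  # tens of thousands of filler words, one 5-digit key
```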

BENCHMARK

Long Context Tasks: ScrollsQasper Scores for 8B and 24B Models

ScrollsQasper scores comparing TransformerFAM and TransformerBSWA at 8B and 24B parameters.

KEY INSIGHT

The Counterintuitive Finding

TransformerFAM slightly improves GPT3 tasks, for example GPT3 Rank from 72.8 to 74.0 on 8B, even though all sequences are shorter than 2k tokens.

This is surprising because feedback working memory is designed for very long contexts, yet TransformerFAM also yields better short context representations by offloading global information into FAM.

WHY IT MATTERS

What this unlocks for the field

TransformerFAM shows that adding feedback attention inside each layer can implement working memory with O(1) activation footprint over arbitrarily long sequences.

Builders can now adapt existing Flan PaLM style checkpoints into TransformerFAM, gaining indefinite context processing for tasks like NarrativeQA, ScrollsQasper, and PG19 without adding new parameters.


