Augmenting Language Models with Long-Term Memory

Authors: Weizhi Wang, Li Dong, Hao Cheng et al.

2023

TL;DR

LONGMEM pairs a frozen backbone LLM, used as the memory encoder, with a trainable Residual SideNet and decoupled long-term memory retrieval, reaching 40.5% accuracy on ChapterBreak AO3 versus 28.3% for MemTRM (+12.2 points).

THE PROBLEM

LLMs Forget Long Context Beyond Fixed Windows

Existing LLMs are constrained by a fixed input length, so they cannot utilize rich long-context information from past inputs.

This limitation hampers long-text language modeling and many-shot in-context learning, where long-form memory, book-level context, and thousands of demonstrations are crucial for performance.

HOW IT WORKS

LONGMEM — Decoupled Long-Term Memory with Residual SideNet

LONGMEM combines a frozen backbone LLM, a trainable Residual SideNet, a Cached Memory Bank, and a Memory Retrieval and Fusion module to read long-term key–value memories.

You can think of the frozen backbone as a read-only CPU that encodes keys into a disk-like memory bank, while the SideNet acts as RAM, selectively loading and fusing the relevant blocks.

This decoupled design, tied together by Cross-Network Residual Connections, lets LONGMEM draw on memory of effectively unlimited length without the staleness that arises when the memory encoder itself keeps being updated, going far beyond what a plain context window can represent.
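
To ground the design, here is a minimal PyTorch sketch of the frozen backbone, Cached Memory Bank, and Residual SideNet. The class names, the FIFO eviction policy, the use of raw hidden states as stand-ins for attention keys and values, and the exact residual formula are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn


class CachedMemoryBank:
    """FIFO bank of cached key-value pairs from past segments (illustrative)."""

    def __init__(self, capacity: int = 65_536):  # ~65k-token budget, as in the paper
        self.capacity = capacity
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest segments once the token budget is exceeded.
        while sum(x.shape[0] for x in self.keys) > self.capacity:
            self.keys.pop(0)
            self.values.pop(0)


class LongMemSketch(nn.Module):
    def __init__(self, backbone_layers: nn.ModuleList, m: int):
        super().__init__()
        self.backbone = backbone_layers          # frozen, never updated
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.m = m                               # layer whose K/V are cached
        # SideNet has half the backbone's depth; each layer starts as a copy
        # of every second backbone layer (the paper's initialization scheme).
        self.sidenet = nn.ModuleList(
            copy.deepcopy(backbone_layers[2 * l + 1])
            for l in range(len(backbone_layers) // 2)
        )
        self.memory = CachedMemoryBank()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen backbone pass; keep every hidden state.
        hiddens = [x]
        with torch.no_grad():
            for layer in self.backbone:
                hiddens.append(layer(hiddens[-1]))
        # Cache the m-th layer's representations for future segments
        # (a real attention module would expose separate K and V here).
        self.memory.append(hiddens[self.m].detach(), hiddens[self.m].detach())
        # SideNet pass with cross-network residual connections: each layer
        # adds the difference between consecutive backbone hidden states.
        h = hiddens[0]
        for l, layer in enumerate(self.sidenet):
            h = layer(h) + (hiddens[2 * (l + 1)] - hiddens[2 * l])
        return h
```

For example, LongMemSketch(nn.ModuleList(nn.Linear(64, 64) for _ in range(12)), m=9) runs end to end on a (seq_len, 64) input; in the real model the blocks are transformer decoder layers and only the SideNet receives gradients.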

DIAGRAM

Memory Caching and Retrieval Flow in LONGMEM

This diagram shows how LONGMEM caches attention key–value pairs from past segments and retrieves the top-K memory chunks for the current tokens.

DIAGRAM

Training and Evaluation Pipeline for LONGMEM

This diagram shows how LONGMEM is trained on a subset of The Pile and evaluated on PG-22, ArXiv, ChapterBreak, and NLU in-context learning.

PROCESS

How LONGMEM Handles a Long-Context Language Modeling Session

  1. 01

    Frozen Backbone Encoding and Memory Caching

    LONGMEM first runs the frozen backbone LLM on previous inputs, extracting the m-th layer's attention key–value pairs into the Cached Memory Bank, while keeping the backbone's hidden states for the current inputs, which the SideNet re-reads.

  2. 02

    Residual SideNet

    LONGMEM initializes the Residual SideNet from every second backbone layer and uses Cross-Network Residual Connections to inject differences of backbone hidden states into the SideNet layers (sketched in the code above).

  3. 03

    Token-to-Chunk Memory Retrieval

    For each current token, LONGMEM uses the Memory Retrieval and Fusion module to perform token-to-chunk retrieval over the Cached Memory Bank, selecting the top-K key–value chunks by dot-product similarity (see the retrieval sketch after this list).

  4. 04

    Memory Fusion

    Within the memory-augmented layer, LONGMEM applies joint attention with a trainable gating vector to fuse local self-attention outputs with the retrieved memory, then predicts the next tokens via the shared language-model head (see the fusion sketch after this list).
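
A minimal sketch of the token-to-chunk retrieval in step 3. The mean-pooled chunk keys, the toy chunk_size and top_k values, and the exhaustive dot-product scan are simplifying assumptions; a full implementation would use an approximate nearest-neighbor index over the 65k-token bank rather than this brute-force search.

```python
import torch


def retrieve_chunks(query: torch.Tensor,
                    mem_keys: torch.Tensor,
                    mem_values: torch.Tensor,
                    chunk_size: int = 4,
                    top_k: int = 2):
    """query: (d,); mem_keys, mem_values: (n_tokens, d)."""
    n_chunks = mem_keys.shape[0] // chunk_size
    d = query.shape[0]
    # Represent each chunk by the mean of its token-level keys, so the
    # search runs over chunks rather than individual tokens.
    chunk_keys = (mem_keys[: n_chunks * chunk_size]
                  .view(n_chunks, chunk_size, d)
                  .mean(dim=1))
    # Dot-product relevance between the current token's query and each chunk.
    scores = chunk_keys @ query                        # (n_chunks,)
    top = scores.topk(min(top_k, n_chunks)).indices    # (top_k,)
    # Expand the winning chunk indices back to token indices and gather
    # the corresponding key-value pairs for attention.
    token_idx = (top.unsqueeze(1) * chunk_size
                 + torch.arange(chunk_size)).flatten()
    return mem_keys[token_idx], mem_values[token_idx]
```

Calling retrieve_chunks(torch.randn(64), torch.randn(100, 64), torch.randn(100, 64)) returns the keys and values of the two most relevant 4-token chunks; chunk-level retrieval keeps the index small while token-level attention over the retrieved chunks stays fine-grained.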
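
And a sketch of the gated fusion in step 4, under similar caveats: the single head, single query token, and sigmoid gate are one plausible reading of the paper's joint attention with a trainable gating vector, not a faithful reimplementation.

```python
import torch
import torch.nn.functional as F


def fused_attention(q: torch.Tensor,
                    local_k: torch.Tensor, local_v: torch.Tensor,
                    mem_k: torch.Tensor, mem_v: torch.Tensor,
                    gate_param: torch.Tensor) -> torch.Tensor:
    """q: (d,); local_k/v: (t, d); mem_k/v: (m, d); gate_param: (d,)."""
    d = q.shape[0]
    # Ordinary scaled dot-product attention over the local context window.
    local_out = F.softmax(local_k @ q / d ** 0.5, dim=0) @ local_v
    # The same attention form applied to the retrieved memory tokens.
    mem_out = F.softmax(mem_k @ q / d ** 0.5, dim=0) @ mem_v
    # A learnable sigmoid gate decides, per hidden dimension, how much to
    # trust the local context versus the long-term memory.
    g = torch.sigmoid(gate_param)
    return g * local_out + (1.0 - g) * mem_out
```

Because the gate acts per hidden dimension, the model can learn to lean on memory for some feature channels while keeping others purely local.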

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Decoupled Long-Term Memory Architecture

    LONGMEM introduces a decoupled design with a frozen backbone LLM, Residual SideNet, and Cached Memory Bank so attention key–value memories can be cached and reused without staleness.

  • 02

    Residual SideNet

    LONGMEM designs a lightweight Residual SideNet with Cross-Network Residual Connections, initialized from GPT-2* layers, enabling efficient memory-augmented adaptation without catastrophic forgetting.

  • 03

    Memory Retrieval and Fusion

    LONGMEM proposes token-to-chunk Memory Retrieval and Fusion with a joint-attention mechanism, supporting up to 65k tokens in memory and reducing perplexity by up to 1.62 points on PG-22.

RESULTS

By the Numbers

PG-22 S1 PPL

21.29

0.48 lower than MemTRM on the 5K–10K split

PG-22 S2 PPL

23.01

0.55 lower than MemTRM on the 10K–100K split

ArXiv PPL

10.05

0.76 lower than MemTRM with 65k in-memory tokens

ChapterBreak AO3 Acc

40.5%

+12.2 points over MemTRM and +12.5 over GPT-3

On PG-22 and ArXiv long-text language modeling, LONGMEM reduces perplexity compared to GPT-2* and MemTRM while using 1k in-context and 65k in-memory tokens. On ChapterBreak AO3, LONGMEM reaches 40.5% suffix identification accuracy versus 28.3% for MemTRM, showing that LONGMEM can actually leverage book-level context stored in memory.

BENCHMARK

ChapterBreak AO3 Suffix Identification Accuracy

Suffix identification accuracy on ChapterBreak AO3 with 4k, 6k, and 8k prefix contexts.

BENCHMARK

Average NLU In-Context Learning Accuracy (20-shot)

Average accuracy on SST-2, MR, Subj, SST-5, and MPQA with 20-shot in-context learning and 2,000 in-memory demonstrations.

KEY INSIGHT

The Counterintuitive Finding

Despite using the same 1k in-context window as GPT-2*, LONGMEM reaches 40.5% accuracy on ChapterBreak AO3, compared to 18.4% for GPT-2*.

This is surprising because conventional wisdom says a longer context window or a much larger model is required, yet LONGMEM beats GPT-3's 28% with only 558M parameters.

WHY IT MATTERS

What this unlocks for the field

LONGMEM unlocks practical long-term memory for LLMs, enabling 65k-token memory banks and many-shot in-context learning with thousands of cached demonstrations.

Builders can now adapt existing frozen LLMs into long-context systems that read entire books or training sets from memory, without retraining huge transformers or extending attention windows.

Related papers

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories on Utility, Confidence, Novelty, Recency, and Type Prior, combined by a learned linear admission policy (Algorithm 1, A-MAC Memory Admission). On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al.

arXiv · 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
