MemLong: Memory-Augmented Retrieval for Long Text Modeling

Authors: Weijie Liu, Zecheng Tang, Juntao Li, et al.

2024

TL;DR

MemLong uses a non-trainable chunk-level memory bank plus retrieval causal attention to extend OpenLLaMA-3B from 4k to 80k tokens while improving long-context perplexity.



THE PROBLEM

Long-context LLMs break under quadratic attention and KV cache growth

MemLong targets two bottlenecks: vanilla attention's quadratic time and space complexity in sequence length, and a KV cache whose memory footprint keeps growing as context lengthens.

This makes long-document summarization and multi-round dialogue brittle; beyond 4k tokens, OpenLLaMA-3B's perplexity climbs above 10^3 because of these context and memory limits.
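To make the scale of the problem concrete, here is a back-of-the-envelope sketch of KV cache growth, assuming OpenLLaMA-3B-like dimensions (26 layers, 32 heads, head dim 100, fp16). The exact figures are illustrative assumptions, not numbers from the paper:

```python
# Rough KV cache sizing under assumed OpenLLaMA-3B-like dimensions.
# All constants here are illustrative, not the paper's accounting.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 26,
                   n_heads: int = 32,
                   head_dim: int = 100,
                   bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, cached at every layer for every token.
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * seq_len

for tokens in (4_000, 16_000, 80_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens -> ~{gib:.1f} GiB of KV cache")
```

Under these assumptions, the cache alone reaches roughly 25 GiB at 80k tokens, outgrowing a 24 GB card before the model weights are even counted. That is the explosion MemLong is designed to sidestep.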

HOW IT WORKS

MemLong: Memory-Augmented Retrieval for Long Text Modeling

MemLong introduces a Ret-Mem module with a Retriever, Memory Bank, Retrieval Causal Attention, and Dynamic Memory Update on top of a partially frozen decoder-only LLM.

Think of the Memory Bank as disk storing compact chunk embeddings and K-V pairs, while Retrieval Causal Attention is fast RAM that selectively loads only the most relevant past chunks.

This architecture lets MemLong attend over semantically relevant 80k-token histories via retrieved K-V pairs, instead of pushing everything through a single fragile context window.
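As a rough illustration of that disk/RAM split, the following is a minimal sketch of a chunk-level memory bank with cosine-similarity retrieval and a simplified recency/frequency eviction. All names (MemoryBank, add_chunk, retrieve, update) are hypothetical, not the paper's API:

```python
import numpy as np

class MemoryBank:
    """Stores one compact embedding plus cached K-V pairs per chunk.
    Hypothetical sketch of the Ret-Mem memory described above."""

    def __init__(self):
        self.embeddings = []  # chunk-level retrieval keys (the "disk index")
        self.kv_pairs = []    # K-V tensors cached for one decoder layer
        self.counts = []      # retrieval counters for Dynamic Memory Update

    def add_chunk(self, emb: np.ndarray, kv) -> None:
        self.embeddings.append(emb / np.linalg.norm(emb))
        self.kv_pairs.append(kv)
        self.counts.append(0)

    def retrieve(self, query: np.ndarray, k: int = 4):
        """Cosine-similarity top-k over stored chunk embeddings."""
        q = query / np.linalg.norm(query)
        sims = np.stack(self.embeddings) @ q
        top = np.argsort(sims)[-k:][::-1]
        for i in top:
            self.counts[i] += 1
        return [self.kv_pairs[i] for i in top]

    def update(self, max_chunks: int) -> None:
        """Dynamic Memory Update, simplified: when the bank is full,
        evict the least-retrieved (coldest) chunks first."""
        while len(self.kv_pairs) > max_chunks:
            i = int(np.argmin(self.counts))
            for buf in (self.embeddings, self.kv_pairs, self.counts):
                buf.pop(i)
```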

DIAGRAM

MemLong Inference Flow for Long Inputs

This diagram shows how MemLong encodes long prefixes into memory and retrieves chunk-level K-V pairs during generation.

DIAGRAM

MemLong Training and Ablation Design

This diagram shows how MemLong trains upper layers with different retrieval layer configurations and evaluates on long-context benchmarks.

PROCESS

How MemLong Handles Long-Context Language Modeling

  1. Task Definition

     MemLong reformulates language modeling as p(x_i | R(t_i), x_{<i}): the Retriever R operates over chunked text T, and the Memory Bank stores the chunk-level context it indexes.

  2. Retriever and Dynamic Memory Management

     MemLong uses the Retriever to encode each text chunk, performs cosine-similarity search over the Memory Bank, and applies Dynamic Memory Update, evicting entries based on recency and retrieval frequency.

  3. Attention Reformulation

     MemLong introduces Retrieval Causal Attention in the upper layers, combining local causal scores S_a with retrieved memory scores S_m to produce fused hidden states (see the sketch after this list).

  4. Inference with MemLong

     MemLong splits long inputs into a prefix and a main segment, encodes the prefix into the Memory Bank, retrieves the top-k chunk K-V pairs, and runs Retrieval Causal Attention during generation.
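To make step 3 concrete, here is a minimal NumPy sketch of the fused attention: local causal scores S_a and retrieved memory scores S_m share one softmax over the concatenated K-V pairs. The joint-softmax fusion and all shapes are our reading of the method, not verbatim from the paper:

```python
import numpy as np

def retrieval_causal_attention(q, k_local, v_local, k_mem, v_mem):
    """q, k_local, v_local: (T, d) current-window tensors; k_mem, v_mem:
    (M, d) K-V pairs gathered from retrieved chunks. Returns (T, d)."""
    d = q.shape[-1]
    s_a = q @ k_local.T / np.sqrt(d)                 # local causal scores S_a
    s_a += np.triu(np.full_like(s_a, -np.inf), k=1)  # causal mask on the window
    s_m = q @ k_mem.T / np.sqrt(d)                   # retrieved memory scores S_m
    scores = np.concatenate([s_m, s_a], axis=-1)     # (T, M + T)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)       # one softmax over both sources
    return probs @ np.concatenate([v_mem, v_local], axis=0)

# Toy usage with random tensors: 8 query tokens, 16 retrieved positions.
T, M, d = 8, 16, 64
rng = np.random.default_rng(0)
out = retrieval_causal_attention(rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)),
                                 rng.normal(size=(M, d)),
                                 rng.normal(size=(M, d)))
print(out.shape)  # (8, 64)
```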

KEY CONTRIBUTIONS

Key Contributions

  • MemLong framework

    MemLong introduces a Ret-Mem module with a non-trainable Memory Bank and frozen Retriever that store chunk-level K-V pairs and embeddings without causing distribution shift in the cached representations.

  • Retrieval Causal Attention

    MemLong proposes Retrieval Causal Attention, which fuses local causal attention with chunk-level retrieved K-V pairs, enabling effective use of long-range semantics.

  • Extended context window

    MemLong extends the OpenLLaMA-3B context from 4k to 80k tokens on a single 3090 GPU by storing only one layer's K-V pairs and using Dynamic Memory Update (see the rough arithmetic after this list).
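For a rough sense of why caching a single layer matters, reuse the same assumed OpenLLaMA-3B-like dimensions as the earlier KV cache sketch (illustrative accounting, not the paper's):

```python
# One layer's K+V bytes per token at fp16, assuming 32 heads x head dim 100.
per_token = 2 * 32 * 100 * 2                   # illustrative, ~12.8 KB/token
one_layer  = per_token * 80_000 / 2**30        # ~1.0 GiB at 80k tokens
all_layers = per_token * 26 * 80_000 / 2**30   # ~24.8 GiB at 80k tokens
print(f"one layer: {one_layer:.1f} GiB vs all 26 layers: {all_layers:.1f} GiB")
```

Under these assumptions, storing one layer's K-V pairs keeps the 80k-token memory near a gigabyte, which is why the whole setup fits on a single 24 GB 3090.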

RESULTS

By the Numbers

  • PG19 PPL @ 16k: 9.73 (-0.64 vs MemLong-3B* without Memory, 10.37)

  • Proof-Pile PPL @ 16k: 2.99 (-0.19 vs MemLong-3B* without Memory, 3.18)

  • BookCorpus PPL @ 16k: 9.54 (-0.83 vs MemLong-3B* without Memory, 10.37 at the 2k proxy)

  • ICL average accuracy: 73.4% (+4.4 points over 20-shot OpenLLaMA, 69.0%, across 5 NLU tasks)

On PG19, Proof-Pile, BookCorpus, and Wikitext-103, MemLong with 32K Memory reduces sliding-window perplexity at lengths up to 16k tokens. On SST-2, MR, Subj, SST-5, and MPQA in-context learning, MemLong reaches 73.4% average accuracy with 20 in-context and 18 in-memory demonstrations, showing that MemLong effectively leverages stored examples.


BENCHMARK

Sliding window perplexity on PG19 at 16k tokens (3B models)

Perplexity on PG19 at 16k tokens (lower is better).

BENCHMARK

20-shot ICL average accuracy on 5 NLU tasks

Average accuracy across SST-2, MR, Subj, SST-5, MPQA with 20 in-context demonstrations.

KEY INSIGHT

The Counterintuitive Finding

MemLong's ablations show that retrieving into all upper layers (the RA + TA setting) hurts performance, giving PG19 perplexity of 11.40 at 4k tokens versus 9.83 for the selective RLP + TH setting.

This is surprising because one might expect more retrieval layers to always help; instead, injecting retrieved memory everywhere distracts the model from the local context.

WHY IT MATTERS

What this unlocks for the field

MemLong unlocks practical 80k-token context on a single 3090 GPU by storing only one layer’s K-V pairs plus compact embeddings.

Builders can now run long-document modeling, retrieval-augmented in-context learning, and multi-round dialogue over book-length histories without retraining full LLMs or blowing up KV cache memory.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
