Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei et al.

2025

TL;DR

Memory Decoder is a small pretrained transformer memory that mimics kNN-LM retrieval distributions, cutting average domain perplexity by 6.17 points across Qwen and Llama models.



THE PROBLEM

Domain Adaptation Without Catastrophic Forgetting or Retrieval Latency

Domain Adaptive Pretraining on Wikitext-103 reduces GPT2-small perplexity from 24.89 to 14.76, but it requires costly full-parameter training and causes catastrophic forgetting on general tasks such as Yahoo.

Retrieval-Augmented Generation with kNN-LM cuts GPT2-small perplexity to 15.62, yet it needs datastores, such as the one built from Wikitext-103, that reach nearly 500GB and incur 2.17× inference latency, making deployment impractical.

HOW IT WORKS

Memory Decoder: Distribution-Aligned Plug-and-Play Memory

Memory Decoder's pre-training pipeline builds a key-value datastore over the domain corpus, caches the kNN distribution for each context, and trains a small transformer decoder to reproduce those distributions using a hybrid objective that combines a Distribution Alignment Loss with a standard Language Modeling loss.

You can think of Memory Decoder as compressing a huge external card catalog of kNN neighbors into a compact, always-attached RAM module that sits beside the frozen LLM.

At inference, Memory Decoder's output distribution is interpolated with the base LLM's, injecting domain knowledge that a plain context window or standard fine-tuning cannot provide without extra latency or parameter changes.
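As a concrete illustration, here is a minimal Python sketch of the inference-time interpolation, assuming both models share a vocabulary. The Memory Decoder checkpoint path and the value of alpha are placeholder assumptions; only the interpolation formula itself comes from the paper.

```python
# Minimal sketch of Memory Decoder's inference-time interpolation.
# Model ids, the memory checkpoint path, and alpha are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")               # shared vocabulary
base_lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")          # frozen base PLM
mem_dec = AutoModelForCausalLM.from_pretrained("path/to/memory-decoder")   # hypothetical checkpoint

alpha = 0.6  # interpolation weight for the memory distribution (assumed, tuned per domain)

@torch.no_grad()
def next_token_distribution(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Both models see the same context in parallel; no retrieval happens here.
    p_plm = F.softmax(base_lm(ids).logits[:, -1, :], dim=-1)
    p_mem = F.softmax(mem_dec(ids).logits[:, -1, :], dim=-1)
    # p_Mem-PLM = alpha * p_Mem + (1 - alpha) * p_PLM
    return alpha * p_mem + (1 - alpha) * p_plm
```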

DIAGRAM

Inference Flow with Memory Decoder and Base LLM

This diagram shows how Memory Decoder and a frozen pretrained language model run in parallel at inference and interpolate their output distributions.

DIAGRAM

Training and Evaluation Pipeline for Memory Decoder

This diagram shows how Memory Decoder is trained from kNN-LM distributions and then evaluated across Wikitext-103 and domain corpora.

PROCESS

How Memory Decoder Handles Domain Adaptation

  1. Data Construction

    Memory Decoder builds a key-value datastore from the domain corpus D_train using hidden representations ϕ(x_i) as keys, and caches the kNN distribution p_kNN(·|x_i) for each training context (see the first sketch after this list).

  2. Pre-training Objective

    Memory Decoder optimizes a hybrid loss that combines the Distribution Alignment Loss L_KL with the Language Modeling objective L_LM, using a balance parameter β set to 0.5 (see the second sketch after this list).

  3. Inference

    During inference, Memory Decoder processes the same context as the base pretrained language model M_PLM in parallel and produces p_Mem(y_t|x) without any retrieval or datastore access.

  4. Interpolation

    Memory Decoder interpolates the two distributions as p_Mem-PLM(y_t|x) = α · p_Mem(y_t|x) + (1 − α) · p_PLM(y_t|x), enabling plug-and-play domain adaptation across the Qwen and Llama families.
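The first sketch below illustrates step 1, assuming the base LM's hidden states ϕ(x_i) and gold next tokens y_i over the domain corpus have already been collected. The function name, the use of L2 distance, k, and the softmax temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of datastore construction and kNN target caching.
# `hidden_states` [N, d] are context representations phi(x_i); `next_tokens` [N] are y_i.
import torch
import torch.nn.functional as F

def cache_knn_targets(hidden_states, next_tokens, vocab_size, k=1024, temperature=1.0):
    keys, values = hidden_states, next_tokens          # the key-value datastore
    targets = []
    for query in hidden_states:                        # one query per training context
        dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # L2 distance to all keys
        knn_dist, knn_idx = dists.topk(k, largest=False)           # k nearest neighbours
        weights = F.softmax(-knn_dist / temperature, dim=-1)       # closer keys weigh more
        p_knn = torch.zeros(vocab_size)
        p_knn.scatter_add_(0, values[knn_idx], weights)            # aggregate weight per token
        targets.append(p_knn)                                      # cached p_kNN(.|x_i)
    # A real pipeline would exclude each query's own entry and use an ANN index for speed.
    return torch.stack(targets)
```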
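The second sketch covers step 2's hybrid objective. The weighted-sum form with β = 0.5 follows the description above, but the variable names and reduction choices are assumptions.

```python
# Minimal sketch of the hybrid training loss: KL divergence to the cached kNN
# distribution (Distribution Alignment Loss) plus next-token cross-entropy.
import torch.nn.functional as F

def memory_decoder_loss(logits, p_knn, gold, beta=0.5):
    log_p_mem = F.log_softmax(logits, dim=-1)                   # Memory Decoder prediction
    l_kl = F.kl_div(log_p_mem, p_knn, reduction="batchmean")    # Distribution Alignment Loss L_KL
    l_lm = F.cross_entropy(logits, gold)                        # Language Modeling objective L_LM
    return beta * l_kl + (1.0 - beta) * l_lm                    # balance parameter beta = 0.5
```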

KEY CONTRIBUTIONS

Key Contributions

  • Memory Decoder

    Memory Decoder introduces a plug-and-play pretrained memory that adapts large language models to domains like biomedicine, finance, and law without modifying base parameters, reducing average perplexity by 6.17 points.

  • Parametric Retriever Replacement

    Memory Decoder replaces traditional non-parametric retrievers by training a compact transformer decoder with a Distribution Alignment Loss to mimic kNN distributions, avoiding both datastore storage and kNN search.

  • Cross-Model and Cross-Vocabulary Adaptation

    Memory Decoder demonstrates generalizability: a single 0.5B memory enhances Qwen2 and Qwen2.5 models from 0.5B to 72B, and a Qwen-trained memory adapts to the Llama3 family with only 10 percent additional training.

RESULTS

By the Numbers

Perplexity, GPT2-small: 13.36 (-1.40 vs DAPT on Wikitext-103)

Perplexity, Qwen2-0.5B average: 4.05 (-10.83 vs base 14.88 across biomedicine, finance, and law)

Latency overhead: 1.28× (0.89× lower than kNN-LM's 2.17× on Qwen2.5-1.5B)

Downstream average score: 69.79 (+2.34 over base 67.45 on nine NLP tasks)

On Wikitext-103 and specialized corpora like MIMIC-III, financial news, and Asylex, these metrics show that Memory Decoder matches or beats DAPT and kNN-LM while keeping base parameters frozen and latency close to the original LLM.


BENCHMARK

Perplexity comparison of domain adaptation methods across GPT2 model sizes on Wikitext-103

Perplexity on Wikitext-103 for GPT2-small with different domain adaptation methods.

BENCHMARK

Cross-model adaptation results across three specialized domains

Average perplexity across biomedicine, finance, and law for Qwen2-0.5B with different adapters.

KEY INSIGHT

The Counterintuitive Finding

Even with only 124M parameters, Memory Decoder on GPT2-medium reaches 12.25 perplexity, beating full DAPT at 12.78 without touching base weights.

This is surprising because conventional wisdom says full-parameter Domain Adaptive Pretraining should dominate any small side module, yet Memory Decoder's kNN-aligned training objective yields a 0.53 perplexity edge while remaining plug-and-play.

WHY IT MATTERS

What this unlocks for the field

Memory Decoder makes it possible to train one domain-specific memory once and reuse it across entire model families like Qwen2.5 and Llama3 without retraining each base model.

Builders can now ship domain-adapted LLMs that keep general abilities, avoid catastrophic forgetting, and sidestep massive kNN datastores, enabling practical deployment of specialized assistants in biomedicine, finance, and law.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
