A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Author: Okan Bursa

2026

TL;DR

Adaptive RAG Memory (ARM) uses selective remembrance with multiplicative decay to build a dynamic RAG memory that reaches NDCG@5 = 0.9401 with Recall@5 = 1.000 using only 22M parameters.



THE PROBLEM

Static RAG memories never forget or adapt, even though a dynamic memory can match larger static baselines (NDCG@5 ≈ 0.940)

Conventional RAG systems rely on a static vector index that never decays or consolidates, even as domains and user behavior change. This forces expensive re-indexing and overweights stale content instead of emphasizing frequently used knowledge.

Adaptive RAG Memory (ARM) targets retrieval-augmented generation pipelines where static memories limit adaptability and efficiency. The consequence is wasted memory, higher latency, and no principled way to retain high-value facts while letting obsolete material fade.

HOW IT WORKS

Adaptive RAG Memory with Selective Remembrance and Decay

Adaptive RAG Memory (ARM) introduces a Dynamic Embedding Layer and Remembrance Engine that maintain per-item counts, timestamps, and remembered flags on top of a dense Retriever and Generator. ARM applies Algorithm 1 Selective Remembrance and Decay using a configurable remembrance threshold, grace period, and decay rate.

You can think of ARM like a digital hippocampus: frequently retrieved passages are consolidated into long-term memory, while rarely accessed embeddings gradually weaken, similar to synaptic decay. The Dynamic Embedding Layer acts as working memory, and the Remembrance Engine decides what gets promoted or forgotten.

This selective remembrance and multiplicative decay let ARM reshape the retrieval store itself, something a plain context window or static index cannot do. ARM keeps capacity focused on high-utility content without retraining the generator or rebuilding the index.
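The per-item state the Dynamic Embedding Layer maintains can be sketched as a small record. The field names below (embedding, count, last_used, remembered) are illustrative stand-ins for the paper's Eᵢ, cᵢ, τᵢ, and remembered flag; this is a minimal sketch, not the paper's implementation.

```python
import time
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryItem:
    """Per-item state tracked by the Dynamic Embedding Layer.

    Illustrative field names for the paper's E_i, c_i, tau_i,
    and remembered flag.
    """
    embedding: np.ndarray                                # E_i: dense passage embedding
    count: int = 0                                       # c_i: retrieval usage count
    last_used: float = field(default_factory=time.time)  # tau_i: last-access timestamp
    remembered: bool = False                             # set once c_i reaches the threshold
```

The Remembrance Engine then reads and updates these records after every query, so consolidation and decay operate on metadata rather than on the encoder weights.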

DIAGRAM

Per Query Selective Remembrance and Decay Flow

This diagram shows how Adaptive RAG Memory (ARM) applies Algorithm 1 Selective Remembrance and Decay to update usage statistics and decay stale embeddings after each query.

DIAGRAM

Evaluation and Ablation Pipeline for ARM

This diagram shows how Adaptive RAG Memory (ARM) is evaluated across NQ, HotpotQA, and a domain corpus with decay hyperparameter ablations and RAG system comparisons.

PROCESS

How Adaptive RAG Memory Handles a Query Lifecycle

  1. Encode query

    Adaptive RAG Memory (ARM) encodes the user query into a dense vector and feeds it to the Retriever backed by FAISS for approximate nearest neighbor search.

  2. Retrieve top-k items

    ARM uses the Retriever to obtain the top-k passages from the Dynamic Embedding Layer and passes them to the Generator for answer conditioning.

  3. Generate answer

    The Generator (e.g., Llama 3.1 or GPT-4o) consumes the retrieved passages and produces an answer while ARM keeps the retrieval store generator-agnostic.

  4. Apply remembrance and decay

    ARM runs Algorithm 1 in the Remembrance Engine, updating counts, setting remembered flags when cᵢ ≥ θ, and applying multiplicative decay Eⱼ ← α·Eⱼ after the grace period γ.

KEY CONTRIBUTIONS

Key Contributions

  • Dynamic Embedding Layer for online index adaptation

    Adaptive RAG Memory (ARM) introduces a Dynamic Embedding Layer that tracks Eᵢ, cᵢ, τᵢ, and remembered flags per item, enabling online index adaptation with ∼22M parameters in the embedding layer.

  • Selective remembrance and decay policy

    ARM implements a Remembrance Engine with a usage-governed remembrance threshold θ, grace period γ, and decay rate α, providing interpretable control over consolidation and forgetting in non-parametric memory.

  • End-to-end comparison of static vs. dynamic RAG

    ARM is evaluated in full RAG systems, showing Llama 3.1 with static RAG reaching 67.2% key-term coverage and GPT-4o with dynamic selective retrieval achieving an 8.2 s average response time with 58.7% coverage.

RESULTS

By the Numbers

  • NDCG@5: 0.9401 (−0.0223 vs. gte-small at 0.9624)

  • Precision@5: 0.5333 (same as gte-small and all-MiniLM-L6-v2)

  • Recall@5: 1.0000 (perfect recall with <25M parameters on the lightweight benchmark)

  • Parameters: 22M (11M fewer than gte-small and bge-small-en-v1.5 at 33M)

These results come from the lightweight retrieval benchmark summarized in Table II, which reports NDCG@5, Precision@5, Recall@5, and parameter counts. The main result shows that Adaptive RAG Memory (ARM) matches or closely trails larger dense retrievers while achieving the best efficiency among ultra-efficient (<25M parameter) models.


BENCHMARK

Retrieval Performance Summary (Table II)

NDCG@5 scores for ARM and compact dense retrievers on the lightweight benchmark.

KEY INSIGHT

The Counterintuitive Finding

Adaptive RAG Memory (ARM) achieves Recall@5 = 1.0000 and NDCG@5 = 0.9401 with only 22M embedding parameters, matching all-MiniLM-L6-v2. At the same time, ARM maintains the best efficiency among ultra-efficient models, with NDCG per parameter higher than the 33M-parameter gte-small and bge-small-en-v1.5.

This is surprising because one might expect dynamic decay and consolidation to hurt retrieval quality compared to larger static encoders. Instead, ARM shows that a usage-aligned dynamic memory can keep perfect recall and competitive NDCG while using fewer parameters and self-regularizing memory growth.
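The "NDCG per parameter" claim can be checked with simple arithmetic on the numbers quoted above (NDCG@5 and embedding-parameter counts in millions). This is a worked check, not an evaluation script.

```python
def ndcg_per_million(ndcg, params_m):
    """NDCG@5 divided by embedding parameters, in millions."""
    return ndcg / params_m


# Figures from the results section above.
arm = ndcg_per_million(0.9401, 22)   # ARM: 22M parameters
gte = ndcg_per_million(0.9624, 33)   # gte-small: 33M parameters
```

ARM's ratio (≈0.0427 per million parameters) exceeds gte-small's (≈0.0292), so the slightly lower absolute NDCG@5 still comes with markedly better parameter efficiency.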

WHY IT MATTERS

What this unlocks for the field

Adaptive RAG Memory (ARM) unlocks dynamic, neuroscience-inspired RAG memories where frequently retrieved knowledge is consolidated and stale content decays without retraining the generator. Builders can now deploy RAG systems that adapt their retrieval store online, trading off quality, latency, and memory footprint via interpretable hyperparameters like θ, γ, and α.

This makes continual adaptation practical in domains with evolving corpora, enabling ARM-based systems to maintain compact, high-utility memories instead of endlessly growing static indices. Developers can experiment with conservative, balanced, or aggressive profiles to match safety-critical, production, or exploratory settings.
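The conservative, balanced, and aggressive profiles mentioned above could be expressed as named presets over (θ, γ, α). The profile names follow the text; the specific values below are assumptions for illustration, not the paper's recommended settings.

```python
# Illustrative hyperparameter profiles for ARM's remembrance threshold (theta),
# grace period (gamma, in seconds), and decay rate (alpha). Values are
# assumptions chosen to show the intended ordering, not the paper's settings.
PROFILES = {
    "conservative": {"theta": 10, "gamma": 7 * 24 * 3600, "alpha": 0.99},  # forget slowly
    "balanced":     {"theta": 5,  "gamma": 24 * 3600,     "alpha": 0.95},
    "aggressive":   {"theta": 3,  "gamma": 3600,          "alpha": 0.85},  # forget quickly
}


def profile(name):
    """Return the (theta, gamma, alpha) preset for a named profile."""
    return PROFILES[name]
```

A conservative profile suits safety-critical deployments (high threshold, long grace period, gentle decay), while an aggressive one suits exploratory settings where stale content should fade fast.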


Related papers

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
