Category

RAG

Research on retrieval-augmented generation (RAG) and non-parametric memory for language models.

33 papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.
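For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Recall@5 and NDCG@5 are commonly computed. It assumes linear gain and a log2 discount; the summary does not say which NDCG variant the paper uses, and the document ids and relevance labels here are made up for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k=5):
    """NDCG@k with linear gain; `relevance` maps doc id -> graded relevance."""
    def dcg(gains):
        # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query with two relevant documents, both ranked at the top.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevance = {"d1": 1.0, "d3": 1.0}
print(recall_at_k(ranked, set(relevance)))
print(ndcg_at_k(ranked, relevance))
```

A Recall@5 of 1.000 only says that every relevant document landed somewhere in the top five; NDCG@5 additionally rewards placing them near the top, which is why the two numbers can differ.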

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.

RAG · Benchmark · Long-Term Memory

MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Shu Wang, Edwin Yu et al.

· 2026

MemMachine combines Short-term memory, Long-term memory, Profile memory, and the Retrieval Agent to store raw conversational episodes and retrieve clustered context around nucleus matches. On LoCoMo, MemMachine scores 0.9169 with gpt-4.1-mini while using about 80% fewer input tokens than Mem0, and reaches 93.0% on LongMemEvalS with GPT-5-mini.

RAG · Agent Memory · Long-Term Memory · Memory Architecture

Memory as Metabolism: A Design for Companion Knowledge Systems

Stefan Miteski

· 2026

Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Rather than benchmark gains, its main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.

Survey · RAG · Agent Memory

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du

· 2026

Memory for Autonomous LLM Agents decomposes agent memory into a POMDP-grounded write–manage–read loop, a three-dimensional taxonomy, and five mechanism families spanning context compression, retrieval stores, reflection, hierarchical virtual context, and policy-learned management. The survey synthesizes results like Voyager’s 15.3× tech-tree speedup and MemoryArena’s 80%→45% drop to show that memory architecture often matters more than backbone choice.

RAG · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova, Björn Ross, Alexandra Birch

· 2025

Evaluating Long-Term Memory for Long-Context Question Answering compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem memory components across semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 ranking of 1.83 for Llama 3.2-3B Instruct and 1.80 for GPT-4o mini while using around 1,000 tokens per query versus over 23,000 for Full Context.

Pick · RAG · Benchmark

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025 · 2025

HippoRAG 2 combines Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.

RAG · Benchmark · Memory Architecture

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang et al.

· 2025

HGMEM represents working memory as a hypergraph with Hypergraph-based Memory Storage, Adaptive Memory-based Evidence Retrieval, and Dynamic Memory Evolving to build high-order correlations across entities and facts. On Prelude long narrative understanding, HGMEM with GPT-4o achieves 73.81% accuracy compared to 72.22% for HippoRAG v2, while also reaching 69.74 comprehensiveness on Longbench generative sense-making QA.

RAG · Benchmark · Memory Architecture

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang et al.

· 2025

Learning from Supervision with Semantic and Episodic Memory combines a performance agent, critic agent, semantic memory, episodic memory, and memory retriever to turn label-grounded critiques into reusable supervision without parameter updates. On the Multi-Condition Ranking dataset with Mixtral 8x22B and o4-mini as critic, the approach reaches 85.6% accuracy, a 24.8-point gain over the EP_LABEL baseline at 60.8%.

RAG

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Marc Glocker, Peter Hönig et al.

· 2025

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics coordinates a routing agent, task planning agent, and knowledge base agent over RAG and ChromaDB to translate household commands into grounded robot actions. In three tabletop scenarios, the system with Qwen2.5-32B achieves 84.3% total lenient task planning accuracy versus 68.7% for Gemma2-27B and 61.1% for LLaMa3.1-8B.

RAG

MemInsight: Autonomous Memory Augmentation for LLM Agents

Rana Salama, Jason Cai et al.

· 2025

MemInsight augments agent memory using Attribute Mining, Annotation and Attribute Prioritization, and Memory Retrieval modules that generate and exploit structured attributes over past interactions. On the LoCoMo question answering benchmark, MemInsight with Claude-3-Sonnet priority augmentation achieves 60.5% Recall@5 versus 26.5% for DPR, a 34.0-point improvement.

RAG · Benchmark · Memory Architecture

Memory-Augmented Log Analysis with Phi-4-mini: Enhancing Threat Detection in Structured Security Logs

Anbi Guo, Mahfuza Farooque

· 2025

DM-RAG augments Phi-4-mini with a Short-Term Memory (STM) buffer, Long-Term Memory (LTM) FAISS store, Bayesian fusion, and a logistic regression confidence model for structured log analysis. On UNSW-NB15, DM-RAG reaches 98.70% recall and 69.59% F1, beating the Phi-4 + RAG (MITRE) baseline in F1 by 17.89 points.

RAG · Benchmark · Memory Architecture

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao, Jiarui Wang et al.

· 2025

Memory Decoder combines a Pre-training stage that aligns with kNN-LM distributions and an Inference interpolation mechanism that mixes Memory Decoder and base LLM outputs without changing base parameters. On Wikitext-103, Memory Decoder with 124M parameters reaches 13.36 perplexity on GPT2-small versus 14.76 for DAPT, and on specialized domains a single 0.5B Memory Decoder reduces average perplexity from 14.88 to 4.05 on Qwen2-0.5B.
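The inference-time mixing described above can be illustrated with a short sketch in the style of kNN-LM interpolation. It assumes a simple linear blend of two next-token distributions; the weight `lam`, the function name, and the toy three-token vocabulary are illustrative assumptions, not values or names taken from the paper.

```python
def interpolate(p_base, p_memory, lam=0.5):
    """Blend two next-token distributions: lam * p_memory + (1 - lam) * p_base.

    `lam` is a hypothetical mixing weight; the base model's parameters are
    untouched because only its output distribution is used.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(p_base) != len(p_memory):
        raise ValueError("distributions must share a vocabulary")
    return [lam * pm + (1.0 - lam) * pb for pb, pm in zip(p_base, p_memory)]

# Toy vocabulary of three tokens; both inputs sum to 1.
p_base = [0.7, 0.2, 0.1]     # base LLM prefers token 0
p_memory = [0.1, 0.8, 0.1]   # memory component prefers token 1
mixed = interpolate(p_base, p_memory, lam=0.5)
print(mixed)
```

Because both inputs are valid distributions and the weights sum to one, the blend is itself a valid distribution, so no renormalization step is needed.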

RAG · Benchmark · Agent Memory

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu et al.

· 2025

Memory in the Age of AI Agents formalizes agent memory with Memory Formation, Memory Evolution, and Memory Retrieval operators, and classifies memories into token-level, parametric, and latent forms plus factual, experiential, and working functions. Its main result is a unified Forms–Functions–Dynamics framework that consolidates fragmented LLM agent memory work, benchmarks, and open-source frameworks into a coherent taxonomy.

Benchmark · RAG

MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Qian Wang, Zahra Yousefijamarani et al.

· 2025

MEPIC extends vLLM with a Chunk Cache Coordinator, Chunk Matcher, Hybrid KV Manager, Chunk LRU Manager, and Chunk Processor to manage canonical, page-aligned, position-independent KV chunks in HBM. On long-context workloads, MEPIC reduces HBM usage by up to 5.21× and lowers latency by up to 11.48% compared to CacheBlend on Mistral-7B-Instruct-v0.3.

RAG

MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

Andreas Ottem

· 2025

MeVe decomposes retrieval into Initial Retrieval, Relevance Verification, Fallback Retrieval, Context Prioritization, and Token Budgeting to tightly control what enters the LLM context. On a Wikipedia subset and HotpotQA, MeVe reduces average context from 188.8 to 79.8 tokens and from 308.6 to 78.5 tokens respectively compared to Standard RAG while keeping retrieval time comparable.
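A rough sketch of how verification, prioritization, and token budgeting could fit together is shown below. It is not MeVe's implementation: the relevance threshold, the whitespace token count, the greedy packing rule, and all names are assumptions, and the fallback retrieval stage is omitted.

```python
def build_context(candidates, min_score=0.5, token_budget=80):
    """Select passages for the LLM context under a token budget.

    candidates: list of (text, relevance_score) pairs from an initial retriever.
    Returns the selected passages and the number of tokens they consume.
    """
    # Relevance verification: drop passages below a hypothetical threshold.
    verified = [(text, score) for text, score in candidates if score >= min_score]
    # Context prioritization: most relevant passages first.
    prioritized = sorted(verified, key=lambda item: item[1], reverse=True)
    # Token budgeting: greedily add passages that still fit.
    selected, used = [], 0
    for text, _score in prioritized:
        cost = len(text.split())  # crude whitespace token count (assumption)
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected, used

candidates = [
    ("Paris is the capital of France.", 0.92),
    ("France borders Spain and Italy.", 0.61),
    ("Bananas are rich in potassium.", 0.12),
]
context, tokens_used = build_context(candidates, token_budget=8)
print(context, tokens_used)
```

With a budget of 8 tokens, only the highest-scoring passage fits; the second verified passage is skipped for size and the third never passes verification.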

RAG · Long-Term Memory · Memory Architecture

Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

Aneesh Jonelagadda, Christina Hahn et al.

· 2025

Mnemosyne combines a Commitment pipeline with substance and redundancy filters, a probabilistic Recall traversal over a graph-structured store, asynchronous Core Summary updates, and a Pruning module to manage long-term memory on edge devices. On the LoCoMo benchmark, Mnemosyne reaches a 60.42% temporal reasoning J-score and a 54.55% overall J-score, compared to 51.55% temporal reasoning and 62.74% overall for Memory-R1. In human evaluations it achieves a 65.8% win rate, versus 31.07% for a naive RAG baseline.

RAG · Benchmark · Agent Memory · Memory Architecture

Semantic Anchoring in Agentic Memory: Leveraging Linguistic Structures for Persistent Conversational Context

Maitreyi Chatterjee, Devansh Agarwal

· 2025

Semantic Anchoring enriches conversational memory with structured memory representation tuples, a hybrid memory store that keeps both dense and symbolic indexes, and a retrieval scoring method. On MultiWOZ-Long, Semantic Anchoring reaches 83.5% Factual Recall and 80.8% Discourse Coherence, beating Entity-RAG by 7.6 and 8.6 points respectively.

RAG · Memory Architecture

TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Chunliang Chen, Ming Guan et al.

· 2025

TeleMem converts interactions into unified semantic nodes via the representation layer, organizes them in a memory graph with Insert and ReInsert, and reads them using closure-based retrieval and a ReAct-style multimodal agent. On ZH-4O, TeleMem reaches 86.33% QA Accuracy, beating the Mem0 baseline at 70.20% and the RAG baseline at 62.45%.

RAG

Understanding Users' Privacy Perceptions Towards LLM's RAG-based Memory

Shuning Zhang, Rongjun Ma et al.

· 2025

Understanding Users' Privacy Perceptions Towards LLM's RAG-based Memory analyzes users' mental models, privacy calculus, and expectations around RAG-based memory across generation, management, usage, and updating. The study finds that users demand explicit consent, fine-grained editing and deletion, and visibility into inferred information before they will trust RAG-based memory systems.

RAG

Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models

Mehrdad Farahani, Richard Johansson

· 2024

Deciphering the Interplay of Parametric and Non-parametric Memory applies causal mediation analysis and Path Specific Effects (PSE) across two experiments inside ATLAS to trace how parametric and non-parametric memories compete token-by-token. The paper reports a strong shift toward counterfactual answers in altered contexts, with a t-test p-value of 1.60e-4 and Cohen’s d of -0.9851 for non-parametric versus parametric behavior.

RAG

Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Quanting Xie, So Yeon Min et al.

· 2024

Embodied-RAG builds a multimodal Topological Map and a hierarchical Semantic Forest and then runs Top-down Retrieval with LLM-based selection and hybrid re-ranking to drive Generation of waypoints and explanations. On the E-multimodal Embodied-Experiences dataset, Embodied-RAG reaches P(Q|A)=0.67 for implicit queries (Q only), compared to 0.13 for LightRAG, while building graph memory 9.76× faster than LightRAG.

RAG

"Ghost of the past": identifying and resolving privacy leakage from LLM's memory through proactive user interaction

Shuning Zhang, Lyumanshan Ye et al.

· 2024

MemoAnalyzer analyzes past inputs and long-term memories using prompt-based privacy inference, confidence and sensitivity visualization, and source tracking with an editing proxy. In a 5-day study on work, life, and academic tasks, MemoAnalyzer reduced total inferred private information by 22.3% compared to GPT memory while keeping completion time comparable to GPT and Manual baselines.

RAG

Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Thomas Schmied, Fabian Paischer et al.

· 2024

Retrieval-Augmented Decision Transformer (RA-DT) combines a vector index, embedding model g(·), maximum inner product search, experience reweighting, and cross-attention layers to retrieve and fuse relevant sub-trajectories into a Decision Transformer policy. On Dark-Room 10×10, RA-DT reaches near-optimal average reward over 40 in-context trials while using a 50-step context window, whereas baselines like Algorithm Distillation require entire episodes of up to 100 steps.
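The retrieval step named above, maximum inner product search, can be sketched with an exhaustive scan. This is only an illustration: a real vector index would typically use an approximate search structure, the embeddings here are hand-written stand-ins for the output of g(·), and experience reweighting and cross-attention fusion are not shown.

```python
def max_inner_product_search(query, keys, top_k=2):
    """Return (index, score) for the top_k keys with the largest dot product."""
    scored = [
        (sum(q * k for q, k in zip(query, key)), idx)
        for idx, key in enumerate(keys)
    ]
    # Highest inner product first; the sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Hypothetical 2-d embeddings of three stored sub-trajectories.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.5]
results = max_inner_product_search(query, keys, top_k=2)
print(results)
```

Unlike cosine similarity, the inner product is sensitive to vector norms, so a longer key can outrank a better-aligned but shorter one.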

RAG

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Nick Alonso, Tomás Figliolia et al.

· 2024

Toward Conversational Agents with Context and Time Sensitive Long-term Memory integrates a Tabular Chat Database, Classifying Query Type, Chain-of-Tables for Meta-Data Retrieval, and Combining Meta-Data and Semantic Retrieval to handle time-sensitive and ambiguous conversational queries. On the LoCoMo-derived temporal benchmark, the system achieves 90.32 average recall versus 31.93 for the best Semantic w MetaD baseline.