Memory Research

AI Memory Research, Explained Simply

A curated library of the most important papers on memory in AI systems — from foundational RAG to agentic long-term memory. Each paper explained in plain language.

200 papers curated · 4 must-reads · 16 categories


Editor's picks

Must-read memory papers

Pick · RAG · Benchmark

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025 · 2025

HippoRAG 2 combines Offline Indexing, a schema-less Knowledge Graph, Dense-Sparse Integration, Deeper Contextualization, and Recognition Memory into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.

Pick · Long-Term Memory

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant et al.

arXiv 2025 · 2025

Mem0 incrementally processes conversations using the extraction phase, update phase, asynchronous summary generation module, tool call mechanism, and a vector database to build scalable long-term memory. On the LOCOMO benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for OpenAI and cuts p95 latency from 17.117s to 1.440s compared to the full-context baseline.

Pick · Memory Architecture

MemOS: A Memory OS for AI System

Zhiyu Li, Chenyang Xi et al.

arXiv 2025 · 2025

MemOS introduces MemCube, MemScheduler, MemOperator, and MemLifecycle to treat plaintext, activation, and parameter memories as first-class resources with unified APIs and governance. MemOS achieves state-of-the-art performance across PreFEval, PersonaMem, LongMemEval, and LoCoMo compared to MIRIX, Mem0, Zep, Memobase, MemU, and Supermemory, though exact benchmark scores are only summarized qualitatively in Figure 1.

Pick · Memory Architecture

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, Vahab Mirrokni

arXiv 2025 · 2025

Titans combines a Core short-term attention block, a deep Long-term Memory module, and Persistent Memory tokens, with three integration variants: Memory as a Context (MAC), Memory as a Gate (MAG), and Memory as a Layer (MAL). On language modeling and reasoning benchmarks, Titans (MAC) at 760M parameters achieves 52.51 average accuracy vs 51.49 for Gated DeltaNet-H2, while also solving BABILong tasks that defeat GPT-4.

Browse all

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG flags low-confidence steps via uncertainty-based routing, screens memory use on those steps with guarded acceptance and rollback, picks between rule and exemplar memory through bank selection, and prunes entries via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines while a compute-matched Retry baseline stays flat.

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al.

· 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories on Utility, Confidence, Novelty, Recency, and Type Prior, and combines these features with a learned linear admission policy (Algorithm 1, A-MAC Memory Admission). On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.
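
A linear admission policy of this kind can be sketched as a weighted sum over the five features followed by a threshold test. This is only an illustration: the weights, the threshold, and the names `Candidate`, `admission_score`, and `admit` are assumptions made for this sketch, not values or identifiers from the paper, where the weights are learned.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate memory with the five feature scores, each assumed to lie in [0, 1]."""
    utility: float
    confidence: float
    novelty: float
    recency: float
    type_prior: float


# Hypothetical weights and threshold; a real system would learn these from data.
WEIGHTS = {
    "utility": 0.3,
    "confidence": 0.2,
    "novelty": 0.2,
    "recency": 0.1,
    "type_prior": 0.2,
}
THRESHOLD = 0.5


def admission_score(candidate: Candidate, weights: dict = WEIGHTS) -> float:
    """Linear combination of the candidate's feature scores."""
    return sum(w * getattr(candidate, name) for name, w in weights.items())


def admit(candidate: Candidate, threshold: float = THRESHOLD) -> bool:
    """Admit the memory only if its linear score reaches the threshold."""
    return admission_score(candidate) >= threshold
```

With these illustrative weights, a candidate that scores high on utility and novelty clears the threshold even when its recency is low, which is the kind of trade-off a linear policy makes explicit.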

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al.

arXiv 2026 · 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

Cognitive Architecture · Agent Memory

Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents

Mustafa Arslan

· 2026

Aeon restructures LLM memory using the Atlas, Trace, Semantic Lookaside Buffer, Write Ahead Log, and Sidecar Blob Arena inside a zero-copy Core Shell kernel. Aeon achieves 4.70 ns INT8 dot products, 3.09 µs Atlas traversal at 100K nodes, 3.1× compression, and P99 read latency of 750 ns under 16-thread contention compared to FP32 and flat-scan baselines.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

· 2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.

Benchmark · Agent Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026 · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Benchmark · Agent Memory

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Yakov Pyotr Shkolnikov

· 2026

Agent Memory Below the Prompt stores each agent’s KV state in a block pool, quantizes it via a Q4 pipeline, reloads it with BatchQuantizedKVCache, and reuses it across phases using cross-phase context injection. On Gemma 3 12B, Agent Memory Below the Prompt reduces cold TTFT from 172,096 ms to 1,264 ms at 32K context (136×) compared to FP16 prefix caching baselines like vllm-mlx.

Cognitive Architecture · Agent Memory

Aligning Progress and Feasibility: A Neuro-Symbolic Dual Memory Framework for Long-Horizon LLM Agents

Bin Wen, Ruoxuan Zhang et al.

· 2026

Neuro-Symbolic Dual Memory Framework uses Progress Memory, Feasibility Memory, a Blueprint Planner Agent, a Progress Monitor Agent, and an Actor Agent to decouple semantic progress guidance from executable feasibility checks. On ALFWorld, Neuro-Symbolic Dual Memory Framework achieves 94.78% success rate versus 88.81% for AWM, and on WebShop reaches 0.7132 score versus 0.5998 for WALL-E 2.0.

Benchmark

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu et al.

· 2026

AlpsBench combines Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over 2,500 WildChat dialogues with human-verified structured memories. Its results show, for example, that Gemini-3 Flash scores 51.67 on Task 1 Extraction, while DeepSeek Reasoner reaches 0.9569 retrieval recall with 100 distractors.

Agent Memory · Long-Term Memory

AMA: Adaptive Memory via Multi-Agent Collaboration

Weiquan Huang, Zixuan Wang et al.

· 2026

AMA orchestrates four agents — the Constructor, Retriever, Judge, and Refresher — to build Raw Text, Fact Knowledge, and Episode Memory and route queries adaptively across these granularities. On the LoCoMo benchmark with GPT-4.1-mini, AMA achieves an overall LLM Score of 0.805 compared to Nemori’s 0.774, while reducing token consumption by approximately 80% relative to FullContext.

Benchmark · Long-Term Memory

A-MBER: Affective Memory Benchmark for Emotion Recognition

Deliang Wen, Ke Sun, Yu Wang

· 2026

A-MBER builds multi-session conversational scenarios via a staged pipeline of persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging. On A-MBER, a structured memory system reaches 0.69 judgment accuracy, 0.66 retrieval, and 0.65 explanation versus 0.34, 0.29, and 0.31 for a no-memory baseline.

Benchmark · Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

· 2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory · Long-Term Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Benchmark

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

· 2026

APEX-EM combines a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor to store and reuse full procedural-episodic traces without changing model weights. On KGQAGen-10k, APEX-EM reaches 89.6% accuracy (95.3% CSR) versus 41.3% without memory and surpasses the GPT-4o w/ SP oracle at 84.9%.

Benchmark

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Jianlong Lei, Shashikant Ilager

· 2026

ARKV dynamically combines Per-layer OQ ratio estimation, Token importance scoring, and Tri-state cache assignment to manage KV cache precision under a global memory budget. On LongBench, ARKV reaches 0.972 relative performance versus 0.979 for Origin while achieving 4× KV memory reduction and maintaining ~86% Tokens Per Second.

Benchmark

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Jie Zhou et al.

· 2026

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL) to jointly evolve memory and policy with retrieval as an explicit action. On SciWorld, PROACTAGENT reaches 73.50% SR versus 55.50% for GRPO+Reflexion, while cutting interaction rounds from 27.52 to 18.38.

Survey · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty

Zehao Lin, Chunyu Li, Kai Chen

· 2026

Mnemonic Sovereignty analyzes the long-term memory lifecycle (Write, Store, Retrieve, Execute, Share, and Forget/Rollback phases) against integrity, confidentiality, availability, and governance objectives for agent memory. Its lifecycle matrix shows that most of the ~70 surveyed works cluster on write and retrieve integrity, leaving store, availability, and governance primitives such as write-gate validation and post-deletion verification almost entirely unexplored.

Benchmark · Agent Memory

ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

Samuel Sameer Tanguturi

· 2026

ATANT v1.1 structurally analyzes seven benchmarks using the 7 v1.0 continuity properties, the 10 checkpoints, a property-coverage matrix, and the Kenotic v1.0 reference implementation. ATANT v1.1 reports 96% ATANT cumulative-scale versus 8.8% LOCOMO substring accuracy, showing that LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT/Letta, and RULER measure different properties from continuity.

Memory Architecture

Auxiliary-predicted Compress Memory Model (ApCM Model): A Neural Memory Storage Model Based on Invertible Compression and Learnable Prediction

Weinuo Ou

· 2026

Auxiliary-predicted Compress Memory Model (ApCM Model) combines an Invertible Dimensionality Reduction and Predictor (IDRP) module with a Memory Read-Write Controller, including a global Memory Bank, cosine-similarity read, and access-frequency write policy. ApCM Model achieves lower MSE (0.987171 vs 1.001440) than a Key-Value Memory Network while compressing memory from 1024 to 128 dimensions on random data.
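
The read and write policies named above can be illustrated with a small fixed-capacity bank: a read returns the stored vector most similar to the query by cosine similarity, and a write into a full bank overwrites the least-accessed slot. This is a minimal sketch under stated assumptions. The class `MemoryBank`, the overwrite-least-accessed rule, and the plain-list vectors are illustrative choices, not the paper's implementation, and the IDRP compression module is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Fixed-capacity bank with cosine-similarity reads and access-frequency writes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []  # stored vectors
        self.hits = []   # read count per slot

    def read(self, query):
        """Return the most similar stored vector and count the access."""
        if not self.slots:
            return None
        best = max(range(len(self.slots)), key=lambda i: cosine(query, self.slots[i]))
        self.hits[best] += 1
        return self.slots[best]

    def write(self, vector):
        """Append while there is room; otherwise overwrite the least-read slot."""
        if len(self.slots) < self.capacity:
            self.slots.append(list(vector))
            self.hits.append(0)
        else:
            victim = min(range(len(self.slots)), key=lambda i: self.hits[i])
            self.slots[victim] = list(vector)
            self.hits[victim] = 0
```

In this sketch, frequently read vectors survive new writes, while slots that are never read are the first to be replaced.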

Benchmark · Agent Memory · Long-Term Memory

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

· 2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Benchmark · Long-Term Memory

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim et al.

· 2026

BenchPreS combines Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to test context-aware preference selectivity in persistent-memory LLMs. Its results show that GPT-5.2 reaches an 87.33% Appropriate Application Rate while still having a 40.95% Misapplication Rate, compared to an 86.48% Misapplication Rate for Gemini 3 Pro.

Memory Architecture

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam, Witchayut Kornsuwannawit

· 2026

Beyond the Context Window compares Conversation Segmentation, Fact Extraction, Embedding and Storage, and Retrieval Mechanism in a Mem0-based memory system against long-context GPT-5-mini. On LongMemEval, Beyond the Context Window finds LC GPT-5-mini reaches 82.40% accuracy, 33.4 percentage points above the memory system baseline.

Benchmark

Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Yasong Fan

· 2026

Fan Duality Model (FDM) uses the Fan Operator, Local-Global Cache, Freeze-Scan Training, and Holographic Reference Beam Decoding to separate wave-like compression from particle-like associative recall. On WikiText-103, Fan Duality Model (FDM) reaches 64.9 perplexity with Freeze-Scan and 62.79 with holographic decoding, while achieving 0.966 MQAR accuracy compared to Transformer at 0.606.

Long-Term Memory

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Sahil Sen, Elias Lumer et al.

· 2026

Chronos decomposes dialogue into structured events via the Event Extraction pipeline, stores them in dual event calendar and turn calendar indexes, and uses Dynamic Prompting, Initial Retrieval, and the Chronos Agent for temporal-aware tool-calling. On LongMemEvalS, Chronos Low reaches 92.60% overall accuracy and Chronos High 95.60%, beating EmergenceMem Internal by 7.67 percentage points and Mastra’s OM by 3.02 points.

Benchmark · Agent Memory

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Mofasshara Rafique, Laurent Bindschaedler

· 2026

ClawVM manages agent state as typed pages via the SessionPageTable, RepresentationSelector, FaultObserver, WritebackJournal, and ClawVMEngine inside the agent harness. Across four OpenClaw-derived workloads and six token budgets, ClawVM cuts explicit faults from 67.8 (retrieval baseline) and 1.5 (Compaction-Hybrid) to 0.0 while adding median <50 μs policy-engine overhead per turn.

Benchmark

Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP

Martin Vogel, Falk Meyer-Eschenbach et al.

· 2026

Codebase-Memory parses repositories with a multi-pass pipeline using the Parse stage, Build stage, Serve stage, FunctionRegistry, Louvain communities, and MCP tool interface to build a persistent SQLite knowledge graph. On a 31-language benchmark, Codebase-Memory reaches 0.83 quality versus 0.92 for an Explorer Agent while using ten times fewer tokens and 2.1 times fewer tool calls.

Cognitive Architecture · Agent Memory

D-Mem: A Dual-Process Memory System for LLM Agents

Zhixing You, Jiachen Yuan, Jason Cai

· 2026

D-Mem combines Mem0∗, Quality Gating, and Full Deliberation into a dual-process memory system that incrementally stores vector memories and selectively scans raw history. On LoCoMo with GPT-4o-mini, D-Mem’s Quality Gating reaches 53.5 F1 versus the Mem0∗ baseline’s 51.2 F1, recovering 96.7% of the 55.3 F1 Full Deliberation performance with far fewer tokens.

Benchmark · Agent Memory · Long-Term Memory

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Benjamin Stern, Peter Nadel

· 2026

Drawing on Memory uses dual-trace memory encoding, an evidence scoring gate, and a three-state retrieval protocol to store paired fact and scene traces in Letta’s archival memory. On LongMemEval-S, Drawing on Memory reaches 73.7% accuracy versus 53.5% for the fact-only C7-control baseline, a +20.2 percentage point gain concentrated in temporal, update, and multi-session questions.

Benchmark · Agent Memory

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang et al.

· 2026

Experience Compression Spectrum organizes Level 0 Raw Trace, Level 1 Episodic Memory, Level 2 Procedural Skill, and Level 3 Declarative Rule into a unified scaffold-level compression framework. Experience Compression Spectrum’s mapping of 20+ systems and <1% cross-citation rate shows that all existing agents fix a single compression level and never perform adaptive cross-level compression.

Benchmark · Agent Memory · Memory Architecture

GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

Zhaofen Wu, Hanrong Zhang et al.

· 2026

GAM builds a Hierarchical Graph Memory Architecture with a global Topic Associative Network, local Event Progression Graphs, State-Based Memory Consolidation, and Graph-Guided Multi-Factor Retrieval to decouple encoding from consolidation. On LoCoMo with Qwen2.5-7B, GAM attains an Average F1 of 40.00 compared to Mem0’s 35.38, and on LongDialQA with Qwen2.5-7B, GAM reaches 12.55 F1 vs MemoryOS at 6.76.

Benchmark · Agent Memory · Long-Term Memory

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Chingkwun Lam, Jiaxin Li et al.

· 2026

SSGM interposes a Governance Middleware, Read Filtering Gate, Write Validation Gate, and a dual substrate of Mutable Active Graph plus Immutable Episodic Log between agents and memory. SSGM unifies evolving-memory systems into a four-dimensional failure taxonomy and proves that periodic reconciliation can bound semantic drift over infinite horizons.

Benchmark

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Yuri Kuratov, Matvey Kairov et al.

· 2026

GradMem combines a WRITE phase, a READ phase, a context encoder Eθ, a self-supervised WRITE objective Lwrite, and a meta-learned initialization M0 to optimize prefix memory tokens via test-time gradient descent while keeping model weights frozen. On associative KV-retrieval with 96 key–value pairs, GradMem with 5 gradient WRITE steps reaches 88.4% exact match versus 12.9% for forward-only RMT with the same 8-vector memory.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.

Cognitive Architecture · Long-Term Memory

Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

Diego C. Lerma-Torres

· 2026

Human-Like Lifelong Memory combines Executive Function and Working Memory, a Memory Service Knowledge Graph, and a Thalamic Gateway to implement dual-process, valence-aware lifelong memory. Human-Like Lifelong Memory is a theoretical framework with seven functional properties and testable predictions rather than benchmark numbers against specific baselines.

Benchmark · Cognitive Architecture

Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

Ying Xie

· 2026

SleepGate augments transformers with a Conflict-Aware Temporal Tagger, Forgetting Gate, Consolidation Module, and Sleep Trigger that periodically rewrite the KV cache during sleep micro-cycles. On the PI-LLM benchmark, SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while full KV cache, sliding window, H2O, StreamingLLM, and a decay-only ablation all stay below 18% across all depths.

Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Lightweight LLM Agent Memory with Small Language Models

Jiaquan Zhang, Chaoning Zhang et al.

· 2026

LightMem orchestrates an SLM-1 Controller, SLM-2 Selector, SLM-3 Writer, and STM, MTM, and LTM stores to modularize retrieval, writing, and offline consolidation. On LoCoMo, LightMem reaches 34.50 F1 on multi-hop questions with GPT-4o, +1.64 over A-MEM, while keeping median retrieval latency at 83 ms.

Long-Term Memory

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

Keqin Xie

· 2026

LPC-SM combines local attention, dual-timescale memory, predictive correction, Orthogonal Novelty Transport, and multi-head-coupled residual routing (mHC) inside a single autoregressive block. On OpenWebMath-10k continuation, LPC-SM with adaptive sparse control reaches final LM loss 10.787 versus 12.137 for a fixed sparse controller, a 12.517% improvement.

Agent Memory

MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents

Dongming Jiang, Yi Li et al.

· 2026

MAGMA organizes agent memory with an Intent-Aware Router, Adaptive Topological Retrieval, a Data Structure Layer of Relation Graphs and Vector Database, plus dual-stream Synaptic Ingestion and Asynchronous Consolidation. On LoCoMo, MAGMA achieves a 0.700 overall LLM-as-a-Judge score versus 0.590 for Nemori, and reaches 61.2% average accuracy on LongMemEval versus 56.2% for Nemori.

Benchmark · Agent Memory · Long-Term Memory

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Weiwei Xie, Shaoxiong Guo et al.

· 2026

MemEvoBench combines Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to stress-test long-term memory safety in LLM agents across 7 domains and 36 risk types. On the QA Style benchmark, MemEvoBench shows Gemini-2.5-Pro’s ASR drops from 67.0% (Vanilla) to 19.0% with +ModTool in Round 1, while biased feedback can push GPT-5’s QA ASR from 59.0% to 78.0% by Round 3.

Agent Memory

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Zhenting Wang, Huancheng Chen et al.

· 2026

Memex(RL) optimizes Indexed Experience Memory, CompressExperience, ReadExperience, and ContextStatus so Memex keeps only an indexed summary in-context while archiving full artifacts externally. On modified ALFWorld, Memex(RL) lifts task success from 24.22% to 85.61% over the Memex agent without RL while reducing peak working context from 16,934.46 to 9,634.47 tokens.

RAG · Benchmark · Long-Term Memory

MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Shu Wang, Edwin Yu et al.

· 2026

MemMachine combines Short-term memory, Long-term memory, Profile memory, and the Retrieval Agent to store raw conversational episodes and retrieve clustered context around nucleus matches. On LoCoMo, MemMachine scores 0.9169 with gpt-4.1-mini while using about 80% fewer input tokens than Mem0, and reaches 93.0% on LongMemEvalS with GPT-5-mini.

RAG · Agent Memory · Long-Term Memory · Memory Architecture

Memory as Metabolism: A Design for Companion Knowledge Systems

Stefan Miteski

· 2026

Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Instead of benchmark gains, Memory as Metabolism’s main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.

Benchmark · Agent Memory

MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei et al.

· 2026

MEMORYCD builds a user memory pool Mu from lifelong Amazon Review histories and evaluates long-context prompting, Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem across rating, ranking, and personalized text tasks. On Books and Home & Kitchen, MEMORYCD shows GPT-5 reaches RMSE 0.551–0.624 and NDCG@3 up to 0.610, while Gemini-2.5 Pro peaks at ROUGE-L 0.222 for generation, revealing substantial remaining gaps to real user behavior.

Survey · RAG · Agent Memory

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du

· 2026

Memory for Autonomous LLM Agents decomposes agent memory into a POMDP-grounded write–manage–read loop, a three-dimensional taxonomy, and five mechanism families spanning context compression, retrieval stores, reflection, hierarchical virtual context, and policy-learned management. Memory for Autonomous LLM Agents synthesizes results like Voyager’s 15.3× tech-tree speedup and MemoryArena’s 80%→45% drop to show that memory architecture often matters more than backbone choice.

Agent Memory

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026 · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.

Long-Term Memory

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Bowen Yang, Kaiming Jin et al.

arXiv 2026 · 2026

OS-SYMPHONY coordinates an Orchestrator, Reflection-Memory Agent, and Versatile Tool Agents (Multimodal Searcher, Grounders, Coder) to stabilize long-horizon GUI workflows and fetch visual tutorials on demand. On OSWorld-Verified, OS-SYMPHONY with GPT-5 scores 65.84% at 100 steps, beating Agent S3 w/ GPT-5 (62.63%) by 3.21 percentage points.

Memory Architecture

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Minki Kang, Wei-Ning Chen et al.

arXiv 2025 · 2025

ACON combines History Compression, Observation Compression, Compression Guideline Optimization, and Compressor Distillation to rewrite agent histories and observations into concise, task-aware summaries. On AppWorld, ACON UTCO with gpt-4.1 achieves 56.5% accuracy with 7.33k peak tokens, versus 56.0% accuracy with 9.93k peak tokens for No compression.

Agent Memory

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang et al.

· 2025

A-MEM organizes agent memory via Note Construction, Link Generation, Memory Evolution, and Retrieve Relative Memory to build an evolving, interconnected note graph. On the LoCoMo dataset, A-MEM with GPT-4o-mini reaches 27.02 F1 on Multi Hop questions, +17.87 over ReadAgent, while cutting average token length from 16,910 to 2,520.

Agent Memory

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Qianshan Wei, Tengchao Yang et al.

· 2025

A-MemGuard combines consensus-based validation, dual-memory structure, lesson memory, and path divergence scoring to sanitize retrieved memories and revise actions using past failures. On EHRAgent under AgentPoison, A-MemGuard reduces ASR-r from 100.0% to 2.13% and ASR-t from 100.0% to 6.38%, far below LLM Auditor and Distil Classifier.

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with Retrieval from the Conversation, Scratchpad Formation and Utilization, and a Working Memory buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
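
The +107.3% figure is a relative gain over the baseline score, not a percentage-point difference. A minimal sketch of the arithmetic, using only the two scores quoted above (the function name `relative_gain_percent` is illustrative, not from the paper):

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Scores quoted above for GPT-4.1-nano at 10M-token conversations.
gain = relative_gain_percent(0.109, 0.226)
print(round(gain, 1))  # 107.3
```

The absolute difference is 0.117 on the benchmark's score scale, so the score slightly more than doubles while remaining low in absolute terms.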

RAG · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova, Björn Ross, Alexandra Birch

· 2025

Evaluating Long-Term Memory for Long-Context Question Answering compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem memory components across semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 ranking of 1.83 for Llama 3.2-3B Instruct and 1.80 for GPT-4o mini while using around 1,000 tokens per query versus over 23,000 for Full Context.

Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. In Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

BenchmarkAgent MemoryMemory Architecture

Forgetful but Faithful: A Cognitive Memory Architecture and Benchmark for Privacy-Aware Generative Agents

Saad Alqithami

· 2025

MaRS organizes agent memory into episodic, semantic, social, and task nodes with provenance, scored by a privacy-aware retention controller and governed by FIFO, LRU, Priority Decay, Reflection-Summary, Random-Drop, and Hybrid policies. On the FiFA benchmark, the Hybrid policy in MaRS achieves a composite score of ≈0.911 across 300 runs and five memory budgets, outperforming simpler policies while preserving privacy and cost efficiency.
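To make two of the retention policies named above concrete, here is a minimal sketch of FIFO and LRU eviction under a fixed memory budget. The `MemoryStore` class and its method names are hypothetical and are not taken from MaRS; the paper's privacy-aware scoring, provenance tracking, and the other policies are not modelled here.

```python
from collections import OrderedDict


class MemoryStore:
    """Toy budgeted memory with FIFO or LRU eviction (hypothetical, not the MaRS API)."""

    def __init__(self, budget, policy="lru"):
        if policy not in {"fifo", "lru"}:
            raise ValueError(f"unknown policy: {policy}")
        self.budget = budget
        self.policy = policy
        self._items = OrderedDict()  # oldest / least recently used entry comes first

    def write(self, key, value):
        if key in self._items:
            self._items[key] = value
            if self.policy == "lru":
                self._items.move_to_end(key)  # an update counts as a use
            return
        self._items[key] = value
        while len(self._items) > self.budget:
            self._items.popitem(last=False)  # evict from the front

    def read(self, key):
        value = self._items.get(key)
        if value is not None and self.policy == "lru":
            self._items.move_to_end(key)  # reads refresh recency under LRU only
        return value

    def keys(self):
        return list(self._items)
```

Under FIFO a read never protects an item from eviction, while under LRU a recently read item outlives an older untouched one; that difference is the kind of behaviour a budgeted comparison of policies would surface.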

Survey

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

Yaxiong Wu, Sheng Liang et al.

arXiv 2025 · 2025

From Human Memory to AI Memory maps human memory categories onto AI memory through Personal Memory, System Memory, and the Three-Dimensional Eight-Quadrant (3D-8Q) Memory Taxonomy. Its main contribution is a systematic organization of memory in LLM-driven AI systems across eight quadrants defined by object, form, and time, each connected to human memory types.

Agent Memory

General Agentic Memory Via Deep Research

B.Y. Yan, Chaofan Li et al.

arXiv 2025 · 2025

General Agentic Memory (GAM) combines a Memorizer, Researcher, page-store, and memory to keep full trajectories while constructing lightweight guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.

Agent MemoryMemory Architecture

Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

Chris Latimer, Nicoló Boschi et al.

· 2025

HINDSIGHT organizes agent memory into four networks via TEMPR and layers CARA on top to retain, recall, and reflect with explicit opinions and behavioral profiles. On LongMemEval, HINDSIGHT with Gemini-3 Pro scores 91.4% overall versus 60.2% for full-context GPT-4o, while HINDSIGHT with OSS-20B jumps from 39.0% to 83.6% over a full-context OSS-20B baseline.

Agent Memory

IMDMR: An Intelligent Multi-Dimensional Memory Retrieval System for Enhanced Conversational AI

Tejas Pawar, Sarika Patil et al.

· 2025

IMDMR combines a Memory Storage Layer, Multi-Dimensional Search Engine, Intelligent Query Processor, and Response Generation Module to retrieve conversational memories across semantic, entity, category, intent, context, and temporal dimensions. On the synthetic 1,000 conversation benchmark, IMDMR-Prod achieves an overall score of 0.792 compared to 0.207 for spaCy + RAG, a 3.8x improvement.

RAGBenchmarkMemory Architecture

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang et al.

· 2025

HGMEM represents working memory as a hypergraph with Hypergraph-based Memory Storage, Adaptive Memory-based Evidence Retrieval, and Dynamic Memory Evolving to build high-order correlations across entities and facts. On Prelude long narrative understanding, HGMEM with GPT-4o achieves 73.81% accuracy compared to 72.22% for HippoRAG v2, while also reaching 69.74 comprehensiveness on LongBench generative sense-making QA.

Benchmark

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Zhen Tan, Jun Yan et al.

ACL 2025 · 2025

Reflective Memory Management (RMM) uses a memory bank, retriever, reranker, and LLM to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.

Benchmark

Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Andrew Kyle Lampinen, Martin Engelcke et al.

· 2025

Latent learning uses oracle episodic retrieval, within-experience in-context learning, parametric learning, and latent learning benchmarks to study how reinstating full past episodes into context changes generalization. Latent learning shows that oracle retrieval solves latent tests like reversals and codebooks where parametric-only transformers stay near 0% despite strong performance on forward and in-context variants.

RAGBenchmarkMemory Architecture

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang et al.

· 2025

Learning from Supervision with Semantic and Episodic Memory combines a performance agent, critic agent, semantic memory, episodic memory, and memory retriever to turn label-grounded critiques into reusable supervision without parameter updates. On the Multi-Condition Ranking dataset with Mixtral 8x22B and o4-mini as critic, the approach reaches 85.6% accuracy, a 24.8-point gain over the EP_LABEL baseline at 60.8%.

Agent MemoryLong-Term MemoryMemory Architecture

LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning

Zhengjun Huang, Zhoujin Tian et al.

· 2025

LiCoMemory organizes long-term dialogue with CogniGraph, Query Processing and Integrated Rerank, and Real Time Interactions to keep session summaries, triples, and chunks linked. On LongMemEval with GPT-4o-mini, LiCoMemory reaches 73.80% accuracy and 76.63% recall, beating Mem0g by 9.0 and 7.1 points respectively.

BenchmarkLong-Term MemoryMemory Architecture

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng et al.

· 2025

LightMem pipelines a Cognitive-Inspired Sensory Memory, Topic Segmentation Submodule, Topic-Aware Short-Term Memory, and Long-Term Memory with Sleep-Time Update to filter, group, summarize, and asynchronously consolidate dialogue history. On LongMemEval-S with Qwen3-30B-A3B-Instruct-2507, LightMem reaches 70.20% ACC vs 65.20% for A-MEM (+5.00 points) while reducing total token usage by up to 21.8× and API calls by up to 17.1×.

RAG

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Marc Glocker, Peter Hönig et al.

· 2025

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics coordinates a routing agent, task planning agent, and knowledge base agent over RAG and ChromaDB to translate household commands into grounded robot actions. In three tabletop scenarios, the system with Qwen2.5-32B achieves 84.3% total lenient task planning accuracy versus 68.7% for Gemma2-27B and 61.1% for LLaMa3.1-8B.

Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents using a Multi-scenario Dataset, Multi-level Memory Content, and Multi-metric Evaluation within a time-aware benchmark. MemBench shows that mechanisms such as GenerativeAgent, MemGPT, MemoryBank, and SCMemory can drop from accuracies around 0.7 on 10k-token settings to roughly 0.3–0.4 at 100k tokens, exposing clear capacity limits.

Benchmark

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Huichi Zhou, Yihang Chen et al.

arXiv 2025 · 2025

Memento combines a Planner, Executor, Case Memory, Subtask Memory, and Tool Memory inside a memory-based MDP, learning a neural case-selection policy over a growing Case Bank instead of updating LLM weights. On GAIA, Memento achieves 87.88% Pass@3 on validation and 79.40% on the test set; on the DeepResearcher benchmarks, it surpasses DeepResearcher’s 51.8% F1 and 60.5% PM by +14.8 F1 and +19.9 PM.

BenchmarkAgent MemoryMemory Architecture

MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren et al.

· 2025

MemEvolve decomposes agent memory into Encode, Store, Retrieve, and Manage modules and meta-evolves these components via a dual-evolution process over candidate architectures. With GPT-5-mini, MemEvolve raises Flash Searcher pass@1 on xBench DeepSearch from 69.0 to 74.0 and WebWalkerQA accuracy from 58.82 to 61.18 while keeping API cost near 0.141 per query.

RAG

MemInsight: Autonomous Memory Augmentation for LLM Agents

Rana Salama, Jason Cai et al.

· 2025

MemInsight augments agent memory using Attribute Mining, Annotation and Attribute Prioritization, and Memory Retrieval modules that generate and exploit structured attributes over past interactions. On the LoCoMo question answering benchmark, MemInsight with Claude-3-Sonnet priority augmentation achieves 60.5% Recall@5 versus 26.5% for DPR, a 34.0-point improvement.
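For readers unfamiliar with the metric, Recall@5 can be sketched as the fraction of relevant items that appear among the top five retrieved results. This is one common definition and an assumption here; papers sometimes report a hit-rate variant instead, and the function name and document IDs below are illustrative only.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items found in the top-k of a ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical retrieval result: "d1" is in the top 5, "d2" is ranked sixth.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
score = recall_at_k(ranked, relevant_ids=["d1", "d2"], k=5)
```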

BenchmarkAgent Memory

Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI

Samarth Sarin, Lovepreet Singh et al.

· 2025

Memoria augments LLM chats with structured conversation logging, a dynamic user persona via KG, session-level memory for real-time context, and seamless retrieval for context-aware responses to provide persistent, interpretable memory. On the LongMemEval single-session-user and knowledge-update subsets, Memoria reaches 87.1% and 80.8% accuracy respectively, surpassing A-Mem (OpenAI) while using much shorter prompts.

BenchmarkAgent Memory

MemoriesDB: A Temporal-Semantic-Relational Database for Long-Term Agent Memory / Modeling Experience as a Graph of Temporal-Semantic Surfaces

Joel Ward

· 2025

MemoriesDB stores each Memory Record, Edges and Relations, and the Temporal Semantic Stack inside PostgreSQL with pgvector, exposing unified temporal–semantic–relational queries. MemoriesDB’s main result is a working implementation that demonstrates scalable time-bounded recall and hybrid semantic–structural queries on commodity SQL infrastructure without specialized vector or graph engines.

Memory Architecture

Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham, Gabriel Raya et al.

arXiv 2025 · 2025

Memorization to Generalization recasts diffusion training and sampling as Dense Associative Memory dynamics, analyzing memorized, spurious, and generalized states via energy basins and curvature. Memorization to Generalization shows that as training size grows on MNIST, CIFAR10, FASHION-MNIST, LSUN-CHURCH, and Stable Diffusion, spurious states peak at the memorization–generalization boundary and have distinct basin volume and curvature signatures.

Cognitive ArchitectureLong-Term Memory

Memory as Resonance: A Biomimetic Architecture for Infinite Context Memory on Ergodic Phonetic Manifolds

Tarik Houichime, Abdelghani Souhar, Younes El Amrani

· 2025

Phonetic Trajectory Memory (PTM) combines the Acoustic Injection, Entropy Filter, Neuro-Symbolic Relay, and Resonance Engine to encode text as a continuous trajectory on an ergodic Hyper-Torus Memory instead of a growing KV cache. PTM delivers >3,000× signal-to-KV compression while maintaining ≈92% factual accuracy and sub-50 ms retrieval latency on long narrative and scientific corpora compared to dense KV baselines.

RAGBenchmarkMemory Architecture

Memory-Augmented Log Analysis with Phi-4-mini: Enhancing Threat Detection in Structured Security Logs

Anbi Guo, Mahfuza Farooque

· 2025

DM-RAG augments Phi-4-mini with a Short-Term Memory (STM) buffer, Long-Term Memory (LTM) FAISS store, Bayesian fusion, and a logistic regression confidence model for structured log analysis. On UNSW-NB15, DM-RAG reaches 98.70% recall and 69.59% F1, beating the Phi-4 + RAG (MITRE) baseline in F1 by 17.89 points.

SurveyCognitive ArchitectureMemory Architecture

Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures

Parsa Omidi, Xingshuai Huang et al.

arXiv 2025 · 2025

Memory-Augmented Transformers organizes functional objectives, memory types, and integration techniques into a unified taxonomy that connects biological memory principles with concrete architectures like Memformer, Titans, ATLAS, and EMAT. Memory-Augmented Transformers’ main result is a systematic three-dimensional classification that links dynamic multi-timescale memory, selective attention, and consolidation to specific Transformer designs and emerging lifelong-learning paradigms.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates Task Provider, User Simulator, and Performance Monitor to feed heterogeneous tasks, simulate explicit and implicit feedback, and score LLM systems across declarative and procedural memory. MemoryBench’s main finding is that state-of-the-art memory systems like A-Mem, Mem0, and MemoryOS often fail to beat naive BM25 or embedding-based RAG on partitions such as SiLo and LiLo.

RAGBenchmarkMemory Architecture

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao, Jiarui Wang et al.

· 2025

Memory Decoder combines a Pre-training stage that aligns with kNN-LM distributions and an Inference interpolation mechanism that mixes Memory Decoder and base LLM outputs without changing base parameters. On Wikitext-103, Memory Decoder with 124M parameters reaches 13.36 perplexity on GPT2-small versus 14.76 for DAPT, and on specialized domains a single 0.5B Memory Decoder reduces average perplexity from 14.88 to 4.05 on Qwen2-0.5B.
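The inference-time interpolation described above can be sketched as a convex mix of two next-token distributions. The function name, the weight `alpha`, and the toy probabilities are illustrative assumptions; the paper's exact formulation and weighting may differ.

```python
def interpolate(p_base, p_mem, alpha):
    """Return alpha * p_mem + (1 - alpha) * p_base over the union vocabulary."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(p_base) | set(p_mem)
    return {
        tok: alpha * p_mem.get(tok, 0.0) + (1.0 - alpha) * p_base.get(tok, 0.0)
        for tok in vocab
    }


# Made-up next-token distributions, for illustration only.
p_base = {"cat": 0.6, "dog": 0.3, "axolotl": 0.1}  # base LLM
p_mem = {"cat": 0.1, "dog": 0.1, "axolotl": 0.8}   # domain memory component
mixed = interpolate(p_base, p_mem, alpha=0.5)
```

Because both inputs are probability distributions and the weights sum to one, the mix is again a distribution, and neither model's parameters need to change.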

RAGBenchmarkAgent Memory

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu et al.

· 2025

Memory in the Age of AI Agents formalizes agent memory with Memory Formation, Memory Evolution, and Memory Retrieval operators, and classifies memories into token-level, parametric, and latent forms plus factual, experiential, and working functions. Memory in the Age of AI Agents’ main result is a unified Forms–Functions–Dynamics framework that consolidates fragmented LLM agent memory work, benchmarks, and open-source frameworks into a coherent taxonomy.

Benchmark

MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Zhiyu Li, Shichao Song et al.

· 2025

MemOS standardizes memory as MemCube units orchestrated by MemReader, MemScheduler, MemLifecycle, MemOperator, MemVault, and MemGovernance to manage parametric, activation, and plaintext memory as one system. MemOS does not report benchmark numbers but instead contributes a unified architecture and Memory Interchange Protocol for cross-LLM memory sharing and lifecycle governance.

Benchmark

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang et al.

· 2025

Memp constructs agent skills via Build, Retrieve, and Update modules that turn past trajectories into scripts, trajectories, and combined proceduralizations stored in a procedural memory library. On ALFWorld, Memp’s proceduralization with GPT-4o reaches 77.86% test success versus 42.14% with no memory, while reducing steps from 23.76 to 15.01.

BenchmarkRAG

MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Qian Wang, Zahra Yousefijamarani et al.

· 2025

MEPIC extends vLLM with a Chunk Cache Coordinator, Chunk Matcher, Hybrid KV Manager, Chunk LRU Manager, and Chunk Processor to manage canonical, page-aligned, position-independent KV chunks in HBM. On long-context workloads, MEPIC reduces HBM usage by up to 5.21× and lowers latency by up to 11.48% compared to CacheBlend on Mistral-7B-Instruct-v0.3.

RAG

MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

Andreas Ottem

· 2025

MeVe decomposes retrieval into Initial Retrieval, Relevance Verification, Fallback Retrieval, Context Prioritization, and Token Budgeting to tightly control what enters the LLM context. On a Wikipedia subset and HotpotQA, MeVe reduces average context from 188.8 to 79.8 tokens and from 308.6 to 78.5 tokens respectively compared to Standard RAG while keeping retrieval time comparable.
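The five stages can be sketched as a small pipeline. Everything below is a hypothetical stand-in rather than MeVe's implementation: the word-overlap scorer replaces a real retriever and verifier, the fallback simply reuses the single best match, and tokens are counted by whitespace.

```python
def overlap(query, doc):
    """Fraction of query words that also appear in the document (toy relevance score)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def meve_like_context(query, corpus, k=3, threshold=0.5, token_budget=12):
    # 1. Initial retrieval: top-k documents by the toy overlap score.
    ranked = sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)
    candidates = ranked[:k]
    # 2. Relevance verification: drop candidates scoring below the threshold.
    verified = [doc for doc in candidates if overlap(query, doc) >= threshold]
    # 3. Fallback retrieval: if nothing survives, keep the single best match.
    if not verified and ranked:
        verified = ranked[:1]
    # 4. Context prioritization: most relevant first, shorter documents on ties.
    verified.sort(key=lambda doc: (-overlap(query, doc), len(doc.split())))
    # 5. Token budgeting: greedily pack documents that still fit the budget.
    context, used = [], 0
    for doc in verified:
        cost = len(doc.split())
        if used + cost <= token_budget:
            context.append(doc)
            used += cost
    return context


corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "bananas are yellow",
    "the capital of france hosts the louvre museum in paris",
]
context = meve_like_context("capital of france", corpus)
```

In this toy run the long fourth document is verified as relevant but skipped by the budget step, which is the mechanism by which a pipeline of this shape shrinks the context that reaches the model.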

Long-Term Memory

M+: Extending MemoryLLM with Scalable Long-Term Memory

Yu Wang, Dmitry Krotov et al.

· 2025

M+ builds short-term memory θ, long-term memory Θ, a co-trained retriever, and a Multi-LoRA design on top of MemoryLLM’s layer-wise memory pools. On SQuAD-style knowledge retention, M+ maintains accuracy beyond 160k tokens while MemoryLLM-7B collapses before 20k and Llama-3.1-8B-SnapKV fails beyond 30k tokens.

Benchmark

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Yu Wang, Xi Chen

· 2025

MIRIX organizes Core Memory, Episodic Memory, Semantic Memory, Procedural Memory, Resource Memory, and Knowledge Vault under multi-agent control with Active Retrieval for topic-driven access. On ScreenshotVQA, MIRIX reaches 0.5950 accuracy vs 0.4410 for SigLIP@50 while shrinking storage from 15.07GB to 15.89MB, and on LOCOMO MIRIX scores 85.38% vs 79.09% for Zep.

BenchmarkMemory Architecture

MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Stefano Zeppieri

· 2025

MMAG organizes conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory under a modular memory controller integrated with Heero’s encrypted Firestore and S3 stores. MMAG delivers a 20% increase in user retention and a 30% increase in average conversation duration on the Heero language learning platform compared to its pre-memory deployment.

RAGLong-Term MemoryMemory Architecture

Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

Aneesh Jonelagadda, Christina Hahn et al.

· 2025

Mnemosyne combines a Commitment pipeline with substance and redundancy filters, a probabilistic Recall traversal over a graph-structured store, asynchronous Core Summary updates, and a Pruning module to manage long-term memory on edge devices. On the LoCoMo benchmark, Mnemosyne reaches a 60.42% temporal reasoning J-score and a 54.55% overall J-score, ahead of Memory-R1 on temporal reasoning (51.55%) but behind it overall (62.74%), and achieves a 65.8% win rate in human evaluations versus 31.07% for a naive RAG baseline.

Benchmark

MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLMs

Jiani Huang, Xingchen Zou et al.

· 2025

MR.Rec unifies User-specific Local Memory, Cross-user Global Memory, Reasoning-enhanced Memory Retrieval, and Reinforcement Learning for Memory-synergized Reasoning into a single recommendation assistant pipeline. On the Amazon-C4–based benchmark, MR.Rec achieves NDCG@100 = 0.113 and Recall@100 = 0.270, improving over the best baseline Rec-R1 (NDCG@100 = 0.104, Recall@100 = 0.260).

Memory Architecture

Muon Outperforms Adam in Tail-End Associative Memory Learning

Shuche Wang, Fengzhuo Zhang et al.

· 2025

Muon Outperforms Adam in Tail-End Associative Memory Learning analyzes how VO attention weights, FFN matrices, normalized SVD entropy, and effective rank behave under Muon versus Adam in transformer associative memories. Muon Outperforms Adam in Tail-End Associative Memory Learning finds that applying Muon to VO and FFN nearly recovers full-Muon validation loss (3.5654 vs 3.9242 for All Adam at 10k steps on FineWeb) while improving tail-class accuracy on a heavy-tailed QA task compared to Adam and SGD+Momentum.

BenchmarkAgent Memory

PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Bowen Jiang, Yuan Yuan et al.

· 2025

PersonaMem-v2 combines PERSONAMEM-V2: IMPLICIT PERSONAS, RL with Long-Context Reasoning, RL with Agentic Memory, and a User Privacy-Aware Design to train Qwen3-4B with GRPO on implicit user preferences from long, noisy histories. PersonaMem-v2 achieves 55.2% MCQ and 60.7% open-ended accuracy on PERSONAMEM-V2, surpassing GPT-5-Chat’s 45.6% and 46.2% while using a 2k-token agentic memory instead of full 32k–128k contexts.

Benchmark

Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents

Mathis Pink, Qinyuan Wu et al.

· 2025

Episodic Memory is the Missing Piece for Long-Term LLM Agents proposes an architecture where in-context memory, external memory, and parametric memory are coordinated via consolidation, encoding, and retrieval to realize long-term, instance-specific, contextual episodic traces. Episodic Memory is the Missing Piece for Long-Term LLM Agents contributes a five-property taxonomy, a three-way memory categorization (in-context, external, parametric), and a roadmap of six research questions instead of benchmark gains.

BenchmarkLong-Term Memory

Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Sangyeop Kim, Yohan Lee et al.

· 2025

PREMem builds long-term dialogue memory by combining Episodic Memory Extraction, Pre-Storage Memory Reasoning, semantic clustering, a persistent memory pool, and an inference phase over enriched memory fragments. PREMem reaches an LLM-as-a-judge score of 71.4 on LongMemEval with a gpt-4.1 base, a +15.5 gain over HippoRAG 2 and +9.6 over A-Mem.

Benchmark

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng et al.

· 2025

ReMe manages procedural memory through experience acquisition, experience reuse, and experience refinement, combining multi-faceted distillation, context-adaptive reuse, and utility-based deletion into a single lifecycle. On BFCL-V3 and AppWorld, Qwen3-8B with ReMe (dynamic) achieves 34.94% Avg@4 vs 27.65% for the No Memory baseline, and 55.03% Pass@4 vs 46.20%, suggesting that self-evolving memory can substitute for model scale.

RAGBenchmarkAgent MemoryMemory Architecture

Semantic Anchoring in Agentic Memory: Leveraging Linguistic Structures for Persistent Conversational Context

Maitreyi Chatterjee, Devansh Agarwal

· 2025

Semantic Anchoring enriches conversational memory by combining a hybrid memory store with dense and symbolic indexes, structured memory representation tuples, and a retrieval scoring method. On MultiWOZ-Long, Semantic Anchoring reaches 83.5% Factual Recall and 80.8% Discourse Coherence, beating Entity-RAG by 7.6 and 8.6 points respectively.

Benchmark

SGMem: Sentence Graph Memory for Long-Term Conversational Agents

Yaxiong Wu, Yongyue Zhang et al.

· 2025

SGMem organizes long conversations via SGMem Construction and Management, SGMem Usage, sentence-level graphs, and multi-hop retrieval over sessions, rounds, turns, summaries, facts, and insights. SGMem achieves 0.700 Accuracy (Top 5) on LongMemEval and 0.526 on LoCoMo, beating the RAG-SMFI baseline at 0.676 and 0.510 respectively.

RAGMemory Architecture

TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Chunliang Chen, Ming Guan et al.

· 2025

TeleMem converts interactions into unified semantic nodes via the representation layer, organizes them in a memory graph with Insert and ReInsert, and reads them using closure-based retrieval and a ReAct-style multimodal agent. On ZH-4O, TeleMem reaches 86.33% QA Accuracy, beating the Mem0 baseline at 70.20% and the RAG baseline at 62.45%.

Memory Architecture

Test-time regression: a unifying framework for designing sequence models with associative memory

Ke Alexander Wang, Jiaxin Shi, Emily B. Fox

· 2025

Test-time regression uses memorization as regression, memory retrieval, and test-time regression layers to reinterpret sequence architectures as solving a regression problem over key-value pairs during the forward pass. This unification shows how linear attention, state space models, fast weight programmers, online learning layers, and softmax attention are all instances of the same framework and explains phenomena like linear attention’s failures and the role of query-key normalization.
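A small numeric sketch can illustrate the regression view. It assumes, as one reading of the framework, that linear attention stores the sum of value-key outer products (a one-shot estimate that ignores correlations between keys), while a delta-rule update takes online gradient steps on the squared retrieval error. The helper names and the two-dimensional toy data are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def add(A, B):
    """Elementwise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def linear_attention_memory(keys, values):
    """M = sum_i v_i k_i^T, with no correction for correlated keys."""
    M = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for k, v in zip(keys, values):
        M = add(M, [[vi * kj for kj in k] for vi in v])
    return M


def delta_rule_memory(keys, values, lr=1.0, epochs=50):
    """Repeated online gradient steps on 0.5 * ||v - M k||^2 for each pair."""
    M = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for _ in range(epochs):
        for k, v in zip(keys, values):
            err = [vi - pi for vi, pi in zip(v, matvec(M, k))]
            M = add(M, [[lr * e * kj for kj in k] for e in err])
    return M


# Two unit-norm but correlated keys (dot product 0.6) with scalar values.
keys = [[1.0, 0.0], [0.6, 0.8]]
values = [[1.0], [-1.0]]
M_lin = linear_attention_memory(keys, values)
M_delta = delta_rule_memory(keys, values)
recalled_lin = matvec(M_lin, keys[0])[0]      # interference from the second key
recalled_delta = matvec(M_delta, keys[0])[0]  # close to the stored value 1.0
```

With correlated keys the outer-product memory returns a value contaminated by the other stored pair, whereas the regression-style updates recover the stored value, which is one way to see why plain linear attention can fail at recall.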

Benchmark

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri et al.

· 2025

TurboQuant combines MSE Optimal TurboQuant, Inner-product Optimal TurboQuant, QJL, Random Rotation Matrix Π, and Lloyd-Max Quantizer to quantize vectors online with near-optimal distortion-rate guarantees. TurboQuant matches the Shannon lower bound within a factor of √3·π/2 ≈ 2.7 for MSE and achieves absolute quality neutrality for KV cache quantization at 3.5 bits per channel compared to full-precision baselines.
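As a quick arithmetic check of the ≈2.7 constant, the two possible groupings of the square root evaluate as follows (the decimal approximations are rounded to four places):

```latex
\[
\frac{\sqrt{3}\,\pi}{2} \approx \frac{1.7321 \times 3.1416}{2} \approx 2.72,
\qquad
\sqrt{\frac{3\pi}{2}} \approx \sqrt{4.7124} \approx 2.17 .
\]
```

Only the first grouping, with the square root applied to 3 alone, is consistent with a quoted value of about 2.7.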

Benchmark

Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Chupei Wang, Jiaqiu Vince Sun

· 2025

PI-LLM uses a synthetic key–value retrieval task, Interference Endurance Score (IES), per-key forget prompts, and a mock QA reset to stress-test working-memory-like behavior under proactive interference. PI-LLM finds a universal log-linear decay in retrieval accuracy across 0.6B–637B-parameter LLMs as interference grows, revealing that parameter size, not context window length, predicts interference robustness.
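A synthetic key-value retrieval task under proactive interference can be sketched as a stream of repeated updates in which only the most recent value per key counts as correct. The prompt wording, value range, and function names below are hypothetical, and the Interference Endurance Score itself is not computed here.

```python
import random


def build_interference_task(keys, updates_per_key, seed=0):
    """Build a stream of key-value updates plus the ground-truth final values."""
    rng = random.Random(seed)
    stream = []
    for key in keys:
        for _ in range(updates_per_key):
            stream.append((key, rng.randint(0, 999)))
    rng.shuffle(stream)  # interleave updates across keys
    truth = {}
    for key, value in stream:
        truth[key] = value  # later updates overwrite earlier ones
    prompt = "\n".join(f"{key}: {value}" for key, value in stream)
    prompt += "\nWhat is the current value of each key?"
    return prompt, truth


def accuracy(predicted, truth):
    """Fraction of keys whose predicted value matches the most recent update."""
    return sum(predicted.get(k) == v for k, v in truth.items()) / len(truth)


prompt, truth = build_interference_task(["alpha", "beta"], updates_per_key=3, seed=1)
```

Raising `updates_per_key` increases the number of stale values competing with each correct answer while the question stays the same, which is the knob an interference study of this kind would sweep.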

Memory Architecture

Understanding Transformer from the Perspective of Associative Memory

Shu Zhong, Mingyu Xu et al.

· 2025

Understanding Transformer from the Perspective of Associative Memory reframes Softmax Attention, Linear Attention, FFN, and DeltaNet as instances of a unified associative memory with explicit memory capacity and update rules. Using this lens, Understanding Transformer from the Perspective of Associative Memory derives retrieval SNR for different kernels, unifies attention and FFNs, and proves that DeltaFormer achieves circuit complexity beyond TC0, reaching NC1 expressivity.

RAG

Understanding Users' Privacy Perceptions Towards LLM's RAG-based Memory

Shuning Zhang, Rongjun Ma et al.

· 2025

Understanding Users' Privacy Perceptions Towards LLM's RAG-based Memory analyzes users' mental models, privacy calculus, and expectations around RAG-based memory across generation, management, usage, and updating. Understanding Users' Privacy Perceptions Towards LLM's RAG-based Memory finds users demand explicit consent, fine-grained editing and deletion, and visibility into inferred information to trust RAG-based memory systems.

Agent Memory

Unveiling Privacy Risks in LLM Agent Memory

Bo Wang, Weiyi He et al.

· 2025

MEXTRA crafts black-box attacking prompts and automated diverse prompt generators that target the memory module, similarity scoring function, retrieval depth, memory size, and LLM backbone. MEXTRA extracts 50 queries from a 200-record EHRAgent memory and 26 from RAP, with extracted efficiency up to 0.42 compared to weaker baselines without workflow-aligned prompts.

Agent MemoryMemory Architecture

WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation

Jiali Cheng, Anjishnu Kumar et al.

· 2025

WebATLAS combines a Planner, Actor, Critic, and Multi-layered Memory (Working Memory, Cognitive Map, Semantic Memory) to simulate and score actions before executing them on the web. On WebArena-Lite, WebATLAS achieves 63.0% average success versus 53.9% for Plan-and-Act, a +9.1 point gain without website-specific fine-tuning.

BenchmarkMemory Architecture

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim et al.

· 2025

WorldMM dynamically coordinates Episodic Memory, Semantic Memory, Visual Memory, an Adaptive Retrieval Agent, and a Response Agent to answer queries over hour- to week-long videos. On five long video QA benchmarks, WorldMM-GPT reaches 69.5% average accuracy, beating M3-Agent’s 55.1% by 14.4 points and the best prior memory baseline HippoRAG’s 57.0% by 12.5 points.

Benchmark

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Preston Rasmussen, Pavlo Paliychuk et al.

· 2025

Zep builds memory using Episode Subgraph, Semantic Entity Subgraph, Community Subgraph, and a three-stage Search–Reranker–Constructor pipeline over the Graphiti temporal knowledge graph. On LongMemEval, Zep with gpt-4o scores 71.2% vs a 60.2% full-context baseline and reduces average latency from 28.9 s to 2.58 s.

Benchmark

AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Petr Anokhin, Nikita Semenov et al.

· 2024

AriGraph builds a joint world model by combining semantic memory, episodic memory, semantic search, episodic search, and the Ariadne cognitive architecture into a single evolving graph. On NetHack, AriGraph lets Ariadne reach a score of 593.00 with room-only observations, compared to 341.67 for NetPlay with the same input and 675.33 for NetPlay with full level observations.

RAG

Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models

Mehrdad Farahani, Richard Johansson

· 2024

Deciphering the Interplay of Parametric and Non-parametric Memory applies causal mediation analysis and Path-Specific Effects (PSE) across two experiments inside ATLAS to trace how parametric and non-parametric memories compete token-by-token. Deciphering the Interplay of Parametric and Non-parametric Memory reports a strong shift toward counterfactual answers in altered contexts, with a t-test p-value of 1.60e-4 and Cohen’s d of -0.9851 for non-parametric versus parametric behavior.

Memory Architecture

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Yibo Jiang, Goutham Rajendran et al.

· 2024

Do LLMs dream of elephants studies how a self-attention layer, value matrix, embedding matrix, latent concept association task, and context hijacking prompts interact to implement associative memory in transformers. Do LLMs dream of elephants proves theoretically (Theorem 1, Theorem 4) and empirically that a one-layer transformer can achieve arbitrarily small error on latent concept association by using the value matrix as associative memory.

Memory Architecture

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Yabin Zhang, Wenjie Zhu et al.

arXiv 2024 · 2024

Dual Memory Networks combines a Dynamic Memory Network, Static Memory Network, a shared ReadOut module, Projection Layers ω, and a Memory Interactive Strategy to build sample-adaptive classifiers on top of frozen CLIP encoders. On zero-shot ImageNet with ViT-B/16, Dual Memory Networks achieves 72.25% accuracy vs 66.73% for CLIP and 68.98% for TPT, a +5.52 and +3.27 point gain respectively.

Benchmark

DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

Peiqi Liu, Zhanqiu Guo et al.

· 2024

DynaMem maintains a Dynamic 3D Voxel Map, supports Embedded Vision Language Features and Multimodal Large Language Models querying, and exposes Exploration Primitives and an Obstacle map for navigation and manipulation. On real Stretch SE3 experiments, DynaMem achieves a 70% success rate on dynamic pick-and-drop tasks compared to 30% for the static OK-Robot baseline.

Benchmark

Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Hyungho Na, Yunkyeong Seo, Il-chul Moon

· 2024

EMU combines a semantic memory embedding via a deterministic conditional autoencoder with an episodic incentive built from desirability in the episodic buffer to guide cooperative MARL. EMU improves learning speed and final win-rates on StarCraft II SMAC and Google Research Football compared to QPLEX, CDS, and EMC, especially on super-hard maps.

RAG

Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Quanting Xie, So Yeon Min et al.

· 2024

Embodied-RAG builds a multimodal Topological Map and a hierarchical Semantic Forest and then runs Top-down Retrieval with LLM-based selection and hybrid re-ranking to drive Generation of waypoints and explanations. On the E-multimodal Embodied-Experiences dataset, Embodied-RAG reaches P(Q|A)=0.67 for implicit queries (Q only), compared to 0.13 for LightRAG, while building graph memory 9.76× faster than LightRAG.

RAG

"Ghost of the past": identifying and resolving privacy leakage from LLM's memory through proactive user interaction

Shuning Zhang, Lyumanshan Ye et al.

· 2024

MemoAnalyzer analyzes past inputs and long-term memories using prompt-based privacy inference, confidence and sensitivity visualization, and source tracking with an editing proxy. In a 5-day study on work, life, and academic tasks, MemoAnalyzer reduced total inferred private information by 22.3% compared to GPT memory while keeping completion time comparable to GPT and Manual baselines.

Benchmark

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Mengkang Hu, Tianxing Chen et al.

· 2024

HiAgent manages working memory using Subgoal-based Hierarchical Working Memory, Observation Summarization, and a Trajectory Retrieval module to chunk and selectively expand action-observation histories. On five AgentBoard long-horizon tasks, HiAgent reaches a 42.00% overall success rate versus 21.00% for STANDARD, with 3.80 fewer average steps.

Benchmark

Human-inspired Episodic Memory for Infinite Context LLMs

Zafeirios Fountas, Martin A Benfeghoul et al.

· 2024

EM-LLM organises long token streams into episodic events via surprise-based segmentation, boundary refinement, and a two-stage memory retrieval with similarity and contiguity buffers. On LongBench with LLaMA-3.1-8B, EM-LLM reaches 51.58% average score versus 39.3% for full-context processing and 36.44% for RAG.
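
The surprise-based segmentation step can be illustrated with a minimal sketch. It assumes per-token surprisal values (negative log-probabilities) are already available and places a boundary wherever a token's surprisal exceeds the mean plus a scaled standard deviation of a trailing window. The function name, the `window` and `gamma` parameters, and their defaults are hypothetical; boundary refinement and the two-stage retrieval are not shown.

```python
from statistics import mean, pstdev


def segment_by_surprise(surprisals, window=4, gamma=1.0):
    """Return indices where a new event is assumed to start.

    A token opens a new event when its surprisal exceeds
    mean + gamma * std of the previous `window` surprisals.
    `window` and `gamma` are illustrative values, not taken from the paper.
    """
    boundaries = []
    for t in range(window, len(surprisals)):
        recent = surprisals[t - window:t]
        threshold = mean(recent) + gamma * pstdev(recent)
        if surprisals[t] > threshold:
            boundaries.append(t)
    return boundaries


# A single spike in an otherwise flat surprisal stream.
stream = [1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0]
print(segment_by_surprise(stream))
```

In this toy stream only the spike at index 4 crosses the running threshold, so the stream is split into two events at that position.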

Benchmark

Improving Factuality with Explicit Working Memory

Mingda Chen, Yang Li et al.

· 2024

Ewe augments Llama-3.1 with an explicit working memory, real-time feedback, fact-checking outcomes, and relevant knowledge memories that are refreshed via FIFO KV-cache updates during decoding. On the Biography dataset, Ewe reaches 49.7 VeriScore F1 versus 37.1 for Llama-3.1 70B (+12.6), while keeping AlpacaEval win rate around 50% against the same baseline.

Benchmark

Larimar: Large Language Models with Episodic Memory Control

Payel Das, Subhajit Chaudhury et al.

· 2024

Larimar couples a BERT-large encoder, a deterministic hierarchical memory module, a scope detector, and a GPT2-large or GPTJ-6B decoder via a learned projection W_M to perform memory-conditioned generation. On CounterFact and ZsRE, Larimar attains up to 100.0% edit success and 0.97 edit retention rate while being 4–10× faster than ROME and GRACE.

Benchmark

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang et al.

ICLR 2025 · 2024

LongMemEval evaluates long-term interactive memory by running chat assistants through indexing, retrieval, and reading over 50k sessions with fact-augmented keys and time-aware query expansion. On LongMemEval_S, long-context LLMs like GPT-4o, Llama 3.1, and Phi-3 suffer 30%–60% accuracy drops compared to oracle evidence-only reading, revealing severe limitations in current long-context designs.

Benchmark

Long Term Memory: The Foundation of AI Self-Evolution

Xun Jiang, Feng Li et al.

· 2024

Long Term Memory equips OMNE with a Data Framework for LTM, Development Framework for LTM, and multi-agent collaboration over personalized memories to support AI self-evolution. On the GAIA benchmark, OMNE reaches first place, showing that LTM-driven multi-agent personalization can solve complex real-world tasks better than prior agent systems.

Benchmark

MemLong: Memory-Augmented Retrieval for Long Text Modeling

Weijie Liu, Zecheng Tang et al.

· 2024

MemLong augments OpenLLaMA-3B with a Ret-Mem module, Memory Bank, Retriever, Retrieval Causal Attention, and Dynamic Memory Update to store and retrieve chunk-level K-V caches via dense embeddings. On PG19 at 16k tokens, MemLong with 32K Memory reaches perplexity 9.73 vs 10.37 for MemLong-3B* without memory, and achieves up to +10.2 percentage points over OpenLLaMA on retrieval-augmented in-context learning.

Benchmark

MemoNav: Working Memory Model for Visual Navigation

Hongxin Li, Zeyu Wang et al.

· 2024

MemoNav composes Short-term memory, Selective forgetting module, Long-term memory, Working memory generation, and Transformer decoders to focus navigation on goal-relevant topological map nodes. On Gibson 1-goal, MemoNav reaches 74.7% SR vs 70.0% for VGM, and on Gibson 4-goal multi-goal tasks MemoNav achieves 28.9% PR vs 21.5% for VGM.

Long-Term Memory

PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering

Yiming Du, Hongru Wang et al.

· 2024

PerLTQA builds a synthetic personal memory database and runs questions through Memory Classification, Memory Retrieval, and Memory Synthesis to test how LLMs use semantic and episodic memories. On the PerLTQA benchmark, BERT-base achieves 95.7 F1 for memory classification, while gpt-3.5-turbo reaches MAP 0.756 for memory synthesis with retrieval and classification.

RAG

Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Thomas Schmied, Fabian Paischer et al.

· 2024

Retrieval-Augmented Decision Transformer (RA-DT) combines a vector index, embedding model g(·), maximum inner product search, experience reweighting, and cross-attention layers to retrieve and fuse relevant sub-trajectories into a Decision Transformer policy. On Dark-Room 10×10, RA-DT reaches near-optimal average reward over 40 in-context trials while using a 50-step context window, whereas baselines like Algorithm Distillation require entire episodes of up to 100 steps.
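
Maximum inner product search, the retrieval primitive named above, can be sketched with a brute-force scan. This is an illustration only: the function name is hypothetical, the vectors stand in for embeddings of sub-trajectories, and a real vector index would use an approximate search structure rather than scoring every entry.

```python
def top_k_inner_product(query, index, k):
    """Return the positions of the k index vectors with the largest
    inner product against `query` (brute force, ties broken by position)."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(index)
    ]
    # Sort by descending score, then ascending position for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy index of four 2-D embeddings (made-up values).
index = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
print(top_k_inner_product([1.0, 1.0], index, k=2))
```

Unlike cosine similarity, the inner product is not normalized, so longer vectors such as the third entry here score higher for the same direction.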

Memory Architecture

Self-evolving Agents with reflective and memory-augmented abilities

Xuechen Liang, Yangfan He et al.

· 2024

SAGE coordinates Iterative Feedback, Reflection, Short-Term Memory, Long-Term Memory, and MemorySyntax so the assistant, checker, and user co-evolve policies and memories over time. On AgentBench and long-context QA like HotpotQA, SAGE lifts GPT-3.5’s Database score from 25.9 to 37.6 and HotpotQA answer accuracy from 48.5% to 68.3%.

RAG

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Nick Alonso, Tomás Figliolia et al.

· 2024

Toward Conversational Agents with Context and Time Sensitive Long-term Memory integrates a Tabular Chat Database, Classifying Query Type, Chain-of-Tables for Meta-Data Retrieval, and Combining Meta-Data and Semantic Retrieval to handle time-sensitive and ambiguous conversational queries. On the LoCoMo-derived temporal benchmark, Toward Conversational Agents with Context and Time Sensitive Long-term Memory achieves 90.32 average recall vs 31.93 for the best Semantic w MetaD baseline.

Benchmark

TransformerFAM: Feedback attention is working memory

Dongseong Hwang, Weiran Wang et al.

· 2024

TransformerFAM augments Block Sliding Window Attention with Feedback Attention Memory, where Feedback Attention Memory (FAM) tokens attend to and compress block activations while queries jointly attend to BSWA memory segments and past FAM. On PassKey retrieval, TransformerFAM maintains 100% accuracy up to 260k filler tokens, while TransformerBSWA with 12 memory segments collapses after 20k tokens.

Memory Architecture

Understanding Factual Recall in Transformers via Associative Memories

Eshaan Nichani, Jason D. Lee, Alberto Bietti

· 2024

Understanding Factual Recall in Transformers via Associative Memories analyzes linear associative memories, MLP associative memories, and a one-layer transformer with multi-head self-attention plus an MLP on a synthetic factual recall task. Understanding Factual Recall in Transformers via Associative Memories proves that storing N random associations requires Θ(N log M) bits and that a single-layer transformer achieves 100% accuracy whenever either self-attention or MLP parameters scale linearly with the number of facts.

Long-Term Memory

Understanding the Impact of Long-Term Memory on Self-Disclosure with Large Language Model-Driven Chatbots for Public Health Intervention

Eunkyung Jo, Yuin Jeong et al.

· 2024

CareCall combines a memory management layer, LLM summarizer, and memory-augmented input over HyperCLOVA to store and reuse summaries of users’ Health, Meals, Sleep, Visited Places, and Pets across weekly calls. In deployment to 147 socially isolated adults, CareCall with long-term memory yielded higher Health-detail and Clinical-detail disclosure counts per call than CareCall without memory, and longer average call durations (87.89s vs 75.48s).

Long-Term Memory

Augmenting Language Models with Long-Term Memory

Weizhi Wang, Li Dong et al.

· 2023

LONGMEM augments a frozen GPT-2*-style backbone with a Residual SideNet, Cached Memory Bank, Memory Retrieval and Fusion, and Cross-Network Residual Connections to read and use long-term key–value memories. On ChapterBreak AO3, LONGMEM reaches 40.5% suffix identification accuracy with infinite in-memory context, compared to 28.3% for Memorizing Transformer under the same 1k in-context window.

Benchmark

Empowering Working Memory for Large Language Model Agents

Jing Guo, Nan Li et al.

· 2023

Empowering Working Memory for Large Language Model Agents introduces a Working Memory Hub, Episodic Buffer, Interaction History Window, Central Processor, and External Environment Interface to give LLM agents persistent, structured working and episodic memory. Empowering Working Memory for Large Language Model Agents is a conceptual blueprint rather than a benchmarked system, so no quantitative main result against specific baselines is reported.

Benchmark

Graph-level Anomaly Detection via Hierarchical Memory Networks

Chaoxi Niu, Guansong Pang, Ling Chen

arXiv 2023 · 2023

HimNet combines a GNN Encoder, Node Memory Module, Graph Memory Module, and Graph Decoder to reconstruct graphs via stored normal patterns and score anomalies by reconstruction and approximation errors. On the DD biochemical dataset, HimNet achieves 80.6% AUC compared to 70.6% for PK-iF, a +10.0 point gain over this two-step baseline.

Memory Architecture

Hierarchical Neural Memory Network for Low Latency Event Processing

Ryuhei Hamaguchi, Yasutaka Furukawa et al.

arXiv 2023 · 2023

Hierarchical Neural Memory Network (HMNet) stacks multi-level latent memories z1–z3 with Event-write, Up-write, Down-write, Update, and Readout operations driven by Event Sparse Cross Attention. On DSEC-Semantic, HMNet-L3 reaches 57.4 mIoU with event–RGB fusion, improving over a ResNet-50 baseline at 54.1 mIoU while also reducing latency, and on GEN1 HMNet-B1 matches AED at similar mAP with 57% lower latency.

Long-Term Memory

In search of dispersed memories: Generative diffusion models are associative memory networks

Luca Ambrogioni

· 2023

In search of dispersed memories connects associative memory networks, modern Hopfield networks, generative diffusion models, and the denoising loss to show that diffusion score networks encode Hopfield-like energy landscapes in their weights. In search of dispersed memories demonstrates that exact diffusion dynamics reach Pearson correlations of 0.995–0.996 with modern Hopfield iterations on denoising and completion tasks, while classical Hopfield networks reach only 0.700–0.741.

Long-Term Memory

LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination

Kai Zhang, Yangyang Kang et al.

· 2023

MaLP combines a Dual-Process enhanced Memory (DPeM), Working Memory, Short-Term Memory (STM), Long-Term Memory (LTM), a Coordinator C, and Retriever R to coordinate short- and long-term personalization around a PEFT-tuned LLM. On MaLP’s medical dialogue benchmark with LLaMA-7B, MaLP achieves 69.95% preference classification accuracy and a 91.53% win rate in response generation, improving ROUGE-L profile QA from 29.66 to 33.91 over the LoRA baseline.

Memory Architecture

MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors

Tian-Xing Xu, Yuan-Chen Guo et al.

arXiv 2023 · 2023

MBPTrack combines a Decoupling Feature Propagation Module, BPLocNet, box-prior sampling, and point-to-reference aggregation to track 3D objects from point clouds using temporal memory and size-aware localization. On KITTI, MBPTrack achieves 70.3% Success and 87.9% Precision, improving over CXTrack’s 67.5%/85.3% by +2.8/+2.6.

Benchmark

MemoryBank: Enhancing Large Language Models with Long-Term Memory

Wanjun Zhong, Lianghong Guo et al.

· 2023

MemoryBank combines Memory Storage, Memory Retrieval, and a Memory Updating Mechanism to maintain daily conversations, event summaries, and user portraits for long-term personalization. On a 10-day, 15-user simulated dialog benchmark with 194 probing questions, MemoryBank-powered SiliconFriend ChatGPT achieves 0.716 correctness and 0.912 contextual coherence, surpassing SiliconFriend ChatGLM and SiliconFriend BELLE.

Benchmark

Recurrent Neural Networks and Long Short-Term Memory Networks: Tutorial and Survey

Benyamin Ghojogh, Ali Ghodsi

arXiv 2023 · 2023

Recurrent Neural Networks and Long Short-Term Memory Networks explains how Backpropagation Through Time, LSTM gates and cells, Gated Recurrent Units, bidirectional RNN, and ELMo fit into one dynamical-systems view. Recurrent Neural Networks and Long Short-Term Memory Networks mainly contributes a structured tutorial and survey rather than new benchmark numbers.

Benchmark

Working Memory Capacity of ChatGPT: An Empirical Study

Dongyu Gong, Xingchen Wan, Dingmin Wang

· 2023

Working Memory Capacity of ChatGPT probes ChatGPT with verbal and spatial n-back blocks, adding noise, feedback, and chain-of-thought variants to stress working memory. Working Memory Capacity of ChatGPT finds d′ falls to around 1 at n=3 in most conditions and that GPT-4’s verbal n-back capacity far exceeds Bloomz, ChatGLM, and Vicuna baselines.
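
For readers unfamiliar with d′, the sensitivity index from signal detection theory is the difference between the z-scored hit rate and the z-scored false-alarm rate. The sketch below computes it from raw n-back response counts. The log-linear correction (adding 0.5 to counts) is an assumption made here to avoid infinite z-scores at rates of 0 or 1; the summary does not say which correction, if any, the paper uses.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate).

    Uses a log-linear correction (an assumption of this sketch) so that
    rates of exactly 0 or 1 do not produce infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Made-up counts: chance-level responding versus clearly above chance.
print(d_prime(5, 5, 5, 5))
print(d_prime(9, 1, 1, 9))
```

A d′ near 0 means match and non-match trials are not being distinguished, which is why a fall to around 1 at n=3 indicates weak discrimination.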

Memory Architecture

BayesPCN: A Continually Learnable Predictive Coding Associative Memory

Jason Yoo, Frank Wood

arXiv 2022 · 2022

BayesPCN combines predictive coding, conjugate Bayesian updates over W_{0:L}, sequential importance sampling, and a diffusion-based forget mechanism to build a hierarchical associative memory that supports continual one-shot writes. On CIFAR10 and Tiny ImageNet hetero-associative tasks, BayesPCN matches offline GPCN with MSE as low as 0.0000 while online GPCN rises to 0.0791 MSE on CIFAR10 mask at sequence length 1024.

Benchmark

Episodic Memory Question Answering

Samyak Datta, Sameer Dharur et al.

· 2022

Episodic Memory Question Answering combines allocentric top-down semantic features, spatiotemporal memory, and a LingUNet-based question-answering model to localize answers on scene floorplans from egocentric RGB-D tours. On the EMQA benchmark built from Matterport3D tours, Episodic Memory Question Answering with temporal features achieves 29.11 IoU and 62.27 recall on top-down maps, improving over SMNetDecoder by 2.19 IoU and 18.41 recall.

Long-Term Memory

Evaluating Long-Term Memory in 3D Mazes

Jurgis Pasukonis, Timothy Lillicrap, Danijar Hafner

· 2022

Memory Maze combines an online reinforcement learning environment, an offline dataset, and an offline probing protocol to stress-test long-term memory using Dreamer, Dreamer (TBTT), IMPALA, and supervised probe networks. On Memory 9x9, Dreamer (TBTT) achieves a return of 33.2 compared to 23.4 for IMPALA, while humans reach 26.4 and the oracle reaches 34.8.

Benchmark

Large Language Models with Controllable Working Memory

Daliang Li, Ankit Singh Rawat et al.

· 2022

Knowledge Aware Finetuning (KAFT) augments QA data with relevant, irrelevant, counterfactual, and empty contexts built from SQuAD 2.0, T-REx, QASC, and TriviaQA, and uses the pretrained model’s answer to label irrelevant slices. KAFT achieves up to 24× higher controllability and up to 6× higher robustness than Noisy Finetuning on PaLM 540B while matching TriviaQA validation accuracy.

Memory Architecture

Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models

Beren Millidge, Tommaso Salvatori et al.

· 2022

Universal Hopfield Networks decompose single-shot associative memory into similarity, separation, and projection, instantiating Hopfield networks, sparse distributed memories, dense associative memories, and modern continuous Hopfield networks within one energy-based framework. Universal Hopfield Networks then compare dot product, Euclidean, Manhattan, and other similarity functions, finding that Manhattan and Euclidean similarity often yield higher retrieval capacity and robustness than dot-product-based modern continuous Hopfield networks on MNIST, CIFAR10, and Tiny ImageNet.
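
The similarity, separation, projection decomposition can be sketched as three small steps. This is a minimal illustration, not the paper's code: the function names are hypothetical, softmax is used as the separation function, and negative Manhattan distance is assumed as one way to turn a distance into a similarity score.

```python
import math


def dot_similarity(q, m):
    return sum(a * b for a, b in zip(q, m))


def neg_manhattan(q, m):
    # Negating the distance so that "closer" means "larger score" (assumption).
    return -sum(abs(a - b) for a, b in zip(q, m))


def softmax(scores, beta=1.0):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, keys, values, similarity, beta=10.0):
    scores = [similarity(query, k) for k in keys]  # 1. similarity
    weights = softmax(scores, beta)                # 2. separation
    dim = len(values[0])
    return [                                       # 3. projection
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)
    ]


# Auto-associative toy memory: keys and values are the same patterns.
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.1, 0.0]
print([round(x, 3) for x in retrieve(noisy, patterns, patterns, neg_manhattan, beta=20.0)])
```

Swapping `neg_manhattan` for `dot_similarity` changes only the first step, which is the kind of comparison the summary describes.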

Benchmark

Generalizable Episodic Memory for Deep Reinforcement Learning

Hao Hu, Jianing Ye et al.

arXiv 2021 · 2021

Generalizable Episodic Memory (GEM) combines a parametric network Mθ, implicit memory-based planning, twin back-propagation process, and conservative estimation on single step to generalize episodic returns across trajectories. GEM achieves higher average returns than TD3, SAC, DDPG, and TD3+SIL on MuJoCo tasks like Ant-v2, HalfCheetah-v2, and Humanoid-v2 within 1M environment steps.

Memory Architecture

Hierarchical Associative Memory

Dmitry Krotov

arXiv 2021 · 2021

Hierarchical Associative Memory organizes fully recurrent Modern Hopfield Networks into layered architectures using Lagrangian functions, hierarchical time scales, and symmetric feedforward–feedback weights as core components. Hierarchical Associative Memory theoretically extends Dense Associative Memories to arbitrary depth and local connectivity, deriving explicit dynamical and energy functions rather than reporting benchmark numbers.

Benchmark

Offline Reinforcement Learning with Value-based Episodic Memory

Xiaoteng Ma, Yiqin Yang et al.

arXiv 2021 · 2021

Value-based Episodic Memory (VEM) uses Expectile V-Learning (EVL), Implicit Memory-based Planning, and Generalized Advantage-weighted Learning to learn conservative V-values and plan along offline trajectories. On the D4RL benchmark, VEM attains 87.5 on antmaze-umaze and 128.3 on adroit-hammer-expert, improving over BAIL and CQL on most AntMaze and Adroit tasks.
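
Expectile regression, the idea behind the name Expectile V-Learning, replaces the symmetric squared error with an asymmetric one. The sketch below shows the generic expectile loss over a batch of errors defined as target minus current value estimate; how VEM builds its targets and chooses the expectile is not covered here, and the function name and `tau` values are illustrative assumptions.

```python
def expectile_loss(errors, tau):
    """Mean asymmetric squared error.

    errors: list of (target - estimate) values.
    tau:    expectile in (0, 1); positive errors are weighted by tau,
            non-positive errors by (1 - tau). tau = 0.5 recovers half
            the ordinary mean squared error.
    """
    total = 0.0
    for delta in errors:
        weight = tau if delta > 0 else 1.0 - tau
        total += weight * delta * delta
    return total / len(errors)


# Made-up errors: the same batch under a symmetric and a skewed expectile.
batch = [2.0, -1.0]
print(expectile_loss(batch, tau=0.5))
print(expectile_loss(batch, tau=0.9))
```

With `tau` above 0.5 the loss penalizes underestimates more than overestimates, so the fitted value moves toward the upper end of the target distribution; with `tau` below 0.5 it moves toward the lower end.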

Benchmark

Solving Continuous Control with Episodic Memory

Igor Kuznetsov, Andrey Filchenkov

arXiv 2021 · 2021

Episodic Memory Actor Critic (EMAC) augments DDPG with a Memory Module, Episodic-based Experience Replay Prioritization, and a modified critic objective that blends Bellman targets with retrieved Monte Carlo returns. On OpenAI Gym continuous control, EMAC reaches 2236.88 ± 808 average return on Walker2d-v3, beating TD3 (1008.32) by +1228.56 and SAC (1787.28) by +449.6 in the 200000-step small-data regime.

Benchmark

Working Memory Connections for LSTM

Federico Landi, Lorenzo Baraldi et al.

arXiv 2021 · 2021

Working Memory Connections for LSTM (LSTM-WM) augments LSTM gates, Working Memory Connections, peephole connections, and the memory cell so gates see a protected projection of the cell state. On PTB character-level language modeling, LSTM-WM achieves 1.299 BPC with T_PTB=150 compared to 1.334 BPC for vanilla LSTM.

Memory Architecture

Emergent Symbols through Binding in External Memory

Taylor W. Webb, Ishan Sinha, Jonathan D. Cohen

arXiv 2020 · 2020

Emergent Symbol Binding Network (ESBN) combines an LSTM controller, a shared image encoder fe, temporal context normalization, and a two-column key-value memory to bind abstract variables to concrete image embeddings. ESBN achieves ≥95% test accuracy on same/different, RMTS, distribution-of-three, and identity-rules tasks while generalizing to withheld Unicode characters, unlike LSTM, NTM, MNM, Relation Net, Transformer, and PrediNet baselines.

Memory Architecture

Large Associative Memory Problem in Neurobiology and Machine Learning

Dmitry Krotov, John Hopfield

arXiv 2020 · 2020

Large Associative Memory Problem rewrites associative memory using coupled feature neurons, memory neurons, an energy function, and Lagrangian functions so that all interactions are pairwise yet recover Dense and modern Hopfield behavior. Large Associative Memory Problem shows that with appropriate choices of activation and Lagrangian functions, the effective dynamics match Dense Associative Memories and modern Hopfield networks, enabling storage of N_mem ∼ min(N_f^{n−1}, N_h) or even exponential in N_f memories without many‑body synapses.

Benchmark

Learning to Learn Variational Semantic Memory

Xiantong Zhen, Yingjun Du et al.

arXiv 2020 · 2020

Variational Semantic Memory combines variational prototype inference, variational semantic memory, latent memory m, and an attention-based memory update to build probabilistic class prototypes from long-term semantic knowledge. On miniImageNet 5-way 1-shot with a deep backbone, Variational Semantic Memory reaches 65.72% accuracy versus 64.82% for Tian et al. 2020.

Memory Architecture

MEMO: A Deep Network for Flexible Combination of Episodic Memories

Andrea Banino, Adrià Puigdomènech Badia et al.

· 2020

MEMO combines common embeddings, multi-head keys and values, recurrent attention, and a halting policy to flexibly chain episodic memories over multiple hops. On joint bAbI 10k, MEMO achieves 0.21% error versus 4.2% for Memory Networks, while also solving long-distance Paired Associative Inference and shortest path tasks.

Memory Architecture

Self-Attentive Associative Memory

Hung Le, Truyen Tran, Svetha Venkatesh

· 2020

Self-Attentive Associative Memory (STM) combines Outer Product Attention (OPA), Self-attentive Associative Memory (SAM), Mi-Write, Mr-Read, and Mr-Transfer into a dual item–relational memory system. On the bAbI question answering benchmark, STM attains 0.39 ± 0.18 mean error versus 0.55 ± 0.74 for MNM-p, establishing a new state-of-the-art.

Memory Architecture

Adaptive Posterior Learning: few-shot learning with a surprise-based memory module

Tiago Ramalho, Marta Garnelo

arXiv 2019 · 2019

Adaptive Posterior Learning combines an Encoder, Memory store, Memory controller, and Decoder (relational self-attention, relational working memory, or LSTM) to approximate posteriors from a sparse external memory. On Omniglot, Adaptive Posterior Learning achieves 99.9% 5-way 5-shot accuracy, matching MAML and SNAIL, while using fewer than 2 stored examples per class.

Benchmark

Episodic Memory in Lifelong Language Learning

Cyprien de Masson d'Autume, Sebastian Ruder et al.

arXiv 2019 · 2019

Episodic Memory in Lifelong Language Learning combines an example encoder, task decoder, and episodic memory for sparse experience replay and Memory-based Parameter Adaptation (MBPA++). Episodic Memory in Lifelong Language Learning attains 70.6 averaged classification accuracy vs 66.9 for A-GEM and 62.4 QA F1 vs 57.9 for REPLAY on concatenated multi-dataset streams.

Benchmark

Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

Moonsu Han, Minki Kang et al.

arXiv 2019 · 2019

Episodic Memory Reader combines a Data Encoder, Memory Encoder (EMR-Independent, EMR-biGRU, EMR-Transformer), Value Network, external memory, and QA solver to learn which streaming items to keep. On TriviaQA, Episodic Memory Reader with EMR-biGRU achieves 57.57 F1 versus 50.10 for LIFO, and on TVQA Episodic Memory Reader reaches about 65% accuracy with 60 memory entries versus roughly 55% for LIFO.

Benchmark

Generalization of Reinforcement Learners with Working and Episodic Memory

Meire Fortunato, Melissa Tan et al.

arXiv 2019 · 2019

Memory Recall Agent (MRA) integrates a pixel‑input convolutional residual network, an LSTM working memory, a slot‑based episodic memory, an auxiliary contrastive loss, and jumpy backpropagation into a single reinforcement learning agent. Across the Memory Tasks Suite with train/holdout splits, Memory Recall Agent (MRA) attains the highest average human‑normalized performance compared to LSTM‑only IMPALA and other ablations, especially on tasks requiring episodic recall.

Benchmark

On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach et al.

arXiv 2019 · 2019

On Tiny Episodic Memories in Continual Learning combines Experience Replay, Reservoir Sampling, Ring Buffer, k-Means, and Mean of Features memory writing to jointly train on current-task data and a tiny episodic memory. On Split CIFAR with only 1 example per class, On Tiny Episodic Memories in Continual Learning reaches about 0.56 average accuracy, a +15.6 percentage point gain over FINETUNE and +15 percentage points over EWC.
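
Of the memory-writing strategies listed, reservoir sampling is the easiest to show in a few lines: it keeps a fixed-size buffer in which every example seen so far has the same probability of being stored. The sketch below is the textbook algorithm, not the paper's code; the class name, `capacity`, and `seed` are hypothetical.

```python
import random


class ReservoirMemory:
    """Fixed-size episodic memory filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of examples observed so far
        self.rng = random.Random(seed)

    def write(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not full yet: always store.
            self.items.append(example)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples to replay alongside current data."""
        return self.rng.sample(self.items, min(k, len(self.items)))


memory = ReservoirMemory(capacity=5)
for example in range(100):
    memory.write(example)
print(len(memory.items), memory.seen)
```

During training, a batch from `memory.sample(k)` would be mixed with the current-task batch, which is the joint training the summary refers to.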

Benchmark

Working Memory Graphs

Ricky Loynd, Roland Fernandez et al.

arXiv 2019 · 2019

Working Memory Graphs combines a Core vector, Factor vectors, and persistent Memo vectors processed by a multi-layer Transformer to implement shortcut recurrence over past observations. On the Pathfinding task, Working Memory Graphs with Memos nearly matches a full-history non-recurrent Working Memory Graphs baseline and exceeds a GRU agent by roughly 9.5 percentage points in zero-shot quiz accuracy on 24-step episodes.

Benchmark

A Dataset and Architecture for Visual Reasoning with a Working Memory

Guangyu Robert Yang, Igor Ganichev et al.

arXiv 2018 · 2018

COG combines Visual processing, Semantic processing, Visual short-term memory, and a Controller so the network can parse instructions, attend over images, and maintain working memory. The network reaches 96.8% overall accuracy on CLEVR versus 95.5% for CNN+LSTM+RN, while achieving 93.7% on canonical COG and strong zero-shot task generalization.

Benchmark

Chinese Poetry Generation with a Working Memory Model

Xiaoyuan Yi, Maosong Sun et al.

arXiv 2018 · 2018

The Working Memory model combines a topic memory, history memory, local memory, genre embedding, and Topic Trace mechanism inside a GRU encoder–decoder to dynamically read and write salient poem context. On Chinese quatrains, the Working Memory model achieves BLEU 1.315 and perplexity 86 versus iPoet’s BLEU 0.425 and perplexity 138.

Benchmark

Deep Episodic Memory: Encoding, Recalling, and Predicting Episodic Experiences for Robot Action Execution

Jonas Rothfuss, Fabio Ferreira et al.

arXiv 2018 · 2018

Deep Episodic Memory uses an encoder network E, reconstruction-decoder Dr, prediction-decoder Dp, latent vector V, and a matching and retrieval mechanism to turn raw video into episodic encodings that can be reconstructed and predicted. On ActivityNet, Deep Episodic Memory with PCA achieves 45.55% first-match precision versus 32.31% for ResNet-50 Fisher Vectors, a +13.24 percentage point gain.

Benchmark

Episodic Memory Deep Q-Networks

Zichuan Lin, Tianqi Zhao et al.

arXiv 2018 · 2018

Episodic Memory Deep Q-Networks (EMDQN) augments Qθ(s, a) with an inference target S, an episodic memory target H, and a memory table built via random projection and kd-tree lookup. On 57 Atari games at 40M frames, EMDQN achieves a 528.4% mean human-normalized score versus 151.2% for DQN and 144.8% for NEC.
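
One way to read the two targets is as two squared-error terms on the same Q-value: one pulling toward the inference target S and one pulling toward the episodic memory target H. The sketch below assumes that form, with a weighting coefficient `lam` and a memory table that keeps the best return seen for each key; the function names, the default `lam`, and the dictionary-based table (in place of random projection and kd-tree lookup) are all hypothetical simplifications.

```python
def emdqn_loss(q_value, s_target, h_target, lam=0.1):
    """Squared error to the inference target S plus a weighted squared
    error to the episodic memory target H (assumed form, see lead-in)."""
    return (q_value - s_target) ** 2 + lam * (q_value - h_target) ** 2


def update_memory(table, key, episodic_return):
    """Keep the highest return observed so far for this (state, action) key."""
    table[key] = max(table.get(key, float("-inf")), episodic_return)


# Made-up values for a single (state, action) pair.
table = {}
update_memory(table, ("s0", "left"), 1.0)
update_memory(table, ("s0", "left"), 0.5)  # lower return does not overwrite
print(table[("s0", "left")])
print(emdqn_loss(q_value=1.0, s_target=2.0, h_target=3.0, lam=0.5))
```

Setting `lam` to 0 reduces the loss to an ordinary squared error against S alone, which makes the episodic term easy to ablate in a sketch like this.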

Memory Architecture

Progressive Memory Banks for Incremental Domain Adaptation

Nabiha Asghar, Lili Mou et al.

arXiv 2018 · 2018

Progressive Memory Banks for Incremental Domain Adaptation augments a BiLSTM with a directly parameterized memory bank, key-value memory, and an attention-based memory mechanism that is progressively expanded during incremental domain adaptation. On MultiNLI Fic→Gov, Progressive Memory Banks for Incremental Domain Adaptation with memory and vocabulary expansion reaches 67.55% on Fic and 70.82% on Gov, compared to 65.62% and 69.90% for fine-tuning with no memory expansion.

Memory Architecture

Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory

Hao Zhou, Minlie Huang et al.

arXiv 2017 · 2017

Emotional Chatting Machine (ECM) augments a GRU encoder-decoder with Emotion Category Embedding, Internal Memory, and External Memory to control emotional content in generated replies. On the Emotional STC dataset, ECM reaches 0.773 emotion accuracy vs 0.724 for Emb and 0.179 for Seq2Seq, while keeping perplexity comparable.

Benchmark

Gradient Episodic Memory for Continual Learning

David Lopez-Paz, Marc'Aurelio Ranzato

arXiv 2017 · 2017

Gradient Episodic Memory stores task examples in episodic memory Mt, constrains updates via inequality constraints on past-task losses, and solves a small quadratic program (GEM QP) to project gradients. Gradient Episodic Memory achieves 0.654 ACC on Incremental CIFAR100 with 5,120 memory slots, compared to 0.508 ACC for iCARL.
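
The gradient projection is easiest to see in the special case of a single past task, where the quadratic program has a closed-form solution: if the proposed gradient has a negative inner product with the gradient computed on the memory, remove the conflicting component. The sketch below covers only that one-constraint case with plain Python lists; the general case with several past tasks needs a QP solver and is not shown. The function name is hypothetical.

```python
def project_gradient(g, g_mem):
    """Project g so that its inner product with g_mem is non-negative.

    g:     proposed gradient on the current task.
    g_mem: gradient of the loss on the single past task's memory.
    Returns the closest vector to g (in Euclidean distance) that does not
    point against g_mem.
    """
    dot = sum(a * b for a, b in zip(g, g_mem))
    if dot >= 0:
        return list(g)  # constraint already satisfied, keep g unchanged
    norm_sq = sum(b * b for b in g_mem)
    return [a - (dot / norm_sq) * b for a, b in zip(g, g_mem)]


# Made-up 2-D gradients: the first component conflicts with the memory gradient.
print(project_gradient([-1.0, 1.0], [1.0, 0.0]))
```

After projection the update no longer has a component pointing against the memory gradient, which corresponds to the inequality constraint in the summary above.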

Long-Term Memory

Neural SLAM: Learning to Explore with External Memory

Jingwei Zhang, Lei Tai et al.

arXiv 2017 · 2017

Neural SLAM combines an LSTM, Localization and Motion Prediction, Data Association, Measurement Update, and Mapping over an external memory map to guide exploration policies. On 16×16 grid worlds, Neural SLAM achieves 13.732 average reward and 46/50 success episodes, a +6.536 reward gain over A3C-Nav2.

Long-Term Memory

On the Long-Term Memory of Deep Recurrent Networks

Yoav Levine, Or Sharir et al.

arXiv 2017 · 2017

On the Long-Term Memory of Deep Recurrent Networks analyzes Recurrent Arithmetic Circuits, Start-End separation rank, grid tensors, and Tensor Network constructions to quantify how depth affects temporal expressivity. The main result proves depth-2 RACs achieve Start-End separation rank on the order of the multiset coefficient (min{M,R} + T/2 − 1 choose T/2), while depth-1 RACs are limited to rank min{R, M^{T/2}}.
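
Written out in LaTeX, the two quantities quoted above read as follows. The symbols M, R, and T are taken as given from the summary and are not defined further here; the depth-2 expression is the multiset coefficient, i.e. the number of multisets of size T/2 drawn from min{M, R} elements.

```latex
\[
\text{depth-2 (order of): } \binom{\min\{M,R\} + T/2 - 1}{T/2},
\qquad
\text{depth-1 (at most): } \min\{R,\; M^{T/2}\}.
\]
```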

About

Why we built this

Memory is the missing piece of truly useful AI. Without memory, every conversation starts from scratch, with no context, no personalization, and no real understanding of who you are or what you need.

At Mem0, we're building the memory layer for AI. This site is our way of sharing the research that inspired and informs our work, made accessible to everyone, not just academics.

Each paper here represents a step toward AI systems that genuinely remember, learn, and improve over time. We hope this collection helps you understand where the field is headed.