Agent Memory · Long-Term Memory
Guilin Zhang, Wei Jiang et al.
· 2026
A-MAC scores candidate memories on Utility, Confidence, Novelty, Recency, and Type Prior, and combines the five scores with a learned linear admission policy (Algorithm 1, A-MAC Memory Admission). On the LoCoMo benchmark, A-MAC achieves F1 0.583 at 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.
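A linear admission rule over these five signals can be sketched as a weighted sum compared against a cutoff. This is a minimal illustration only: the weights, the threshold, the assumption that every signal is scaled to [0, 1], and the names `Candidate`, `score`, and `admit` are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Assumption: each signal has already been scaled to [0, 1].
    utility: float
    confidence: float
    novelty: float
    recency: float
    type_prior: float


# Hypothetical weights; a learned policy would fit these from data.
WEIGHTS = {
    "utility": 0.35,
    "confidence": 0.20,
    "novelty": 0.20,
    "recency": 0.10,
    "type_prior": 0.15,
}
THRESHOLD = 0.5  # assumed admission cutoff


def score(c: Candidate) -> float:
    """Weighted sum of the five signals."""
    return sum(w * getattr(c, name) for name, w in WEIGHTS.items())


def admit(c: Candidate) -> bool:
    """Admit the candidate memory when its score reaches the cutoff."""
    return score(c) >= THRESHOLD
```

Under these placeholder weights, a candidate that is high on utility and confidence is admitted, while one that is low on every signal is dropped.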
Long-Term Memory
Robbyant Team, Zelin Gao et al.
arXiv 2026 · 2026
LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.
Benchmark · Long-Term Memory
Manoj Madushanka Perera, Adnan Mahmood et al.
· 2026
AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
Long-Term Memory
Can Lv, Heng Chang et al.
· 2026
All-Mem organizes long-term agent memory through Online/Offline Decoupling, Agentic Topology Consolidation, and Topology-Aware Retrieval over a topology-structured memory bank. On LoCoMo, All-Mem reaches 54.63 4o-J versus Mem0’s 48.91, and on LongMemEval-S All-Mem reaches 60.20 4o-J versus Mem0’s 55.80.
Agent Memory · Long-Term Memory
Weiquan Huang, Zixuan Wang et al.
· 2026
AMA orchestrates four agents — the Constructor, Retriever, Judge, and Refresher — to build Raw Text, Fact Knowledge, and Episode Memory and route queries adaptively across these granularities. On the LoCoMo benchmark with GPT-4.1-mini, AMA achieves an overall LLM Score of 0.805 compared to Nemori’s 0.774, while reducing token consumption by approximately 80% relative to FullContext.
Benchmark · Long-Term Memory
Deliang Wen, Ke Sun, Yu Wang
· 2026
A-MBER builds multi-session conversational scenarios via a staged pipeline of persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging. On A-MBER, a structured memory system reaches 0.69 judgment accuracy, 0.66 retrieval, and 0.65 explanation versus 0.34, 0.29, and 0.31 for a no-memory baseline.
Agent Memory · Long-Term Memory
AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
Survey · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture
Zehao Lin, Chunyu Li, Kai Chen
· 2026
Mnemonic Sovereignty analyzes the Write, Store, Retrieve, Execute, Share, and Forget/Rollback phases of long-term agent memory against integrity, confidentiality, availability, and governance objectives. Its lifecycle matrix shows that most of the ~70 works cluster on write and retrieve integrity, leaving store, availability, and governance primitives such as write-gate validation and post-deletion verification almost entirely unexplored.
Benchmark · Agent Memory · Long-Term Memory
Zexue He, Yu Wang et al.
· 2026
MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.
Benchmark · Long-Term Memory
Sangyeon Yoon, Sunkyoung Kim et al.
· 2026
BenchPreS combines Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to test context-aware preference selectivity in persistent-memory LLMs. On BenchPreS, GPT-5.2 reaches an 87.33% Appropriate Application Rate but still has a 40.95% Misapplication Rate, while Gemini 3 Pro’s Misapplication Rate is 86.48%.
Long-Term Memory
Sahil Sen, Elias Lumer et al.
· 2026
Chronos decomposes dialogue into structured events via the Event Extraction pipeline, stores them in dual event calendar and turn calendar indexes, and uses Dynamic Prompting, Initial Retrieval, and the Chronos Agent for temporal-aware tool-calling. On LongMemEvalS, Chronos Low reaches 92.60% overall accuracy and Chronos High 95.60%, beating EmergenceMem Internal by 7.67 percentage points and Mastra’s OM by 3.02 points.
Benchmark · Agent Memory · Long-Term Memory
Benjamin Stern, Peter Nadel
· 2026
Drawing on Memory uses dual-trace memory encoding, an evidence scoring gate, and a three-state retrieval protocol to store paired fact and scene traces in Letta’s archival memory. On LongMemEval-S, Drawing on Memory reaches 73.7% accuracy versus 53.5% for the fact-only C7-control baseline, a +20.2 percentage point gain concentrated in temporal, update, and multi-session questions.
Benchmark · Agent Memory · Long-Term Memory
Chingkwun Lam, Jiaxin Li et al.
· 2026
SSGM interposes a Governance Middleware, Read Filtering Gate, Write Validation Gate, and a dual substrate of Mutable Active Graph plus Immutable Episodic Log between agents and memory. SSGM unifies evolving-memory systems into a four-dimensional failure taxonomy and proves that periodic reconciliation can bound semantic drift over infinite horizons.
RAG · Long-Term Memory
Yijie Zhong, Yunfan Gao, Haofen Wang
· 2026
HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
Cognitive Architecture · Long-Term Memory
Diego C. Lerma-Torres
· 2026
Human-Like Lifelong Memory combines Executive Function and Working Memory, a Memory Service Knowledge Graph, and a Thalamic Gateway to implement dual-process, valence-aware lifelong memory. Human-Like Lifelong Memory is a theoretical framework with seven functional properties and testable predictions rather than benchmark numbers against specific baselines.
Benchmark · Agent Memory · Long-Term Memory · Memory Architecture
Jiaquan Zhang, Chaoning Zhang et al.
· 2026
LightMem orchestrates an SLM-1 Controller, SLM-2 Selector, SLM-3 Writer, and STM, MTM, and LTM stores to modularize retrieval, writing, and offline consolidation. On LoCoMo, LightMem reaches 34.50 F1 on multi-hop questions with GPT-4o, +1.64 over A-MEM, while keeping median retrieval latency at 83 ms.
Long-Term Memory
LPC-SM combines local attention, dual-timescale memory, predictive correction, Orthogonal Novelty Transport, and multi-head-coupled residual routing (mHC) inside a single autoregressive block. On OpenWebMath-10k continuation, LPC-SM with adaptive sparse control reaches final LM loss 10.787 versus 12.137 for a fixed sparse controller, a 12.517% improvement.
Benchmark · Agent Memory · Long-Term Memory
Weiwei Xie, Shaoxiong Guo et al.
· 2026
MemEvoBench combines Misleading Memory Injection, Noisy Tool Returns, Biased User Feedback, and a Memory Modification Tool (+ModTool) to stress-test long-term memory safety in LLM agents across 7 domains and 36 risk types. On the QA Style benchmark, MemEvoBench shows Gemini-2.5-Pro’s ASR drops from 67.0% (Vanilla) to 19.0% with +ModTool in Round 1, while biased feedback can push GPT-5’s QA ASR from 59.0% to 78.0% by Round 3.
RAG · Benchmark · Long-Term Memory
Shu Wang, Edwin Yu et al.
· 2026
MemMachine combines Short-term memory, Long-term memory, Profile memory, and the Retrieval Agent to store raw conversational episodes and retrieve clustered context around nucleus matches. On LoCoMo, MemMachine scores 0.9169 with gpt-4.1-mini while using about 80% fewer input tokens than Mem0, and reaches 93.0% on LongMemEvalS with GPT-5-mini.
RAG · Agent Memory · Long-Term Memory · Memory Architecture
Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Instead of benchmark gains, Memory as Metabolism’s main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.
Long-Term Memory
Shengtao Zhang, Jiaqian Wang et al.
· 2026
MemRL structures memory as an Intent-Experience-Utility bank and uses Two-Phase Retrieval plus Runtime Utility Update to learn a value-aware retrieval policy over a frozen LLM. On ALFWorld exploration, MemRL achieves 0.979 success rate compared to 0.921 for MemP, a +0.058 gain with the same frozen GPT-5-mini backbone.
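One way to picture two-phase retrieval with a runtime utility update is to shortlist entries by intent similarity, rerank the shortlist by stored utility, and then nudge the utility of the entry that was used toward the observed reward. The sketch below is an assumption-laden illustration: the cosine shortlist, the moving-average update, the learning rate, and the names `Entry`, `retrieve`, and `update_utility` are hypothetical and not the paper's exact procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    intent: list          # toy embedding of the intent
    experience: str       # stored experience text
    utility: float = 0.0  # running value estimate


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(bank, query, k_sim=3, k_final=1):
    # Phase 1: shortlist the k_sim entries whose intent is closest to the query.
    shortlist = sorted(bank, key=lambda e: cosine(e.intent, query), reverse=True)[:k_sim]
    # Phase 2: rerank the shortlist by utility and keep the top k_final.
    return sorted(shortlist, key=lambda e: e.utility, reverse=True)[:k_final]


def update_utility(entry: Entry, reward: float, lr: float = 0.5) -> None:
    # Assumed update rule: move the stored utility toward the observed reward.
    entry.utility += lr * (reward - entry.utility)
```

Because only the stored utilities change, the LLM itself stays frozen in this picture; an entry that is similar but repeatedly unhelpful sinks in phase 2 even though it still passes phase 1.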
Long-Term Memory
Bowen Yang, Kaiming Jin et al.
arXiv 2026 · 2026
OS-SYMPHONY coordinates an Orchestrator, Reflection-Memory Agent, and Versatile Tool Agents (Multimodal Searcher, Grounders, Coder) to stabilize long-horizon GUI workflows and fetch visual tutorials on demand. On OSWorld-Verified, OS-SYMPHONY with GPT-5 scores 65.84% at 100 steps, beating Agent S3 w/ GPT-5 (62.63%) by 3.21 percentage points.
BenchmarkLong-Term Memory
Mohammad Tavakoli, Alireza Salemi et al.
arXiv 2025 · 2025
LIGHT augments LLMs with Retrieval from the Conversation, Scratchpad Formation and Utilization, and a Working Memory buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
RAG · Benchmark · Agent Memory · Long-Term Memory · Memory Architecture
Alessandra Terranova, Björn Ross, Alexandra Birch
· 2025
Evaluating Long-Term Memory for Long-Context Question Answering compares Full Context, RAG, A-Mem, RAG+PromptOpt, and RAG+EpMem memory components across semantic, episodic, and procedural memory for long conversational QA. On LoCoMo, RAG+EpMem reaches an average F1 ranking of 1.83 for Llama 3.2-3B Instruct and 1.80 for GPT-4o mini while using around 1,000 tokens per query versus over 23,000 for Full Context.
Agent Memory · Long-Term Memory · Memory Architecture
Zhengjun Huang, Zhoujin Tian et al.
· 2025
LiCoMemory organizes long-term dialogue with CogniGraph, Query Processing and Integrated Rerank, and Real-Time Interactions to keep session summaries, triples, and chunks linked. On LongMemEval with GPT-4o-mini, LiCoMemory reaches 73.80% accuracy and 76.63% recall, beating Mem0g by 9.0 and 7.1 points, respectively.
Benchmark · Long-Term Memory · Memory Architecture
Jizhan Fang, Xinle Deng et al.
· 2025
LightMem pipelines a Cognitive-Inspired Sensory Memory, Topic Segmentation Submodule, Topic-Aware Short-Term Memory, and Long-Term Memory with Sleep-Time Update to filter, group, summarize, and asynchronously consolidate dialogue history. On LongMemEval-S with Qwen3-30B-A3B-Instruct-2507, LightMem reaches 70.20% ACC vs 65.20% for A-MEM (+5.00 points) while reducing total token usage by up to 21.8× and API calls by up to 17.1×.
Pick · Long-Term Memory
Prateek Chhikara, Dev Khant et al.
arXiv 2025 · 2025
Mem0 incrementally processes conversations using the extraction phase, update phase, asynchronous summary generation module, tool call mechanism, and a vector database to build scalable long-term memory. On the LOCOMO benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for OpenAI and cuts p95 latency from 17.117s to 1.440s compared to the full-context baseline.
Cognitive Architecture · Long-Term Memory
Tarik Houichime, Abdelghani Souhar, Younes El Amrani
· 2025
Phonetic Trajectory Memory (PTM) combines the Acoustic Injection, Entropy Filter, Neuro-Symbolic Relay, and Resonance Engine to encode text as a continuous trajectory on an ergodic Hyper-Torus Memory instead of a growing KV cache. PTM delivers >3,000× signal-to-KV compression while maintaining ≈92% factual accuracy and sub-50 ms retrieval latency on long narrative and scientific corpora compared to dense KV baselines.
Long-Term Memory
Yu Wang, Dmitry Krotov et al.
· 2025
M+ combines short-term memory θ, long-term memory Θ, a co-trained retriever, and a Multi-LoRA design on top of MemoryLLM’s layer-wise memory pools. On SQuAD-style knowledge retention, M+ maintains accuracy beyond 160k tokens, while MemoryLLM-7B collapses before 20k tokens and Llama-3.1-8B-SnapKV fails beyond 30k tokens.
RAG · Long-Term Memory · Memory Architecture
Aneesh Jonelagadda, Christina Hahn et al.
· 2025
Mnemosyne combines a Commitment pipeline with substance and redundancy filters, a probabilistic Recall traversal over a graph-structured store, asynchronous Core Summary updates, and a Pruning module to manage long-term memory on edge devices. On the LoCoMo benchmark, Mnemosyne reaches 60.42% temporal reasoning J-score and a 54.55% overall J-score, compared to 51.55% temporal reasoning and 62.74% overall for Memory-R1, and achieves a 65.8% win rate over a 31.07% naive RAG baseline in human evaluations.
Benchmark · Long-Term Memory
Sangyeop Kim, Yohan Lee et al.
· 2025
PREMem builds long-term dialogue memory by combining Episodic Memory Extraction, Pre-Storage Memory Reasoning, semantic clustering, a persistent memory pool, and an inference phase over enriched memory fragments. PREMem reaches an LLM-as-a-judge score of 71.4 on LongMemEval with gpt-4.1 base, a +15.5 gain over HippoRAG 2 and +9.6 over A-Mem.
RAG · Benchmark · Long-Term Memory
Ao Tian, Yunfeng Lu et al.
· 2025
RGMem builds a multi-scale memory state using Microscopic Evidence Space DL0, Structured Knowledge Space G, and renormalization operators RK1, RK2, RK3 to evolve user profiles. On PersonaMem with GPT-4.1, RGMem reaches 74.01% Avg., beating Memory OS by 8.98 points.
Long-Term Memory
Yiming Du, Hongru Wang et al.
· 2024
PerLTQA builds a synthetic personal memory database and runs questions through Memory Classification, Memory Retrieval, and Memory Synthesis to test how LLMs use semantic and episodic memories. On the PerLTQA benchmark, BERT-base achieves 95.7 F1 for memory classification, while gpt-3.5-turbo reaches MAP 0.756 for memory synthesis with retrieval and classification.
Long-Term Memory
Eunkyung Jo, Yuin Jeong et al.
· 2024
CareCall combines a memory management layer, LLM summarizer, and memory-augmented input over HyperCLOVA to store and reuse summaries of users’ Health, Meals, Sleep, Visited Places, and Pets across weekly calls. In deployment to 147 socially isolated adults, CareCall with long-term memory yielded higher Health-detail and Clinical-detail disclosure counts per call than CareCall without memory, and longer average call durations (87.89s vs 75.48s).
Long-Term Memory
Weizhi Wang, Li Dong et al.
· 2023
LONGMEM augments a frozen GPT-2*-style backbone with a Residual SideNet, Cached Memory Bank, Memory Retrieval and Fusion, and Cross-Network Residual Connections to read and use long-term key–value memories. On ChapterBreak AO3, LONGMEM reaches 40.5% suffix identification accuracy with infinite in-memory context, compared to 28.3% for Memorizing Transformer under the same 1k in-context window.
Long-Term Memory
In search of dispersed memories connects associative memory networks, modern Hopfield networks, generative diffusion models, and the denoising loss to show that diffusion score networks encode Hopfield-like energy landscapes in their weights. In search of dispersed memories demonstrates that exact diffusion dynamics reach Pearson correlations of 0.995–0.996 with modern Hopfield iterations on denoising and completion tasks, while classical Hopfield networks reach only 0.700–0.741.
Long-Term Memory
Kai Zhang, Yangyang Kang et al.
· 2023
MaLP combines a Dual-Process enhanced Memory (DPeM), Working Memory, Short-Term Memory (STM), Long-Term Memory (LTM), a Coordinator C, and Retriever R to coordinate short- and long-term personalization around a PEFT-tuned LLM. On MaLP’s medical dialogue benchmark with LLaMA-7B, MaLP achieves 69.95% preference classification accuracy and a 91.53% win rate in response generation, improving ROUGE-L profile QA from 29.66 to 33.91 over the LoRA baseline.
Long-Term Memory
Lei Liu, Xiaoyan Yang et al.
· 2023
Think-in-Memory (TiM) augments an LLM agent with a hash-based Memory Cache, a Hash-based Mapping F(·), and Insert, Forget, Merge organization operations over inductive thoughts. On the Chinese part of the GVD dataset, TiM with ChatGLM raises contextual coherence from 0.428 to 0.665 compared to SiliconFriend.
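A hash-bucketed thought store with insert, forget, and merge operations might look roughly like the sketch below. It is only an illustration under stated assumptions: the sign-based hash over two fixed hyperplanes stands in for the mapping F(·), thoughts are plain strings, and the `ThoughtCache` class and its method names are hypothetical.

```python
from collections import defaultdict

# Assumption: two fixed hyperplanes give a toy locality-sensitive hash,
# so nearby vectors tend to fall into the same bucket.
PLANES = [(1.0, 0.0), (0.0, 1.0)]


def bucket_of(vec):
    """Map a vector to a bucket key made of one sign bit per hyperplane."""
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in PLANES
    )


class ThoughtCache:
    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, vec, thought):
        # Store the thought in the bucket its vector hashes to.
        self.buckets[bucket_of(vec)].append(thought)

    def forget(self, vec, thought):
        # Remove one thought from its bucket if it is present.
        bucket = self.buckets.get(bucket_of(vec), [])
        if thought in bucket:
            bucket.remove(thought)

    def merge(self, vec, merged_thought):
        # Replace every thought in the bucket with a single merged thought.
        self.buckets[bucket_of(vec)] = [merged_thought]

    def recall(self, vec):
        # Return only the thoughts in the query's bucket.
        return list(self.buckets.get(bucket_of(vec), []))
```

Recall only inspects the bucket the query hashes to, which is the point of hashing: lookup cost depends on bucket size rather than on the size of the whole memory.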
Long-Term Memory
Jurgis Pasukonis, Timothy Lillicrap, Danijar Hafner
· 2022
Memory Maze combines an online reinforcement learning environment, an offline dataset, and an offline probing protocol to stress-test long-term memory using Dreamer, Dreamer (TBTT), IMPALA, and supervised probe networks. On Memory 9x9, Dreamer (TBTT) achieves a return of 33.2 compared to 23.4 for IMPALA, while humans reach 26.4 and the oracle reaches 34.8.
Long-Term Memory
Jingwei Zhang, Lei Tai et al.
arXiv 2017 · 2017
Neural SLAM combines an LSTM, Localization and Motion Prediction, Data Association, Measurement Update, and Mapping over an external memory map to guide exploration policies. On 16×16 grid worlds, Neural SLAM achieves 13.732 average reward and 46/50 success episodes, a +6.536 reward gain over A3C-Nav2.
Long-Term Memory
Yoav Levine, Or Sharir et al.
arXiv 2017 · 2017
On the Long-Term Memory of Deep Recurrent Networks analyzes Recurrent Arithmetic Circuits, Start-End separation rank, grid tensors, and Tensor Network constructions to quantify how depth affects temporal expressivity. The main result proves depth-2 RACs achieve Start-End separation rank on the order of the multiset coefficient (min{M,R} + T/2 − 1 choose T/2), while depth-1 RACs are limited to rank min{R, M^{T/2}}.
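To get a feel for the gap between the two expressions quoted above, they can be evaluated numerically. The sketch below treats M, R, and T as plain positive integers with T even; the function names are hypothetical, and the depth-2 value is the quoted order of growth rather than an exact rank.

```python
from math import comb


def depth1_rank_limit(M: int, R: int, T: int) -> int:
    """min{R, M^(T/2)}: the quoted limit for depth-1 RACs (T assumed even)."""
    return min(R, M ** (T // 2))


def depth2_rank_order(M: int, R: int, T: int) -> int:
    """Multiset coefficient C(min{M,R} + T/2 - 1, T/2) quoted for depth-2 RACs."""
    n = min(M, R)
    k = T // 2
    return comb(n + k - 1, k)


# Example: M = R = 10 and T = 8.
print(depth1_rank_limit(10, 10, 8))  # 10
print(depth2_rank_order(10, 10, 8))  # 715
```

With R fixed, the depth-1 expression saturates at R as T grows, while the depth-2 expression keeps growing combinatorially in T/2.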