MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Authors: Shu Wang, Edwin Yu, Oscar Love et al.

2026

TL;DR

MemMachine uses ground-truth episodic storage plus contextualized retrieval to reach 93.0% on LongMemEvalS and 0.9169 on LoCoMo with gpt-4.1-mini.

THE PROBLEM

Long-horizon agents break down on multi-session memory when built on brittle RAG workflows (the OpenAI baseline scores only 0.5290 on LoCoMo)

Standard RAG and context windows struggle with multi-session interactions, leading to brittle personalization and factual drift over long horizons.

On LoCoMo, the OpenAI baseline only reaches an overall score of 0.5290, limiting reliable personalization, temporal reasoning, and multi-hop conversational recall.

HOW IT WORKS

MemMachine — Ground-truth episodic memory with contextualized retrieval

MemMachine combines Short-term memory, Long-term memory, Profile memory, and a Retrieval Agent to store raw episodes and retrieve sentence-level evidence.

Think of MemMachine like RAM plus disk plus a card catalog: short-term memory is RAM, long-term episodic storage is disk, and contextualized retrieval is the catalog that pulls neighboring pages.

This design lets MemMachine preserve exact conversational ground truth while selectively surfacing relevant episode clusters, something a plain context window or naive RAG cannot achieve.

DIAGRAM

MemMachine Memory Recall Pipeline

This diagram shows how MemMachine processes a query through short-term search, long-term vector search, contextualization, and reranking before returning episodes.

DIAGRAM

LongMemEvalS Ablation Design in MemMachine

This diagram shows how MemMachine varies ingestion and retrieval settings across LongMemEvalS configurations to measure accuracy gains.

PROCESS

How MemMachine Handles a Multi-session Conversational Query

  1. Data Ingestion

    MemMachine converts each message into an Episode with producer, timestamp, session id, and custom metadata, then dispatches it to Short-term memory and Long-term memory.

  2. Sentence Extraction

    MemMachine segments each Episode into sentences using NLTK Punkt, links them back to episodes, and embeds them into the vector-backed Long-term memory.

  3. Contextualized Retrieval

    MemMachine runs vector search to find nucleus episodes, expands them with neighboring episodes into clusters, and reranks clusters before assembling STM and LTM context.

  4. Retrieval Agent

    MemMachine optionally routes the query through the Retrieval Agent, choosing direct search, SplitQuery, or ChainOfQuery, then passes retrieved context to the answer LLM.
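The storage-and-retrieval path in the steps above can be sketched in a few lines of Python. This is a hypothetical illustration, not MemMachine's API: `Episode`, `ingest`, and `retrieve` are made-up names, substring matching stands in for the paper's vector search, and a naive period split stands in for NLTK Punkt.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    producer: str          # who produced the message
    timestamp: float
    session_id: str
    text: str              # raw message content, kept verbatim (ground truth)
    metadata: dict = field(default_factory=dict)

def segment(episode: Episode) -> list[str]:
    # Step 2: split an episode into sentences (Punkt in the paper).
    return [s.strip() for s in episode.text.split(".") if s.strip()]

def ingest(episode: Episode, index: list) -> None:
    # Steps 1-2: index at sentence level, but keep a back-link to the
    # full episode so retrieval can return exact conversational context.
    for sentence in segment(episode):
        index.append((sentence, episode))

def retrieve(query: str, index: list, episodes: list, window: int = 1) -> list:
    # Step 3: find "nucleus" episodes whose sentences match the query...
    nuclei = {id(ep) for sentence, ep in index if query.lower() in sentence.lower()}
    # ...then expand each nucleus with neighboring episodes into a cluster.
    cluster = []
    for i, ep in enumerate(episodes):
        neighborhood = range(max(0, i - window), min(len(episodes), i + window + 1))
        if any(id(episodes[j]) in nuclei for j in neighborhood):
            cluster.append(ep)
    return cluster  # the real system reranks clusters before answering
```

Because sentences point back to whole episodes, a match on one sentence surfaces the surrounding conversation verbatim rather than an extracted summary, which is the ground-truth-preserving property the paper emphasizes.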

KEY CONTRIBUTIONS

Key Contributions

  • Ground-truth-preserving architecture

    MemMachine stores raw conversational episodes and indexes them at the sentence level, avoiding per-message fact extraction and cutting input tokens by about 80% versus Mem0 on LoCoMo.

  • Contextualized retrieval

    MemMachine introduces contextualized retrieval that expands nucleus matches with neighboring episode context, then reranks clusters, improving multi-hop and temporal reasoning accuracy.

  • Retrieval Agent for multi-hop reasoning

    MemMachine adds the Retrieval Agent with ToolSelectAgent, SplitQuery, and ChainOfQuery, reaching 93.2% on HotpotQA hard and 92.6% on WikiMultiHop under randomized noise.
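The Retrieval Agent's dispatch can be caricatured as follows. The strategy names (direct search, SplitQuery, ChainOfQuery) come from the paper, but the keyword heuristics below are a hand-written stand-in for ToolSelectAgent, which the paper drives with an LLM, and `search` is any callable returning a hit or `None`.

```python
def split_query(query: str) -> list[str]:
    # SplitQuery: break a compound question into independent sub-queries.
    return [part.strip() for part in query.split(" and ") if part.strip()]

def chain_of_query(query: str, search, hops: int = 2) -> list[str]:
    # ChainOfQuery: feed each hop's result back into the next search,
    # so later hops can build on facts recovered by earlier ones.
    context, q = [], query
    for _ in range(hops):
        hit = search(q)
        if hit is None or hit in context:
            break
        context.append(hit)
        q = f"{query} {hit}"
    return context

def route(query: str, search) -> list[str]:
    # Stand-in for ToolSelectAgent: pick a strategy from surface cues.
    if " and " in query:
        return [hit for sub in split_query(query)
                if (hit := search(sub)) is not None]
    if any(cue in query.lower() for cue in ("before", "after", "then")):
        return chain_of_query(query, search)
    hit = search(query)          # direct search for simple lookups
    return [hit] if hit is not None else []
```

The point of the routing layer is that simple lookups skip the extra LLM calls, while compound and temporal questions pay for decomposition only when they need it.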

RESULTS

By the Numbers

LoCoMo overall score (gpt-4.1-mini)

91.69%

+13.79 over OpenAI baseline

LongMemEvalS ablation best

93.0%

+7.0 over MemMachine baseline C5

HotpotQA hard Retrieval Agent accuracy

93.2%

+2.0 over MemMachine declarative search

Mem0 comparison input tokens

~80% less

MemMachine vs Mem0 on LoCoMo memory mode

On LoCoMo, which tests very long-term conversational memory, MemMachine reaches 91.69% with gpt-4.1-mini while using about 80% fewer input tokens than Mem0. On LongMemEvalS, which probes extraction, temporal reasoning, updates, and multi-session reasoning, MemMachine’s best configuration reaches 93.0%, showing that ground-truth episodic storage plus tuned retrieval yields strong long-horizon recall.

BENCHMARK

LoCoMo benchmark comparison across AI agent memory systems

LLM Judge Score on LoCoMo (overall).

BENCHMARK

LongMemEvalS configuration sweep in MemMachine

Overall LLM score on LongMemEvalS across key MemMachine configurations.

KEY INSIGHT

The Counterintuitive Finding

MemMachine finds that GPT-5-mini beats GPT-5 by 2.6 percentage points on LongMemEvalS when paired with the Edwin3 prompt and tuned retrieval.

This is counterintuitive because larger models are usually assumed strictly better, but MemMachine shows prompt–model co-optimization can make a smaller, cheaper model superior.

WHY IT MATTERS

What this unlocks for the field

MemMachine unlocks cost-efficient, ground-truth-preserving long-term memory with contextualized retrieval and multi-hop-aware routing that scale across sessions and benchmarks.

Builders can now deploy personalized agents that remember exact past interactions, handle complex multi-hop questions, and stay within tight token budgets using smaller LLMs like GPT-5-mini.

Related papers

RAG

Memory as Metabolism: A Design for Companion Knowledge Systems

Stefan Miteski

· 2026

Memory as Metabolism defines companion knowledge systems with five retention operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) plus memory gravity and minority-hypothesis retention over a raw buffer, active wiki, and cold memory. Instead of benchmark gains, Memory as Metabolism’s main result is a governance specification that separates descriptive, taxonomic, and normative claims and predicts improved coherence stability, fragility resistance, monoculture resistance, and effective minority-hypothesis influence for companion wikis.

RAG

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du

· 2026

Memory for Autonomous LLM Agents decomposes agent memory into a POMDP-grounded write–manage–read loop, a three-dimensional taxonomy, and five mechanism families spanning context compression, retrieval stores, reflection, hierarchical virtual context, and policy-learned management. Memory for Autonomous LLM Agents synthesizes results like Voyager’s 15.3× tech-tree speedup and MemoryArena’s 80%→45% drop to show that memory architecture often matters more than backbone choice.
