Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

Authors: Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio

2025

TL;DR

Mnemosyne uses unsupervised graph-structured memory with probabilistic decay and a core summary, reaching a 65.8% human win rate and a 53.03% temporal reasoning score on LoCoMo.



THE PROBLEM

Edge Healthcare Agents Lose Long-Term Context Beyond Small Windows

LLMs on edge devices are constrained to context windows of roughly 4k tokens, while long-context models require windows of up to 1M tokens and incur quadratic attention costs.

In longitudinal healthcare assistants, this causes predictable memory loss, where repetitive but temporally distinct conversations are mis-retrieved, degrading realism and temporal reasoning.

HOW IT WORKS

Mnemosyne — Graph Memory with Decay, Rewind, and Core Summary

Mnemosyne’s core mechanism wires a Commitment pipeline with Substance Filter, Redundancy Filter, graph Construction and Decay Calculation, probabilistic Recall, and a deep Core Summary module.

You can think of Mnemosyne like a brain-inspired card catalog: the graph is the catalog, the Core Summary is the self-concept, and decay and rewind act like forgetting and reinforcement.

By encoding temporal decay, boosts, and naturalized time deltas directly into the graph, Mnemosyne retrieves time-aware memories that a plain context window or naive RAG pipeline cannot represent.
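As a minimal sketch of the idea above, the snippet below combines semantic similarity with an exponential temporal decay and a reinforcement boost, and renders time deltas as coarse natural language. All function names, the half-life parameter, and the scoring form are hypothetical illustrations, not the paper's implementation.

```python
import math

def naturalize_delta(seconds: float) -> str:
    """Render a time delta as coarse natural language (hypothetical helper)."""
    days = seconds / 86400
    if days < 1:
        return "earlier today"
    if days < 7:
        return f"{int(days)} days ago"
    if days < 30:
        return f"{int(days // 7)} weeks ago"
    return f"{int(days // 30)} months ago"

def recall_score(similarity: float, age_seconds: float, boost: float,
                 half_life_days: float = 30.0) -> float:
    """Semantic similarity attenuated by exponential temporal decay,
    offset by an accumulated reinforcement boost."""
    decay = math.exp(-math.log(2) * age_seconds / (half_life_days * 86400))
    return similarity * decay + boost
```

With a 30-day half-life, a memory's similarity contribution halves every month unless recall boosts keep it alive, which is one simple way a graph node can carry time-awareness that a flat context window cannot.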

DIAGRAM

Probabilistic Recall and Temporal Decay Flow

This diagram shows how Mnemosyne selects a start node and traverses the memory graph with decay and rewind during recall.

DIAGRAM

LoCoMo Benchmarking and Human Evaluation Pipeline

This diagram shows how Mnemosyne is evaluated on LoCoMo scenarios and in blind human studies against naive RAG and ablated variants.

PROCESS

How Mnemosyne Handles a Longitudinal Healthcare Session

  1. Commitment

    Mnemosyne ingests interaction summaries and runs the Substance Filter and Redundancy Filter before adding nodes and edges to the memory graph with decay parameters.

  2. Recall

    On a new query, Mnemosyne selects a start node using hypothetical queries and naturalized time deltas, then traverses the graph with probabilistic decay and rewind.

  3. Core Summary

    Mnemosyne periodically selects a fixed-length subset of central nodes, scores them on connectivity, boost, recency, and entropy, and updates the core summary (a supersummary).

  4. Pruning

    When memory limits approach, Mnemosyne scores nodes by selection probability (without exploration) and prunes low-scoring nodes while keeping the graph's edges consistent.
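The recall step above can be sketched as a probabilistic walk over the memory graph: neighbors are sampled in proportion to their decay-weighted scores, and each recalled node receives a "rewind" boost that refreshes it against future decay. This is an illustrative sketch under assumed data structures (a dict-of-lists graph and a score table), not the paper's code.

```python
import math
import random

def traverse(graph, start, scores, steps=5, rewind_boost=0.2, seed=0):
    """Probabilistic recall: walk the graph from a start node, sampling
    neighbors softmax-style by score; every visited node is boosted
    ('rewind'), modeling reinforcement of recalled memories."""
    rng = random.Random(seed)
    path, node = [start], start
    scores[start] += rewind_boost          # recalling reinforces the memory
    for _ in range(steps):
        neighbors = graph.get(node, [])
        if not neighbors:
            break                          # dead end: stop the walk
        weights = [math.exp(scores[n]) for n in neighbors]
        node = rng.choices(neighbors, weights=weights, k=1)[0]
        scores[node] += rewind_boost
        path.append(node)
    return path
```

Pruning could then reuse the same selection probabilities: nodes whose weights stay near the bottom of the distribution without exploration are the first candidates for removal.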

KEY CONTRIBUTIONS

Key Contributions

  • Unsupervised Graph Memory Architecture

    Mnemosyne introduces an unsupervised, graph-structured memory with Commitment, Recall, Core Summary, and Pruning modules that runs fully on edge devices using Redis and PubMedBERT embeddings.

  • Human-Inspired Temporal Dynamics

    Mnemosyne implements probabilistic recall with temporal decay and a rewind boosting function, modeling primacy, recency, and forgetting curves to refresh redundant memories over time.

  • Core Summary for Persona Memory

    Mnemosyne adds a fixed-length core summary (a supersummary) that captures patient personality and long-term clinical details, achieving a 65.8% win rate versus 31.07% for naive RAG in blind human evaluations.
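The core summary selection described above can be sketched as a node-scoring pass that blends connectivity, accumulated boost, recency, and a content-entropy term, keeping only a fixed number of top nodes. Field names, weights, and the character-level entropy proxy are all assumptions for illustration, not the paper's scoring function.

```python
import math

def core_summary_candidates(nodes, k=5, now=1000.0):
    """Rank memory nodes for the fixed-length core summary by combining
    degree (connectivity), boost, recency, and text entropy."""
    def entropy(text):
        # Shannon entropy over characters: a cheap diversity proxy
        freqs = {c: text.count(c) / len(text) for c in set(text)}
        return -sum(p * math.log2(p) for p in freqs.values())

    def score(n):
        recency = math.exp(-(now - n["last_seen"]) / 500.0)
        return n["degree"] + n["boost"] + recency + 0.1 * entropy(n["text"])

    return sorted(nodes, key=score, reverse=True)[:k]
```

A highly connected, frequently boosted, recently touched node with varied content (e.g. a stable patient preference) outranks an isolated, stale, repetitive one, which is the behavior a persona-level supersummary needs.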

RESULTS

By the Numbers

Single-Hop: 62.78% (+2.95 over Memory-R1)

Multi-Hop: 49.53% (vs. Memory-R1's 53.01%)

Temporal Reasoning: 53.03% (+1.48 over Memory-R1)

Overall: 54.55% (second to Memory-R1's 62.74%)

On the LoCoMo benchmark, which tests single-hop, multi-hop, open-domain, and temporal reasoning, Mnemosyne achieves a 62.78% single-hop and a 53.03% temporal reasoning J score, with a 54.55% overall score. These results show that Mnemosyne improves temporal reasoning and single-hop recall for edge-compatible Llama3.1-8B systems while remaining competitive overall with Memory-R1.


BENCHMARK

LoCoMo Benchmark (J-score) of Mnemosyne compared to past related methods

Overall (%) J-score on LoCoMo for Llama3.1-8B-Instruct based memory systems.

KEY INSIGHT

The Counterintuitive Finding

Mnemosyne with the core summary wins 65.8% of human preference comparisons, while naive RAG wins only 31.07% despite similar factual retrieval capabilities.

This is surprising because many assume better multi-hop retrieval is required for naturalness, yet Mnemosyne shows that persona-level supersummaries and temporal dynamics matter more than perfect long-horizon recall.

WHY IT MATTERS

What this unlocks for the field

Mnemosyne unlocks long-horizon, human-like memory for edge-based healthcare assistants by combining unsupervised graph memory, temporal decay, rewind, and core summaries.

Builders can now deploy on-device agents that track patient journeys over months, remember preferences and attitudes, and respond naturally without massive context windows or supervised RL policies.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
