MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Author: Stefano Zeppieri

2025

TL;DR

MMAG uses a five-layer mixed memory controller to coordinate conversational, long-term user, episodic, sensory, and working memories, yielding a 20% increase in user retention and a 30% increase in conversation duration in Heero.


THE PROBLEM

LLM agents reset every session and lose continuity across interactions

LLMs excel within a single prompt but “fall short in sustaining relevance, personalization, and continuity across extended interactions.” Without memory, interactions remain shallow and short-lived.

This failure breaks language learning agents like Heero, which need long-term user and conversational memory to maintain goals, progress, and motivation over weeks of practice.

HOW IT WORKS

Mixed Memory-Augmented Generation pattern

MMAG’s core mechanism layers conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory under a modular controller.

You can think of MMAG like a brain plus notebook: working memory is RAM, long-term user memory is a biographical card catalog, and episodic memories are timestamped diary entries.

This layered design lets MMAG coordinate multiple specialized memories, enabling proactive reminders and personalized tone that a plain context window cannot provide.
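To make the layering concrete, here is a minimal sketch of how the five layers could sit under one controller. The class and field names are illustrative assumptions; the paper specifies a modular controller but not this exact structure.

```python
# Hypothetical layer and class names for illustration only; MMAG defines a
# modular controller, not necessarily this exact shape.
from dataclasses import dataclass, field


@dataclass
class MemoryLayers:
    conversational: list[dict] = field(default_factory=list)  # chronological turns
    long_term_user: dict = field(default_factory=dict)        # biographical facts
    episodic: list[dict] = field(default_factory=list)        # timestamped events
    sensory: dict = field(default_factory=dict)               # device/context signals
    working: list[dict] = field(default_factory=list)         # pruned prompt window


class MixedMemoryController:
    """Routes a query through specialized layers instead of one flat store."""

    def __init__(self, layers: MemoryLayers):
        self.layers = layers

    def gather_context(self, query: str) -> dict:
        # Each layer contributes its own slice of context; the controller
        # decides what reaches the prompt rather than dumping everything.
        return {
            "bio": self.layers.long_term_user,
            "recent_turns": self.layers.working,
            "events": [e for e in self.layers.episodic if e.get("active")],
            "environment": self.layers.sensory,
            "query": query,
        }
```

The point of the design is that each layer keeps its own storage and retrieval policy, so the controller can pull a biographical fact and a recent turn without collapsing them into a single flat store.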

DIAGRAM

Heero Mixed Memory Inference Flow

This diagram shows how MMAG routes a Heero user query through conversational history, encrypted long-term bios, and working memory before generating a response.

DIAGRAM

MMAG Evaluation and User Study Pipeline

This diagram shows how MMAG is deployed in Heero and how user retention, conversation duration, and latency are measured.

PROCESS

How MMAG Handles a Heero Conversation Session

  1. Memory Storage and Retrieval Architecture

    MMAG uses the Memory interface to persist conversation turns into Firestore and retrieve chronological histories while enforcing a 90k-token pruning threshold (a combined sketch of steps 1-3 follows this list).

  2. Token-Based Pruning Mechanism

    MMAG prunes conversational context beyond 90k tokens, approximating short-term working memory and preventing oversized prompts during Heero sessions.

  3. Prompt Engineering for Memory Referencing

    MMAG converts stored turns into OpenAI message roles and injects long-term user memory as structured system messages before dialogue content.

  4. Privacy, Security, and User Control

    MMAG encrypts user bios with envelope encryption, compresses them, and stores them in private S3 buckets to protect long-term user memory (see the encryption sketch below).
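A minimal sketch of steps 1-3 under stated assumptions: a Firestore layout of conversations/{user_id}/turns, a rough 4-characters-per-token heuristic, and illustrative field names. None of these are confirmed details of Heero's implementation.

```python
# Sketch of the Memory interface: persist turns to Firestore, retrieve them
# chronologically, prune to the 90k-token budget, and assemble OpenAI-style
# messages. Collection layout and field names are assumptions.
from google.cloud import firestore

TOKEN_BUDGET = 90_000  # pruning threshold from the paper


def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token); a real tokenizer such as
    # tiktoken would be used in practice.
    return len(text) // 4


class Memory:
    def __init__(self, user_id: str):
        self.turns = (
            firestore.Client()
            .collection("conversations")
            .document(user_id)
            .collection("turns")
        )

    def store(self, role: str, content: str) -> None:
        # Persist one conversation turn with a server timestamp so that
        # retrieval stays chronological (step 1).
        self.turns.add({
            "role": role,
            "content": content,
            "ts": firestore.SERVER_TIMESTAMP,
        })

    def retrieve(self) -> list[dict]:
        # Load turns oldest-first, then drop from the front until the
        # history fits the 90k-token working-memory budget (step 2).
        history = [d.to_dict() for d in self.turns.order_by("ts").stream()]
        while sum(approx_tokens(t["content"]) for t in history) > TOKEN_BUDGET:
            history.pop(0)  # oldest turns are pruned first
        return history


def build_messages(memory: Memory, user_bio: str, query: str) -> list[dict]:
    # Step 3: long-term user memory goes in as a structured system message
    # before the dialogue content, then the pruned history, then the query.
    messages = [{"role": "system", "content": f"Known user profile:\n{user_bio}"}]
    messages += [{"role": t["role"], "content": t["content"]}
                 for t in memory.retrieve()]
    messages.append({"role": "user", "content": query})
    return messages
```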
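For step 4, a hedged sketch of envelope-encrypting a compressed user bio before it lands in a private S3 bucket, using AWS KMS to wrap the data key. The key alias, bucket name, and object layout are assumptions, not Heero's actual configuration; the sketch also compresses before encrypting, the conventional ordering.

```python
# Envelope encryption sketch: a KMS-issued data key encrypts the bio, and
# the KMS master key encrypts (wraps) the data key. Names are illustrative.
import gzip
import os

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
s3 = boto3.client("s3")


def store_user_bio(user_id: str, bio: str) -> None:
    # 1. Ask KMS for a fresh data key; it returns both the plaintext key
    #    and a ciphertext blob wrapped by the master key.
    key = kms.generate_data_key(KeyId="alias/heero-user-memory",
                                KeySpec="AES_256")

    # 2. Compress, then encrypt the bio with the plaintext data key.
    nonce = os.urandom(12)
    ciphertext = AESGCM(key["Plaintext"]).encrypt(
        nonce, gzip.compress(bio.encode()), None
    )

    # 3. Store only the encrypted payload and the *wrapped* data key;
    #    the plaintext key never leaves this process.
    s3.put_object(
        Bucket="heero-private-user-memory",
        Key=f"bios/{user_id}",
        Body=nonce + ciphertext,
        Metadata={"wrapped-key": key["CiphertextBlob"].hex()},
    )
```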

KEY CONTRIBUTIONS


  • Mixed Memory-Augmented Generation pattern

    MMAG introduces a five-layer taxonomy combining conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory for LLM agents.

  • Memory Storage and Retrieval Architecture

    MMAG defines a modular Memory interface backed by Firestore and S3, with token-based pruning around 90k tokens to manage conversational histories efficiently.

  • User Studies in Heero

    MMAG’s deployment in Heero shows a 20% increase in user retention and a 30% increase in average conversation duration after enabling memory-based conversations.

RESULTS

By the Numbers

  • User retention: +20% over pre-MMAG Heero

  • Conversation duration: +30% over pre-MMAG Heero

  • Context threshold: 90k tokens (limit for conversational history)

  • Latency: no increase (same range as before memory integration)

The Heero language learning platform evaluates MMAG on real user interactions, measuring retention, conversation duration, and latency. These results show that MMAG improves engagement while keeping response times stable.


BENCHMARK

Impact of Memory-Based Conversations in Heero

Relative improvement in user retention and conversation duration after enabling MMAG-based memory in Heero.

KEY INSIGHT

The Counterintuitive Finding

MMAG adds multiple memory layers in Heero yet “average response latency remained within the same range as before memory integration.”

This is surprising because developers usually expect richer memory retrieval to slow responses, but MMAG’s asynchronous updates and caching avoid the typical latency penalty.
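A minimal sketch of how that can work: the only awaited call on the response path is the model itself, while memory persistence runs as a background task against a per-session cache. Function names and timings are illustrative, not Heero's implementation.

```python
# Asynchronous memory updates: the user waits only for the LLM call, while
# the Firestore write is fired in the background. Stubs stand in for the
# real model and storage calls.
import asyncio

_history_cache: dict[str, list[dict]] = {}  # session_id -> cached turns


async def call_llm(history: list[dict], query: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the model call
    return f"reply to: {query}"


async def persist_turns(session_id: str, query: str, reply: str) -> None:
    await asyncio.sleep(0.5)  # stand-in for the slower storage write


async def respond(session_id: str, query: str) -> str:
    history = _history_cache.setdefault(session_id, [])
    reply = await call_llm(history, query)  # only this blocks the user

    # Fire-and-forget: persistence runs in the background, so the storage
    # write never sits on the response path. In a long-running service the
    # event loop stays alive and the background task completes normally.
    asyncio.create_task(persist_turns(session_id, query, reply))
    history += [{"role": "user", "content": query},
                {"role": "assistant", "content": reply}]
    return reply


if __name__ == "__main__":
    print(asyncio.run(respond("demo-session", "Remind me of my weekly goal")))
```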

WHY IT MATTERS

What this unlocks for the field

MMAG unlocks memory-rich agents that coordinate conversational, biographical, episodic, sensory, and working memories without collapsing everything into a single flat store.

Builders can now design LLM applications that remember user goals over weeks, surface timely event reminders, and stay context-aware while still respecting privacy and latency constraints.


