MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Author: Stefano Zeppieri

2025

TL;DR

MMAG uses a five-layer mixed memory controller to coordinate conversational, long-term user, episodic, sensory, and working memories, yielding a 20% increase in user retention and a 30% increase in conversation duration in Heero.


THE PROBLEM

LLM agents reset every session and lose continuity across interactions

LLMs excel within a single prompt but “fall short in sustaining relevance, personalization, and continuity across extended interactions.” Without memory, interactions remain shallow and short-lived.

This failure breaks language learning agents like Heero, which need long-term user and conversational memory to maintain goals, progress, and motivation over weeks of practice.

HOW IT WORKS

Mixed Memory-Augmented Generation pattern

MMAG’s core mechanism layers conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory under a modular controller.

You can think of MMAG like a brain plus notebook: working memory is RAM, long-term user memory is a biographical card catalog, and episodic memories are timestamped diary entries.

This layered design lets MMAG coordinate multiple specialized memories, enabling proactive reminders and personalized tone that a plain context window cannot provide.
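To make the layering concrete, here is a minimal sketch of how the five layers could sit under one controller. The class and field names are illustrative assumptions; the paper specifies a modular controller but not this exact structure.

```python
# Hypothetical layer and class names for illustration only; MMAG defines a
# modular controller, not necessarily this exact shape.
from dataclasses import dataclass, field


@dataclass
class MemoryLayers:
    conversational: list[dict] = field(default_factory=list)  # chronological turns
    long_term_user: dict = field(default_factory=dict)        # biographical facts
    episodic: list[dict] = field(default_factory=list)        # timestamped events
    sensory: dict = field(default_factory=dict)               # device/context signals
    working: list[dict] = field(default_factory=list)         # pruned prompt window


class MixedMemoryController:
    """Routes a query through specialized layers instead of one flat store."""

    def __init__(self, layers: MemoryLayers):
        self.layers = layers

    def gather_context(self, query: str) -> dict:
        # Each layer contributes its own slice of context; the controller
        # decides what reaches the prompt rather than dumping everything.
        return {
            "bio": self.layers.long_term_user,
            "recent_turns": self.layers.working,
            "events": [e for e in self.layers.episodic if e.get("active")],
            "environment": self.layers.sensory,
            "query": query,
        }
```

The point of the design is that each layer keeps its own storage and retrieval policy, so the controller can pull a biographical fact and a recent turn without collapsing them into a single flat store.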

DIAGRAM

Heero Mixed Memory Inference Flow

This diagram shows how MMAG routes a Heero user query through conversational history, encrypted long-term bios, and working memory before generating a response.

DIAGRAM

MMAG Evaluation and User Study Pipeline

This diagram shows how MMAG is deployed in Heero and how user retention, conversation duration, and latency are measured.

PROCESS

How MMAG Handles a Heero Conversation Session

  1. Memory Storage and Retrieval Architecture

    MMAG uses the Memory interface to persist conversation turns into Firestore and retrieve chronological histories while enforcing a 90k-token pruning threshold (a combined sketch of steps 1-3 follows this list).

  2. Token-Based Pruning Mechanism

    MMAG prunes conversational context beyond 90k tokens, approximating short-term working memory and preventing oversized prompts during Heero sessions.

  3. Prompt Engineering for Memory Referencing

    MMAG converts stored turns into OpenAI message roles and injects long-term user memory as structured system messages before dialogue content.

  4. Privacy, Security, and User Control

    MMAG encrypts user bios with envelope encryption, compresses them, and stores them in private S3 buckets to protect long-term user memory (see the encryption sketch below).
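A minimal sketch of steps 1-3 under stated assumptions: a Firestore layout of conversations/{user_id}/turns, a rough 4-characters-per-token heuristic, and illustrative field names. None of these are confirmed details of Heero's implementation.

```python
# Sketch of the Memory interface: persist turns to Firestore, retrieve them
# chronologically, prune to the 90k-token budget, and assemble OpenAI-style
# messages. Collection layout and field names are assumptions.
from google.cloud import firestore

TOKEN_BUDGET = 90_000  # pruning threshold from the paper


def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token); a real tokenizer such as
    # tiktoken would be used in practice.
    return len(text) // 4


class Memory:
    def __init__(self, user_id: str):
        self.turns = (
            firestore.Client()
            .collection("conversations")
            .document(user_id)
            .collection("turns")
        )

    def store(self, role: str, content: str) -> None:
        # Persist one conversation turn with a server timestamp so that
        # retrieval stays chronological (step 1).
        self.turns.add({
            "role": role,
            "content": content,
            "ts": firestore.SERVER_TIMESTAMP,
        })

    def retrieve(self) -> list[dict]:
        # Load turns oldest-first, then drop from the front until the
        # history fits the 90k-token working-memory budget (step 2).
        history = [d.to_dict() for d in self.turns.order_by("ts").stream()]
        while sum(approx_tokens(t["content"]) for t in history) > TOKEN_BUDGET:
            history.pop(0)  # oldest turns are pruned first
        return history


def build_messages(memory: Memory, user_bio: str, query: str) -> list[dict]:
    # Step 3: long-term user memory goes in as a structured system message
    # before the dialogue content, then the pruned history, then the query.
    messages = [{"role": "system", "content": f"Known user profile:\n{user_bio}"}]
    messages += [{"role": t["role"], "content": t["content"]}
                 for t in memory.retrieve()]
    messages.append({"role": "user", "content": query})
    return messages
```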
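For step 4, a hedged sketch of envelope-encrypting a compressed user bio before it lands in a private S3 bucket, using AWS KMS to wrap the data key. The key alias, bucket name, and object layout are assumptions, not Heero's actual configuration; the sketch also compresses before encrypting, the conventional ordering.

```python
# Envelope encryption sketch: a KMS-issued data key encrypts the bio, and
# the KMS master key encrypts (wraps) the data key. Names are illustrative.
import gzip
import os

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
s3 = boto3.client("s3")


def store_user_bio(user_id: str, bio: str) -> None:
    # 1. Ask KMS for a fresh data key; it returns both the plaintext key
    #    and a ciphertext blob wrapped by the master key.
    key = kms.generate_data_key(KeyId="alias/heero-user-memory",
                                KeySpec="AES_256")

    # 2. Compress, then encrypt the bio with the plaintext data key.
    nonce = os.urandom(12)
    ciphertext = AESGCM(key["Plaintext"]).encrypt(
        nonce, gzip.compress(bio.encode()), None
    )

    # 3. Store only the encrypted payload and the *wrapped* data key;
    #    the plaintext key never leaves this process.
    s3.put_object(
        Bucket="heero-private-user-memory",
        Key=f"bios/{user_id}",
        Body=nonce + ciphertext,
        Metadata={"wrapped-key": key["CiphertextBlob"].hex()},
    )
```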

KEY CONTRIBUTIONS


  • Mixed Memory-Augmented Generation pattern

    MMAG introduces a five-layer taxonomy combining conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory for LLM agents.

  • Memory Storage and Retrieval Architecture

    MMAG defines a modular Memory interface backed by Firestore and S3, with token-based pruning around 90k tokens to manage conversational histories efficiently.

  • User Studies in Heero

    MMAG’s deployment in Heero shows a 20% increase in user retention and a 30% increase in average conversation duration after enabling memory-based conversations.

RESULTS

By the Numbers

  • User retention: +20% over pre-MMAG Heero

  • Conversation duration: +30% over pre-MMAG Heero

  • Context threshold: 90k tokens (limit for conversational history)

  • Latency: no increase (same range as before memory integration)

The Heero language learning platform evaluates MMAG on real user interactions, measuring retention, conversation duration, and latency. These results show that MMAG improves engagement while keeping response times stable.


BENCHMARK

Impact of Memory-Based Conversations in Heero

Relative improvement in user retention and conversation duration after enabling MMAG-based memory in Heero.

KEY INSIGHT

The Counterintuitive Finding

MMAG adds multiple memory layers in Heero yet “average response latency remained within the same range as before memory integration.”

This is surprising because developers usually expect richer memory retrieval to slow responses, but MMAG’s asynchronous updates and caching avoid the typical latency penalty.
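A minimal sketch of how that can work: the only awaited call on the response path is the model itself, while memory persistence runs as a background task against a per-session cache. Function names and timings are illustrative, not Heero's implementation.

```python
# Asynchronous memory updates: the user waits only for the LLM call, while
# the Firestore write is fired in the background. Stubs stand in for the
# real model and storage calls.
import asyncio

_history_cache: dict[str, list[dict]] = {}  # session_id -> cached turns


async def call_llm(history: list[dict], query: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the model call
    return f"reply to: {query}"


async def persist_turns(session_id: str, query: str, reply: str) -> None:
    await asyncio.sleep(0.5)  # stand-in for the slower storage write


async def respond(session_id: str, query: str) -> str:
    history = _history_cache.setdefault(session_id, [])
    reply = await call_llm(history, query)  # only this blocks the user

    # Fire-and-forget: persistence runs in the background, so the storage
    # write never sits on the response path. In a long-running service the
    # event loop stays alive and the background task completes normally.
    asyncio.create_task(persist_turns(session_id, query, reply))
    history += [{"role": "user", "content": query},
                {"role": "assistant", "content": reply}]
    return reply


if __name__ == "__main__":
    print(asyncio.run(respond("demo-session", "Remind me of my weekly goal")))
```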

WHY IT MATTERS

What this unlocks for the field

MMAG unlocks memory-rich agents that coordinate conversational, biographical, episodic, sensory, and working memories without collapsing everything into a single flat store.

Builders can now design LLM applications that remember user goals over weeks, surface timely event reminders, and stay context-aware while still respecting privacy and latency constraints.


