MIRIX: Multi-Agent Memory System for LLM-Based Agents

Authors: Yu Wang, Xi Chen

2025

TL;DR

MIRIX uses six specialized memory components coordinated by a Meta Memory Manager and Active Retrieval to reach 85.38% on LOCOMO and 0.5950 accuracy on ScreenshotVQA with 99.9% less storage than RAG.



THE PROBLEM

Flat text memories fail at multimodal, long-term personalization (35% accuracy gap and huge storage overhead)

Existing memory systems rely on flat, text-centric stores that cannot scale to multimodal histories of 5,000–20,000 high-resolution screenshots.

On ScreenshotVQA, a RAG baseline like SigLIP@50 needs up to 22.55GB storage and still only reaches 0.4410 accuracy, limiting real-world, long-term assistants.

HOW IT WORKS

MIRIX — Six Memory Components with Multi-Agent Control

MIRIX introduces six coordinated memories: Core Memory, Episodic Memory, Semantic Memory, Procedural Memory, Resource Memory, and Knowledge Vault, each managed by its own Memory Manager plus a Meta Memory Manager.

Think of MIRIX like a brain with specialized regions and a central executive, rather than one giant notebook; each region stores a different kind of experience or fact.

This design lets MIRIX route, abstract, and retrieve multimodal experiences via Active Retrieval in ways a plain context window or single vector store cannot.
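
The six-component layout above can be sketched as a small data model. This is an illustrative assumption, not MIRIX's actual code: the component names follow the paper, but the fields and the explicit routing label stand in for the LLM-driven routing decision made by the real Meta Memory Manager.

```python
from dataclasses import dataclass, field

# The six component names mirror the paper; everything else is a sketch.
MEMORY_TYPES = [
    "core", "episodic", "semantic",
    "procedural", "resource", "knowledge_vault",
]

@dataclass
class MemoryStore:
    kind: str
    entries: list = field(default_factory=list)

    def add(self, entry: str) -> None:
        self.entries.append(entry)

class MetaMemoryManager:
    """Routes incoming content to the appropriate component store."""

    def __init__(self):
        self.stores = {k: MemoryStore(k) for k in MEMORY_TYPES}

    def route(self, entry: str, kind: str) -> None:
        # In MIRIX the routing decision is made by an LLM-based manager;
        # here we accept an explicit label for illustration.
        self.stores[kind].add(entry)

manager = MetaMemoryManager()
manager.route("User prefers dark mode", "core")
manager.route("Opened invoice.pdf at 10:42", "episodic")
print(sorted(manager.stores))  # prints the six component names, sorted
```

The point of the typed split is that each store can use a schema and consolidation policy suited to its content, instead of one flat vector index.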

DIAGRAM

Active Retrieval Flow for MIRIX Chat Agent

This diagram shows how MIRIX uses topic generation and multi-component retrieval before answering a user query.

DIAGRAM

ScreenshotVQA Evaluation Pipeline for MIRIX

This diagram shows how MIRIX collects screenshots, builds memory, and answers ScreenshotVQA questions compared to baselines.

PROCESS

How MIRIX Handles a Conversation Session

  1. Memory Update Workflow

    MIRIX first runs a search over all six components, then the Meta Memory Manager routes new content to Core Memory, Episodic Memory, Semantic Memory, Procedural Memory, Resource Memory, and Knowledge Vault.

  2. Active Retrieval

    Before answering, MIRIX forces the Chat Agent to generate a topic and retrieve top entries from each memory component using embedding_match, bm25_match, or string_match.

  3. Conversational Retrieval Workflow

    The Chat Agent performs a coarse search across all memories, then selectively issues targeted retrievals to specific Memory Managers for detailed information.

  4. Controlled Rewrite and Consolidation

    When Core Memory nears 90 percent capacity, MIRIX rewrites the persona and human blocks, and the other managers consolidate events into higher-level entries for efficient long-term storage.
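
The retrieval and consolidation steps above can be sketched as follows. This is a toy stand-in, not MIRIX's implementation: `string_match` here is a naive substring filter standing in for the paper's embedding_match, bm25_match, and string_match options, and `maybe_consolidate` stubs the 90-percent-capacity rewrite with a placeholder summary.

```python
def string_match(query: str, entries: list[str], k: int = 2) -> list[str]:
    # Naive substring scoring; a placeholder for MIRIX's match functions.
    hits = [e for e in entries if query.lower() in e.lower()]
    return hits[:k]

def retrieve(topic: str, memories: dict[str, list[str]], k: int = 2) -> dict[str, list[str]]:
    """Coarse pass: pull top-k entries from every memory component for a topic."""
    return {name: string_match(topic, entries, k) for name, entries in memories.items()}

def maybe_consolidate(core: list[str], capacity: int) -> list[str]:
    # When the store nears 90% capacity, rewrite entries into a summary
    # (stubbed here; MIRIX uses an LLM-driven rewrite).
    if len(core) >= 0.9 * capacity:
        return [f"<summary of {len(core)} entries>"]
    return core

memories = {
    "episodic": ["Discussed travel plans on Monday", "Booked flight to Tokyo"],
    "semantic": ["User's home city is Boston"],
}
print(retrieve("travel", memories))
```

In the real system the Chat Agent first generates the topic itself, then issues targeted follow-up retrievals to individual Memory Managers when the coarse pass is not specific enough.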

KEY CONTRIBUTIONS

Key Contributions

  • Six specialized memory components and eight agents

    MIRIX defines Core Memory, Episodic Memory, Semantic Memory, Procedural Memory, Resource Memory, and Knowledge Vault, each controlled by dedicated Memory Managers and a Meta Memory Manager.

  • ScreenshotVQA multimodal benchmark

    MIRIX introduces ScreenshotVQA with up to 18,178 screenshots per user and shows 0.5950 accuracy vs 0.4410 for SigLIP@50 while cutting storage from 15.07GB to 15.89MB.

  • State of the art on LOCOMO

    MIRIX reaches 85.38 percent overall LLM-as-a-Judge accuracy on LOCOMO, improving over Zep at 79.09 percent by 6.29 points using gpt-4.1-mini.

RESULTS

By the Numbers

Overall Accuracy ScreenshotVQA

0.5950

+0.1540 over SigLIP@50

Storage ScreenshotVQA

15.89MB

-99.9% vs SigLIP@50 15.07GB

Overall LOCOMO

85.38%

+6.29 over Zep gpt-4.1-mini

Single Hop LOCOMO

85.11%

-3.42 vs Full-Context 88.53%

On ScreenshotVQA, which tests multimodal memory over up to 20,000 screenshots, MIRIX lifts accuracy from SigLIP@50's 0.4410 to 0.5950 while shrinking storage from 15.07GB to 15.89MB. On LOCOMO long-form conversations, MIRIX achieves 85.38 percent overall vs 79.09 percent for Zep and approaches the 87.52 percent Full-Context upper bound.
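
The headline deltas check out with quick arithmetic (all values taken from the text above):

```python
# Storage: SigLIP@50 at 15.07GB vs MIRIX at 15.89MB.
storage_rag_mb = 15.07 * 1024    # GB -> MB
storage_mirix_mb = 15.89
reduction = 1 - storage_mirix_mb / storage_rag_mb
print(f"storage reduction: {reduction:.1%}")           # ~99.9%

# ScreenshotVQA accuracy: 0.5950 vs 0.4410.
acc_gain = 0.5950 - 0.4410
print(f"ScreenshotVQA accuracy gain: {acc_gain:.4f}")  # 0.1540

# LOCOMO: 85.38% vs Zep's 79.09%.
locomo_gain = 85.38 - 79.09
print(f"LOCOMO gain over Zep: {locomo_gain:.2f}")      # 6.29
```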


BENCHMARK

LOCOMO Overall LLM-as-a-Judge Accuracy with gpt-4.1-mini

Overall accuracy (%) on LOCOMO conversations using gpt-4.1-mini as backbone.

KEY INSIGHT

The Counterintuitive Finding

MIRIX outperforms the long-context Gemini baseline on ScreenshotVQA by 410 percent in relative accuracy while using 93.3 percent less storage.

This is surprising because one might expect feeding 3,600 images directly into a powerful long-context model to beat storing only distilled SQLite memories.

WHY IT MATTERS

What this unlocks for the field

MIRIX shows that multi-agent, typed memory with Active Retrieval can handle month-long multimodal histories on commodity storage.

Builders can now ship assistants and wearables that remember screenshots, workflows, and credentials over weeks without massive vector stores or retraining.


