WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

2025

TL;DR

WorldMM uses adaptive multimodal episodic, semantic, and visual memories with iterative retrieval to reach 69.5% average accuracy, +8.4 points over M3-Agent on long video QA.

THE PROBLEM

Long video agents miss events beyond context limits, and fixed-length clip retrieval mismatches temporal scope

WorldMM targets long videos where a day-long stream sampled at 1 fps still yields over 80k frames (24 h × 3,600 s/h × 1 frame/s = 86,400 frames), far beyond typical video LLM context limits.

Existing memory agents rely on text-only summaries and fixed clip lengths, causing missed visual details and mismatched temporal scopes that break long-horizon question answering.

HOW IT WORKS

WorldMM — Multimodal Episodic, Semantic, and Visual Memory with Adaptive Retrieval

WorldMM’s core mechanism combines Episodic Memory, Semantic Memory, Visual Memory, an Adaptive Retrieval Agent, and a Response Agent to build and query multimodal memories across timescales.

You can think of WorldMM as a brain: a hippocampus for episodic graphs, a cortex for semantic habits, and a visual cortex, all indexed by a librarian-like retrieval agent.

This design lets WorldMM pull only the necessary multi-scale text and visual evidence instead of stuffing everything into a single context window, enabling grounded reasoning over hour- to week-long videos.
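
Below is a minimal Python sketch of this query-time loop. The names (plan_next_query, RetrievalStep, the memories dict) are illustrative assumptions, not the paper's actual API; it simply shows an agent iterating over the three memory stores until it decides to stop.

# Minimal sketch of WorldMM's query-time loop; all names are illustrative,
# not the paper's API. The retrieval agent picks a memory store and a
# modality-specific query each turn, then the response agent answers.
from dataclasses import dataclass

@dataclass
class RetrievalStep:
    memory_type: str  # "episodic", "semantic", or "visual"
    query: str
    evidence: list

def answer_question(question, retrieval_agent, response_agent, memories, max_steps=8):
    history = []
    for _ in range(max_steps):
        # Decide what to fetch next, given the question and everything so far.
        action = retrieval_agent.plan_next_query(question, history)
        if action.memory_type == "STOP":
            break
        evidence = memories[action.memory_type].search(action.query)
        history.append(RetrievalStep(action.memory_type, action.query, evidence))
    # Ground the final answer in the accumulated multimodal evidence.
    return response_agent.generate(question, history)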

DIAGRAM

Adaptive Multimodal Retrieval Sequence

This diagram shows how WorldMM’s retrieval agent iteratively calls episodic, semantic, and visual memories before handing results to the response agent.

DIAGRAM

Multimodal Memory Construction Pipeline

This diagram shows how WorldMM constructs episodic graphs, semantic graphs, and visual memories from long video streams before inference.
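
A condensed sketch of that offline pipeline, assuming per-segment captioning at a few fixed temporal scales, consolidation of recurring facts into semantic memory, and timestamped frame features for visual memory. The caption_segment, embed_frame, and consolidate functions are placeholders, and the 30 s / 5 min / 1 h scales are illustrative, not the paper's exact setup.

# Sketch of offline memory construction; captioner, embedder, consolidator,
# and the three temporal scales are placeholders, not the paper's exact setup.
def build_memories(frames, fps, caption_segment, embed_frame, consolidate):
    episodic = {}
    # Episodic memory: caption the stream at several temporal granularities
    # so retrieval can later match a question's temporal scope.
    for scale_s in (30, 300, 3600):  # 30 s, 5 min, 1 h windows
        window = int(scale_s * fps)
        episodic[scale_s] = [
            caption_segment(frames[i:i + window])
            for i in range(0, len(frames), window)
        ]
    # Visual memory: dense features stored with timestamps for direct lookup.
    visual = [(i / fps, embed_frame(f)) for i, f in enumerate(frames)]
    # Semantic memory: consolidate recurring entities and habits from captions.
    semantic = consolidate(episodic)
    return episodic, semantic, visual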

PROCESS

How WorldMM Handles a Long Video Question Answering Session

  1. Multimodal Memory Construction

    WorldMM builds Episodic Memory graphs at multiple temporal scales, Semantic Memory via consolidation, and Visual Memory with features and timestamps before any query arrives.

  2. Adaptive Memory Retrieval

    WorldMM’s Retrieval Agent iteratively issues modality-specific queries to Episodic Memory, Semantic Memory, and Visual Memory, conditioned on the user question and the retrieval history.

  3. Episodic Memory Retrieval

    WorldMM runs Personalized PageRank over multi-scale episodic graphs, then an LLM reranker selects the top captions across timescales for the current query (see the sketch after this list).

  4. Response Generation

    WorldMM’s Response Agent combines the user query, the retrieved episodic, semantic, and visual evidence, and the retrieval history to generate a grounded answer.
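
The PPR sketch referenced in step 3, using networkx. The graph schema (caption objects carrying a node_id) and the seeding of the personalization vector from query-matched nodes are assumptions about how the pieces fit together.

# Sketch of PPR-based episodic retrieval; the caption/node schema and
# the LLM reranker interface are assumptions, not the paper's code.
import networkx as nx

def retrieve_episodic(graph, query_nodes, captions, llm_rerank, top_k=20):
    # Personalization vector: concentrate restart probability on nodes
    # that match the current query (networkx normalizes the weights).
    seeds = {n: 1.0 for n in query_nodes}
    scores = nx.pagerank(graph, alpha=0.85, personalization=seeds)
    # Rank candidate captions across timescales by the PPR score of
    # the graph node each caption is attached to.
    ranked = sorted(captions, key=lambda c: scores.get(c.node_id, 0.0), reverse=True)
    # An LLM reranker makes the final selection for the current query.
    return llm_rerank(ranked[:top_k])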

KEY CONTRIBUTIONS

Key Contributions

  • Multimodal Multi-Scale Memory Design

    WorldMM introduces Episodic Memory, Semantic Memory, and Visual Memory that jointly cover timescales from seconds to hours with textual graphs and visual features for long video reasoning.

  • Adaptive Memory Retrieval Agent

    WorldMM’s Retrieval Agent iteratively selects the memory type and temporal granularity to query, using PPR-based episodic and semantic retrieval plus dual-mode visual retrieval until it emits a STOP signal (see the sketch after this list).

  • State-of-the-Art Long Video QA Performance

    WorldMM-GPT reaches 69.5% average accuracy over EgoLifeQA, Ego-R1 Bench, HippoVlog, LVBench, and Video-MME (L), improving over M3-Agent’s 55.1% by 14.4 points.
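
A sketch of what dual-mode visual retrieval could look like, under the assumption that the two modes are a text-to-visual embedding search and a direct timestamp-range lookup over the same store; text_embed and the (timestamp, feature) memory layout are placeholders, not the paper's implementation.

# Sketch of dual-mode visual retrieval; the two modes (embedding search
# vs. timestamp lookup) and the memory layout are assumptions.
import numpy as np

def visual_search(visual_memory, text_embed, query=None, time_range=None, top_k=8):
    # visual_memory: list of (timestamp_seconds, feature_vector) pairs.
    if time_range is not None:
        # Mode 1: temporal lookup, e.g. "what happened around hour 14 on day 3".
        lo, hi = time_range
        return [(t, f) for t, f in visual_memory if lo <= t <= hi]
    # Mode 2: cosine similarity between the text query and frame features.
    q = text_embed(query)
    q = q / np.linalg.norm(q)
    scored = [
        (float(np.dot(q, f) / np.linalg.norm(f)), t, f) for t, f in visual_memory
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(t, f) for _, t, f in scored[:top_k]]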

RESULTS

By the Numbers

EgoLifeQA: 65.6% (+6.0 over HippoRAG)
Ego-R1 Bench: 65.3% (+9.3 over HippoRAG)
HippoVlog: 78.3% (+15.1 over HippoMM)
Average: 69.5% (+8.4 over M3-Agent)

On EgoLifeQA, Ego-R1 Bench, HippoVlog, LVBench, and Video-MME (L), which test hour- to week-long video question answering, WorldMM’s 69.5% average accuracy shows that adaptive multimodal memory retrieval scales long video reasoning beyond prior memory agents.

BENCHMARK

Performance of WorldMM against various baselines across long video QA benchmarks

Average accuracy across EgoLifeQA, Ego-R1 Bench, HippoVlog, LVBench, and Video-MME (L).

KEY INSIGHT

The Counterintuitive Finding

WorldMM’s full memory configuration (E+S+V) improves HabitInsight accuracy to 76.9%, a 23-point gain over the episodic-plus-visual setting (E+V) at 53.9%.

This is surprising because many systems assume episodic clips and visual frames suffice, but WorldMM shows that explicitly maintained semantic memory is crucial for long-term habit reasoning.

WHY IT MATTERS

What this unlocks for the field

WorldMM unlocks long-horizon, multimodal reasoning where an agent can answer questions about week-long egocentric streams using adaptive episodic, semantic, and visual memories.

Builders can now design agents that selectively combine multi-scale textual graphs and visual evidence instead of brute-forcing huge context windows, enabling practical deployment on ultra-long videos.
