DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

Authors: Peiqi Liu, Zhanqiu Guo, Mohit Warke, et al.

2024

TL;DR

DynaMem uses an online dynamic 3D voxel memory with add/remove updates and hybrid VLM-mLLM querying to reach 70% pick-and-drop success on non-stationary objects, versus 30% for OK-Robot.



THE PROBLEM

Open-vocabulary mobile manipulation breaks in changing environments (70% vs 30% gap)

Existing open-vocabulary mobile manipulation systems assume static environments, reaching only 30% success on non-stationary objects compared with DynaMem's 70%.

When objects move, a static spatio-semantic memory fails to localize goals, so navigation and manipulation stacks like OK-Robot chase outdated locations.

HOW IT WORKS

DynaMem — Dynamic 3D Voxel Map with hybrid querying

DynaMem combines a Dynamic 3D Voxel Map, Embedded Vision-Language Features, Multimodal Large Language Models, and Exploration Primitives to maintain and query an online spatio-semantic memory.

You can think of DynaMem like RAM blocks that constantly rewrite 3D feature cells, while an mLLM acts as a librarian pointing to the most relevant recent snapshots.

This key mechanism of add/remove voxel updates plus hybrid VLM-mLLM grounding lets DynaMem forget outdated geometry and abstain on missing objects, which a plain context window cannot.
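The add/remove memory can be sketched as a voxel hash map with running-average semantic features. This is a minimal illustration under assumed names (`Voxel`, `DynamicVoxelMemory`, `add`, `remove` are hypothetical, not the paper's API); it only shows the bookkeeping, not the upstream segmentation or ray-casting logic.

```python
import time
from dataclasses import dataclass

import numpy as np


@dataclass
class Voxel:
    """One cell of a dynamic spatio-semantic memory (hypothetical schema)."""
    count: int = 0              # observations aggregated so far
    image_id: int = -1          # latest image that observed this voxel
    feature: np.ndarray = None  # running-average CLIP/SigLIP feature
    last_seen: float = 0.0      # timestamp of the latest observation


class DynamicVoxelMemory:
    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxels = {}        # (i, j, k) -> Voxel

    def _key(self, point):
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def add(self, point, feature, image_id):
        """Insert or update a voxel, averaging in the new semantic feature."""
        v = self.voxels.setdefault(self._key(point), Voxel())
        feature = np.asarray(feature, dtype=float)
        if v.feature is None:
            v.feature = feature
        else:
            v.feature = (v.feature * v.count + feature) / (v.count + 1)
        v.count += 1
        v.image_id = image_id
        v.last_seen = time.time()

    def remove(self, point):
        """Forget a voxel that the ray-cast check decided should be gone
        (it should have occluded the camera ray but was not observed)."""
        self.voxels.pop(self._key(point), None)
```

The important property is that memory is mutable in both directions: stale geometry is deleted rather than merely down-weighted, so queries never match objects that have since moved.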

DIAGRAM

DynaMem query and confirmation pipeline for object localization

This diagram shows how DynaMem processes a text query, selects candidate images via voxel-feature similarity or mLLM question answering, and confirms object presence with OWLv2 before returning 3D coordinates.

DIAGRAM

DynaBench offline evaluation and ablation flow

This diagram shows how DynaMem is evaluated on DynaBench with different query variants and ablations against human performance.

PROCESS

How DynaMem Handles an Open Vocabulary Mobile Manipulation Task

  1. Dynamic 3D Voxel Map

     DynaMem converts posed RGB-D images into a sparse voxel grid, storing location, observation count, image ID, semantic features, and latest observation time.

  2. Adding Points

     DynaMem backprojects depth, segments with SAM v2, aggregates CLIP or SigLIP features, and updates voxel counts and averaged features for newly observed points.

  3. Removing Points

     DynaMem ray-casts through the camera frustum, projecting voxels and deleting those that should be occluded but are not, handling moved or removed objects.

  4. Querying DynaMem for Object Localization

     DynaMem uses Embedded Vision-Language Features or Multimodal Large Language Models plus OWLv2 to find the latest valid image and return a 3D object location, or abstains.
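The querying step above can be sketched as a three-stage fallback: rank stored voxel features against the text query, consult an mLLM when the feature route is inconclusive, and confirm with an open-vocabulary detector before committing to a 3D location. All names here (`localize`, `detector_confirms`, `mllm_pick`, the threshold value) are illustrative stand-ins, not the paper's implementation; the detector and mLLM are assumed to be passed in as callables.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def localize(query_feature, voxels, detector_confirms, mllm_pick=None,
             sim_threshold=0.25):
    """Return a 3D location for the query, or None to abstain.

    voxels: list of (position, feature, image_id) tuples.
    detector_confirms(image_id) stands in for the OWLv2 cross-check;
    mllm_pick(image_ids) stands in for mLLM question answering.
    """
    # 1. VLM-feature route: rank voxels by similarity to the query.
    scored = sorted(voxels, key=lambda v: cosine(query_feature, v[1]),
                    reverse=True)
    best = scored[0] if scored else None
    if best is not None and cosine(query_feature, best[1]) >= sim_threshold:
        if detector_confirms(best[2]):  # detector cross-check on that image
            return best[0]
    # 2. mLLM route: ask the model to pick among the top candidate images.
    if mllm_pick is not None and scored:
        picked = mllm_pick([v[2] for v in scored[:5]])
        for pos, _, image_id in scored:
            if image_id == picked and detector_confirms(image_id):
                return pos
    # 3. Abstain: the object is not currently in memory.
    return None
```

Returning `None` rather than a low-confidence guess is what lets the downstream system trigger exploration primitives instead of navigating to a stale location.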

KEY CONTRIBUTIONS

Key Contributions

  • Dynamic 3D Voxel Memory for changing environments

    DynaMem introduces a Dynamic 3D Voxel Map with add and remove operations, storing observation counts, image IDs, semantic features, and timestamps for online updates.

  • Hybrid VLM and mLLM querying with abstention

    DynaMem combines Embedded Vision-Language Features, Multimodal Large Language Models, and OWLv2 cross-checks to localize objects or explicitly return "object not found".

  • DynaBench dynamic benchmark and real-robot evaluation

    DynaMem is evaluated on DynaBench with nine environments and on Stretch SE3, achieving 70% success on non-stationary objects versus 30% for OK-Robot.

RESULTS

By the Numbers

Pick and drop success

70% success

+40 percentage points over OK-Robot static voxelmap baseline at 30% success

Dynamic object navigation failure

6.7% failure

-46.6 percentage points vs OK-Robot dynamic object navigation failure at 53.3%

Human success on DynaBench

81.9% success

DynaMem hybrid querying reaches 74.5%, 7.4 percentage points below humans

Hybrid query success

74.5% success

+3.9 percentage points over VLM feature default at 70.6% on DynaBench

These numbers come from real Stretch SE3 experiments and the DynaBench offline benchmark, which test dynamic 3D visual grounding and open-vocabulary mobile manipulation. The main result shows that DynaMem makes dynamic environments tractable, closing much of the gap to human performance while more than doubling success over static OK-Robot.


BENCHMARK

Real-world pick-and-drop success on non-stationary objects

Success rate for open-vocabulary pick-and-drop tasks with changing object locations.

BENCHMARK

DynaBench ablation: query variants and human upper bound

Success rate on DynaBench for different DynaMem query variants and human participants.

KEY INSIGHT

The Counterintuitive Finding

On DynaBench, DynaMem with simple VLM feature querying already reaches 70.6% success, only 11.3 percentage points below human performance at 81.9%.

This is surprising because one might expect complex mLLM question answering alone to dominate, yet DynaMem's voxel-feature pipeline plus OWLv2 cross-check is already highly competitive.

WHY IT MATTERS

What this unlocks for the field

DynaMem enables robots to maintain and query a live 3D spatio-semantic memory that gracefully handles objects appearing, moving, and disappearing over time.

Builders can now design open-vocabulary mobile manipulation systems that explore, re-search, and update maps online, instead of assuming static pre-mapped scenes or blindly trusting outdated observations.


