Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Authors: Quanting Xie, So Yeon Min, Pengliang Ji et al.

2024

TL;DR

Embodied-RAG builds a hierarchical semantic forest over topological embodied graphs and uses LLM-guided tree traversal, reaching P(Q|A) = 0.67 on implicit multimodal queries versus 0.13 for LightRAG (+0.54).



THE PROBLEM

RAG on embodied data fails to handle multimodal, redundant trajectories

Embodied experiences are multimodal, highly correlated tuples E_t = (τ_t, s_t, p_t), and naive RAG lacks the cross-document structure needed to retrieve effectively over such data.
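To make the tuple concrete, here is a minimal sketch of one experience record, assuming τ_t is the capture timestamp, s_t the sensor observation (frame plus caption), and p_t the robot pose; the field names are illustrative, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One embodied experience E_t = (tau_t, s_t, p_t); field meanings assumed."""
    timestamp: float                   # tau_t: when the frame was captured
    image_path: str                    # s_t: the raw camera frame
    caption: str                       # s_t: LLM-generated description of the frame
    pose: tuple[float, float, float]   # p_t: (x, y, heading) in the map frame

# Consecutive experiences along a trajectory are highly correlated:
# neighboring frames show nearly the same scene, which is why flat
# chunk-level RAG retrieves redundant, near-duplicate hits.
```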

When robots explore kilometer-scale environments, dense metric maps and flat text-chunk RAG become intractable and misaligned with semantic queries, breaking both navigation and explanation capabilities.

Existing graphical RAG methods are too slow to build and query for real-time deployment on embodied agents.

HOW IT WORKS

Embodied-RAG — Semantic Forest over Topological Maps

Embodied-RAG constructs a Topological Map of poses, timestamps, images, and captions, then hierarchically clusters its nodes into a Semantic Forest of LLM-generated area summaries that supports Top-down Retrieval.

You can think of the Semantic Forest as a card catalog over the robot’s experiences, where nearby cards are bundled into shelves, and higher shelves summarize entire regions.

This hierarchical memory plus LLM-guided traversal lets Embodied-RAG answer abstract navigation and explanation queries that a flat context window or naive similarity search cannot handle.

DIAGRAM

Top-down Retrieval Flow in Embodied-RAG

This diagram shows how Embodied-RAG performs two-phase top-down retrieval over the semantic forest using LLM selection and hybrid re-ranking.
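As a rough illustration of the first phase, here is a minimal sketch of LLM-guided top-down traversal, assuming a forest of summary nodes; llm_pick_children stands in for the paper's selection prompt and is a hypothetical callback.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                          # LLM area summary, or a leaf caption
    children: list["Node"] = field(default_factory=list)

def retrieve_top_down(query, roots, llm_pick_children, beam=2, max_leaves=5):
    """Descend from root summaries toward base nodes, LLM-pruning each level.

    The LLM reads the children's summaries and keeps the `beam` most relevant
    to the query, so the number of LLM calls grows with tree depth rather
    than with the total number of stored experiences.
    """
    frontier = list(roots)
    leaves = []
    while frontier and len(leaves) < max_leaves:
        node = frontier.pop(0)
        if not node.children:             # base node: an actual experience
            leaves.append(node)
            continue
        frontier.extend(llm_pick_children(query, node.children)[:beam])
    return leaves
```

The second phase shown in the diagram would then re-rank these leaves with hybrid spatial and sensor scores before handing them to the generation LLM.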

DIAGRAM

Embodied-Experiences Dataset and Evaluation Pipeline

This diagram shows how Embodied-RAG uses the Embodied-Experiences dataset to evaluate Find and Explain queries against RAG baselines.

PROCESS

How Embodied-RAG Handles a Query Session

  1. Bottom-up Memory Construction

    Embodied-RAG first builds a Topological Map from robot trajectories and captions, then clusters its nodes into a Semantic Forest using a hybrid spatial-semantic similarity.

  2. Topological Map

    Embodied-RAG stores poses, timestamps, images, and GPT-4o captions in graph nodes, connecting nodes along robot paths or when they fall within a distance threshold α.

  3. Semantic Forest

    Embodied-RAG applies complete-linkage clustering with S_spatial and S_semantic, then uses an LLM summarizer to create hierarchical area summaries as new parent nodes (see the clustering sketch after this list).

  4. Top-down Retrieval and Generation

    Embodied-RAG runs LLM-based selection over the Semantic Forest, re-ranks the reached base nodes with spatial and sensor scores, then feeds them to a generation LLM that produces answers or navigation waypoints.
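A minimal sketch of one bottom-up clustering level, assuming the hybrid score is a weighted mix of normalized spatial distance and caption-embedding cosine distance; the weight lam, cut threshold t, and the summarize callback are illustrative knobs, not the paper's exact formulation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def build_forest_level(positions, caption_embeddings, summarize, lam=0.5, t=0.5):
    """One level of bottom-up construction: cluster, then summarize clusters.

    positions: (n, 2) array of node coordinates; caption_embeddings: (n, d)
    array of caption embeddings; summarize: LLM callback over member indices.
    """
    # Hybrid distance in condensed form, combining S_spatial and S_semantic.
    d_spatial = pdist(positions)                      # Euclidean, in meters
    d_semantic = pdist(caption_embeddings, "cosine")  # 1 - cosine similarity
    d_hybrid = lam * d_spatial / d_spatial.max() + (1 - lam) * d_semantic

    # Complete linkage merges clusters by their farthest pair, which keeps
    # clusters compact in both space and meaning.
    labels = fcluster(linkage(d_hybrid, method="complete"), t, criterion="distance")

    # Each cluster becomes a new parent node whose text is an LLM summary of
    # its members; repeating this level by level yields the Semantic Forest.
    return {c: summarize(np.where(labels == c)[0]) for c in sorted(set(labels))}
```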

KEY CONTRIBUTIONS

Key Contributions

  • Task

    Embodied-RAG extends RAG into embodied settings by defining Find and Explain queries over topological graphs and Semantic Forest memory across 19 environments.

  • Dataset

    Embodied-RAG introduces the Embodied-Experiences Dataset with 14 simulated and 5 real environments, including a 3,525-node street-view graph and over 200 queries.

  • Method

    Embodied-RAG combines Bottom-up Memory Construction and Top-down Retrieval to reach P(Q|A) = 0.67 on implicit multimodal queries while building graph memory 9.76× faster than LightRAG.

RESULTS

By the Numbers

  • P(Q|A), implicit queries (Q only): 0.67 (+0.54 over LightRAG on E-multimodal implicit queries)

  • P(Q|A), explicit queries (Q only): 0.58 (+0.50 over GraphRAG on E-multimodal explicit queries)

  • SS(A, A_e), global queries (Q, S): 0.95 (+0.17 over LightRAG on E-multimodal global queries with sensors)

  • Graph build time: 1.0× (7.38× faster than GraphRAG and 9.76× faster than LightRAG)

These results come from the E-image and E-multimodal Embodied-Experiences datasets, which test explicit, implicit, and global queries. The main result shows that Embodied-RAG retrieves relevant embodied memories and plans paths far more accurately than Naive-RAG, GraphRAG, and LightRAG while building memory much faster.
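For readers unfamiliar with the score names: P(Q|A) measures how well a generated answer satisfies the query, and SS(A, A_e) reads as a semantic-similarity score between the generated answer A and an expert reference A_e. Below is a minimal sketch of an embedding-based SS scorer, which is one common realization and an assumption here, not necessarily the paper's exact judge.

```python
import numpy as np

def semantic_similarity(answer_vec: np.ndarray, expert_vec: np.ndarray) -> float:
    """Cosine similarity between embeddings of answer A and expert answer A_e,
    mapped from [-1, 1] to [0, 1]. One plausible SS(A, A_e); assumption only.
    """
    cos = float(answer_vec @ expert_vec
                / (np.linalg.norm(answer_vec) * np.linalg.norm(expert_vec)))
    return (cos + 1.0) / 2.0
```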


BENCHMARK

Chart: P(Q|A) for implicit Find queries (Q only) on the E-multimodal dataset, broken down by input type.

KEY INSIGHT

The Counterintuitive Finding

Embodied-RAG builds structured graph memory 9.76× faster than LightRAG while also achieving higher retrieval scores, e.g. P(Q|A) = 0.67 vs 0.13.

This is surprising because richer hierarchical structure is usually assumed to be more expensive, yet Embodied-RAG's Semantic Forest needs fewer LLM calls and less graph complexity than text-only graph builders.

Embodied-RAG shows that leveraging spatial priors can simultaneously speed up and improve non-parametric memory for embodied agents.
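A back-of-envelope way to see why the hierarchy is cheap: a flat entity-graph builder issues roughly one LLM extraction call per chunk, while the forest only needs one summary call per cluster per level, a geometric series that stays well below the leaf count. The call-count model below is an illustrative assumption, not the paper's accounting.

```python
def forest_summary_calls(n_leaves: int, branching: int = 10) -> int:
    """LLM summary calls to build a semantic forest over n_leaves base nodes,
    assuming ~`branching` children per parent at every level (illustrative).
    """
    calls, level = 0, n_leaves
    while level > 1:
        level = -(-level // branching)  # ceil division: parents at this level
        calls += level
    return calls

# e.g. the 3,525-node street-view graph with branching 10 needs ~394 summary
# calls, versus thousands of per-chunk extraction calls for a flat entity graph.
print(forest_summary_calls(3525))  # 394
```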

WHY IT MATTERS

What this unlocks for the field

Embodied-RAG unlocks scalable, hierarchical non-parametric memory for robots, supporting both navigation waypoints and natural-language explanations across kilometer-scale environments.

Builders can now plug RAG-style semantic memory directly into drones, quadrupeds, and LoCoBots as a global planner, handling implicit queries like “find a quiet spot to read” that were previously impractical.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa · 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang · 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
