Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Authors: Quanting Xie, So Yeon Min, Pengliang Ji et al.

2024

TL;DR

Embodied-RAG builds a hierarchical semantic forest over topological embodied graphs and uses LLM-guided tree traversal, reaching P(Q|A) = 0.67 on implicit multimodal queries versus 0.13 for LightRAG (+0.54).



THE PROBLEM

RAG on embodied data fails to handle multimodal, redundant trajectories

Embodied experiences are multimodal, highly correlated tuples E_t = (τ_t, s_t, p_t), and naive RAG lacks the cross-document structure needed to retrieve effectively over such data.
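To make the tuple concrete, here is a minimal sketch of one experience record, assuming τ_t is the capture timestamp, s_t the sensor observation (frame plus caption), and p_t the robot pose; the field names are illustrative, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One embodied experience E_t = (tau_t, s_t, p_t); field meanings assumed."""
    timestamp: float                   # tau_t: when the frame was captured
    image_path: str                    # s_t: the raw camera frame
    caption: str                       # s_t: LLM-generated description of the frame
    pose: tuple[float, float, float]   # p_t: (x, y, heading) in the map frame

# Consecutive experiences along a trajectory are highly correlated:
# neighboring frames show nearly the same scene, which is why flat
# chunk-level RAG retrieves redundant, near-duplicate hits.
```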

When robots explore kilometer-scale environments, dense metric maps and flat text-chunk RAG become intractable and misaligned with semantic queries, breaking both navigation and explanation capabilities.

Existing graphical RAG methods are too slow to build and query for real-time deployment on embodied agents.

HOW IT WORKS

Embodied-RAG — Semantic Forest over Topological Maps

Embodied-RAG constructs a Topological Map of poses, timestamps, images, and captions, then hierarchically clusters its nodes into a Semantic Forest of LLM-generated area summaries that supports Top-down Retrieval.

You can think of the Semantic Forest as a card catalog over the robot’s experiences, where nearby cards are bundled into shelves, and higher shelves summarize entire regions.

This hierarchical memory plus LLM-guided traversal lets Embodied-RAG answer abstract navigation and explanation queries that a flat context window or naive similarity search cannot handle.

DIAGRAM

Top-down Retrieval Flow in Embodied-RAG

This diagram shows how Embodied-RAG performs two-phase top-down retrieval over the semantic forest using LLM selection and hybrid re-ranking.
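As a rough illustration of the first phase, here is a minimal sketch of LLM-guided top-down traversal, assuming a forest of summary nodes; llm_pick_children stands in for the paper's selection prompt and is a hypothetical callback.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                          # LLM area summary, or a leaf caption
    children: list["Node"] = field(default_factory=list)

def retrieve_top_down(query, roots, llm_pick_children, beam=2, max_leaves=5):
    """Descend from root summaries toward base nodes, LLM-pruning each level.

    The LLM reads the children's summaries and keeps the `beam` most relevant
    to the query, so the number of LLM calls grows with tree depth rather
    than with the total number of stored experiences.
    """
    frontier = list(roots)
    leaves = []
    while frontier and len(leaves) < max_leaves:
        node = frontier.pop(0)
        if not node.children:             # base node: an actual experience
            leaves.append(node)
            continue
        frontier.extend(llm_pick_children(query, node.children)[:beam])
    return leaves
```

The second phase shown in the diagram would then re-rank these leaves with hybrid spatial and sensor scores before handing them to the generation LLM.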

DIAGRAM

Embodied-Experiences Dataset and Evaluation Pipeline

This diagram shows how Embodied-RAG uses the Embodied-Experiences dataset to evaluate Find and Explain queries against RAG baselines.

PROCESS

How Embodied-RAG Handles a Query Session

  1. Bottom-up Memory Construction

    Embodied-RAG first builds a Topological Map from robot trajectories and captions, then clusters its nodes into a Semantic Forest using a hybrid spatial-semantic similarity.

  2. Topological Map

    Embodied-RAG stores poses, timestamps, images, and GPT-4o captions in graph nodes, connecting nodes along robot paths or when they fall within a distance threshold α.

  3. Semantic Forest

    Embodied-RAG applies complete-linkage clustering with S_spatial and S_semantic, then uses an LLM summarizer to create hierarchical area summaries as new parent nodes (see the clustering sketch after this list).

  4. Top-down Retrieval and Generation

    Embodied-RAG runs LLM-based selection over the Semantic Forest, re-ranks the reached base nodes with spatial and sensor scores, then feeds them to a generation LLM that produces answers or navigation waypoints.
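A minimal sketch of one bottom-up clustering level, assuming the hybrid score is a weighted mix of normalized spatial distance and caption-embedding cosine distance; the weight lam, cut threshold t, and the summarize callback are illustrative knobs, not the paper's exact formulation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def build_forest_level(positions, caption_embeddings, summarize, lam=0.5, t=0.5):
    """One level of bottom-up construction: cluster, then summarize clusters.

    positions: (n, 2) array of node coordinates; caption_embeddings: (n, d)
    array of caption embeddings; summarize: LLM callback over member indices.
    """
    # Hybrid distance in condensed form, combining S_spatial and S_semantic.
    d_spatial = pdist(positions)                      # Euclidean, in meters
    d_semantic = pdist(caption_embeddings, "cosine")  # 1 - cosine similarity
    d_hybrid = lam * d_spatial / d_spatial.max() + (1 - lam) * d_semantic

    # Complete linkage merges clusters by their farthest pair, which keeps
    # clusters compact in both space and meaning.
    labels = fcluster(linkage(d_hybrid, method="complete"), t, criterion="distance")

    # Each cluster becomes a new parent node whose text is an LLM summary of
    # its members; repeating this level by level yields the Semantic Forest.
    return {c: summarize(np.where(labels == c)[0]) for c in sorted(set(labels))}
```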

KEY CONTRIBUTIONS

Key Contributions

  • Task

    Embodied-RAG extends RAG into embodied settings by defining Find and Explain queries over topological graphs and Semantic Forest memory across 19 environments.

  • Dataset

    Embodied-RAG introduces the Embodied-Experiences Dataset with 14 simulated and 5 real environments, including a 3,525-node street-view graph and over 200 queries.

  • Method

    Embodied-RAG combines Bottom-up Memory Construction and Top-down Retrieval to reach P(Q|A) = 0.67 on implicit multimodal queries while building graph memory 9.76× faster than LightRAG.

RESULTS

By the Numbers

  • P(Q|A), implicit queries (Q only): 0.67 (+0.54 over LightRAG on E-multimodal implicit queries)

  • P(Q|A), explicit queries (Q only): 0.58 (+0.50 over GraphRAG on E-multimodal explicit queries)

  • SS(A, A_e), global queries (Q, S): 0.95 (+0.17 over LightRAG on E-multimodal global queries with sensors)

  • Graph build time: 1.0× (7.38× faster than GraphRAG and 9.76× faster than LightRAG)

These results come from the E-image and E-multimodal Embodied-Experiences datasets, which test explicit, implicit, and global queries. The main result shows that Embodied-RAG retrieves relevant embodied memories and plans paths far more accurately than Naive-RAG, GraphRAG, and LightRAG while building memory much faster.
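For readers unfamiliar with the score names: P(Q|A) measures how well a generated answer satisfies the query, and SS(A, A_e) reads as a semantic-similarity score between the generated answer A and an expert reference A_e. Below is a minimal sketch of an embedding-based SS scorer, which is one common realization and an assumption here, not necessarily the paper's exact judge.

```python
import numpy as np

def semantic_similarity(answer_vec: np.ndarray, expert_vec: np.ndarray) -> float:
    """Cosine similarity between embeddings of answer A and expert answer A_e,
    mapped from [-1, 1] to [0, 1]. One plausible SS(A, A_e); assumption only.
    """
    cos = float(answer_vec @ expert_vec
                / (np.linalg.norm(answer_vec) * np.linalg.norm(expert_vec)))
    return (cos + 1.0) / 2.0
```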


BENCHMARK

Chart: P(Q|A) for implicit Find queries (Q only) on the E-multimodal dataset, broken down by input type.

KEY INSIGHT

The Counterintuitive Finding

Embodied-RAG builds structured graph memory 9.76× faster than LightRAG while also achieving higher retrieval scores, e.g. P(Q|A) = 0.67 vs 0.13.

This is surprising because richer hierarchical structure is usually assumed to be more expensive, yet Embodied-RAG's Semantic Forest needs fewer LLM calls and less graph complexity than text-only graph builders.

Embodied-RAG shows that leveraging spatial priors can simultaneously speed up and improve non-parametric memory for embodied agents.
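A back-of-envelope way to see why the hierarchy is cheap: a flat entity-graph builder issues roughly one LLM extraction call per chunk, while the forest only needs one summary call per cluster per level, a geometric series that stays well below the leaf count. The call-count model below is an illustrative assumption, not the paper's accounting.

```python
def forest_summary_calls(n_leaves: int, branching: int = 10) -> int:
    """LLM summary calls to build a semantic forest over n_leaves base nodes,
    assuming ~`branching` children per parent at every level (illustrative).
    """
    calls, level = 0, n_leaves
    while level > 1:
        level = -(-level // branching)  # ceil division: parents at this level
        calls += level
    return calls

# e.g. the 3,525-node street-view graph with branching 10 needs ~394 summary
# calls, versus thousands of per-chunk extraction calls for a flat entity graph.
print(forest_summary_calls(3525))  # 394
```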

WHY IT MATTERS

What this unlocks for the field

Embodied-RAG unlocks scalable, hierarchical non-parametric memory for robots, supporting both navigation waypoints and natural-language explanations across kilometer-scale environments.

Builders can now plug RAG-style semantic memory directly into drones, quadrupeds, and LoCoBots as a global planner, handling implicit queries like “find a quiet spot to read” that were previously impractical.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa · 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang · 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
