Episodic Memory Question Answering

Authors: Samyak Datta, Sameer Dharur, Vincent Cartillier et al.

2022

TL;DR

Episodic Memory Question Answering uses allocentric spatiotemporal scene memory plus LingUNet grounding to reach 29.11 IoU and 62.27 recall, beating SMNetDecoder by 2.19 IoU and 18.41 recall.



THE PROBLEM

Egocentric assistants lack persistent episodic memory for localization

Episodic Memory Question Answering shows that language-only baselines reach only 4.75 IoU and 14.41 recall, and egocentric buffers nearly collapse to 0.01 IoU.

Without persistent scene memory, egocentric assistants cannot answer “where” and “when” questions such as “where did you last see my keys?”, limiting their usefulness as real AR assistants.

HOW IT WORKS

Episodic Memory Question Answering — allocentric spatiotemporal scene memory plus LingUNet grounding

Episodic Memory Question Answering builds an allocentric top-down semantic feature map, augments it with spatiotemporal memory, and uses a LingUNet-based question-answering model over this map.

Think of the allocentric map as a 2D RAM-like floorplan, while the spatiotemporal channels act like a hippocampal log of when each cell was observed.

This design lets Episodic Memory Question Answering reason over what, where, and when across entire tours, instead of being constrained by a short context window of egocentric frames.
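As a rough mental model, this memory can be sketched as a feature grid plus a stack of per-chunk “observed” masks. The numpy sketch below is illustrative only: the shapes (`H`, `W`, `F`, `T`) are made up, and the simple overwrite update stands in for the paper's GRU-based accumulation.

```python
import numpy as np

# Hypothetical sketch of the allocentric spatiotemporal scene memory.
H, W = 250, 250        # top-down map cells (each cell covers a small floor patch)
F, T = 256, 20         # semantic feature dim; number of temporal chunks of the tour

semantic_map = np.zeros((H, W, F), dtype=np.float32)   # "what" and "where"
observed_when = np.zeros((H, W, T), dtype=bool)        # "when": per-chunk visibility

def record_observation(cell_rc, features, t_chunk):
    """Write features into a map cell and log the tour chunk it was seen in."""
    r, c = cell_rc
    semantic_map[r, c] = features          # the real model uses a GRU update, not overwrite
    observed_when[r, c, t_chunk] = True

record_observation((120, 80), np.random.rand(F).astype(np.float32), t_chunk=3)

# Concatenating features with temporal masks yields the spatiotemporal memory.
memory = np.concatenate([semantic_map, observed_when.astype(np.float32)], axis=-1)
print(memory.shape)  # (250, 250, 276)
```

The key design point is that time becomes just another set of map channels, so a downstream grounding model can attend to “what”, “where”, and “when” in a single pass.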

DIAGRAM

Question-time inference flow for Episodic Memory Question Answering

This diagram shows how Episodic Memory Question Answering processes a stored tour and a new question to produce a localized answer.

DIAGRAM

EMQA dataset and training pipeline

This diagram shows how Episodic Memory Question Answering constructs EMQA data and trains RedNet, SMNet, and LingUNet in stages.

PROCESS

How Episodic Memory Question Answering Handles an EMQA episode

  1.

    Guided exploration tours

    Episodic Memory Question Answering takes a pre-recorded guided tour of roughly 2,500 steps of RGB-D frames with oracle pose as input.

  2.

    Scene memory representation

    Episodic Memory Question Answering uses RedNet features, projects them to a 2D floorplan, and accumulates them with a GRU into an allocentric semantic map.

  3.

    Spatiotemporal memory

    Episodic Memory Question Answering stacks per-step observed masks channel-wise so each map cell encodes when it was seen during the tour.

  4.

    Question answering with LingUNet

    Episodic Memory Question Answering encodes the question with an LSTM, conditions LingUNet on this embedding and the spatiotemporal map, and outputs an answer heatmap.
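The final step above can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: the real model encodes the question with an LSTM and runs a multi-scale LingUNet, whereas here a single language-derived 1×1 filter scores each map cell, and all names and shapes are assumptions.

```python
import numpy as np

# Toy sketch of LingUNet-style question conditioning over the scene memory.
rng = np.random.default_rng(0)
H, W, F = 64, 64, 32

map_feats = rng.standard_normal((H, W, F)).astype(np.float32)  # scene memory features
q_embed = rng.standard_normal(F).astype(np.float32)            # stand-in for the LSTM question encoding

# LingUNet derives conv filters from the language embedding and applies them to
# the visual features, so the map is filtered "through" the question.
lang_filter = q_embed / np.linalg.norm(q_embed)
heatmap_logits = map_feats @ lang_filter        # (H, W) language-conditioned scores

# Softmax over all cells -> answer heatmap; the argmax cell is the prediction.
probs = np.exp(heatmap_logits - heatmap_logits.max())
heatmap = probs / probs.sum()
r, c = np.unravel_index(heatmap.argmax(), heatmap.shape)
print(heatmap.shape)
```

Conditioning the filters on language (rather than concatenating the question once) is what lets the same map answer very different queries about the same tour.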

KEY CONTRIBUTIONS

Key Contributions

  •

    Episodic Memory Question Answering task

    Episodic Memory Question Answering defines EMQA, where an egocentric assistant receives a tour and must localize answers to spatial and spatio-temporal questions on the tour or the floorplan.

  •

    Spatiotemporal allocentric scene memory

    Episodic Memory Question Answering extends SMNet-style top-down semantic maps with temporal channels, encoding when each 2cm × 2cm cell was observed during the tour.

  •

    Robust EMQA model and analysis

    Episodic Memory Question Answering combines this spatiotemporal memory with LingUNet, surpassing SMNetDecoder by 2.19 IoU and 18.41 recall, and remains robust under noisy pose and real-world RGB-D inputs.

RESULTS

By the Numbers

IoU

29.11

+2.19 over SMNetDecoder

Recall

62.27

+18.41 over SMNetDecoder

Precision

33.39

-7.56 vs EgoSemSeg

IoU (egocentric pixels)

29.78

+2.65 over SMNetDecoder (egocentric)

These numbers are on the EMQA benchmark built from Matterport3D tours, measuring localization quality on top-down maps and egocentric pixels. The main result shows Episodic Memory Question Answering improves IoU from 26.92 to 29.11 and recall from 43.86 to 62.27 over SMNetDecoder while remaining robust across output spaces.


BENCHMARK

EMQA top-down IoU on the test split

Intersection-over-Union for answer localization in the top-down map output space.
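For concreteness, the IoU, recall, and precision figures above follow the standard definitions over binary answer masks. The helper below is a generic sketch of those definitions; how the paper thresholds predicted heatmaps into masks is an assumption not shown here.

```python
import numpy as np

def iou_recall_precision(pred, gt):
    """Standard localization metrics for binary answer masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 0.0
    recall = inter / gt.sum() if gt.sum() else 0.0
    precision = inter / pred.sum() if pred.sum() else 0.0
    return float(iou), float(recall), float(precision)

# Two 4x4 squares overlapping in a 2x2 corner (4 cells of overlap).
pred = np.zeros((10, 10), dtype=bool); pred[2:6, 2:6] = True   # 16 predicted cells
gt = np.zeros((10, 10), dtype=bool);   gt[4:8, 4:8] = True     # 16 ground-truth cells
print(iou_recall_precision(pred, gt))  # overlap = 4 -> IoU 4/28, recall 4/16, precision 4/16
```

Note how recall can be much higher than IoU when predictions are generous, which matches the result pattern above (62.27 recall vs 29.11 IoU).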

KEY INSIGHT

The Counterintuitive Finding

Episodic Memory Question Answering with temporal features boosts recall from 60.81 to 62.27, yet IoU for purely spatial questions barely changes.

This is surprising because one might expect temporal channels to dilute spatial localization, but Episodic Memory Question Answering preserves spatial quality while sharply improving spatio-temporal reasoning.

WHY IT MATTERS

What this unlocks for the field

Episodic Memory Question Answering enables egocentric assistants to answer “where first” and “where last” questions by grounding them in persistent spatiotemporal maps.

Builders can now design AR agents that remember entire home tours and localize objects on floorplans, instead of relying on short-term egocentric buffers or language-only biases.


