Episodic Memory Question Answering

Authors: Samyak Datta, Sameer Dharur, Vincent Cartillier et al.

2022

TL;DR

Episodic Memory Question Answering uses allocentric spatiotemporal scene memory plus LingUNet grounding to reach 29.11 IoU and 62.27 recall, beating SMNetDecoder by 2.19 IoU and 18.41 recall.



THE PROBLEM

Egocentric assistants lack persistent episodic memory for localization

Episodic Memory Question Answering shows that language-only baselines reach only 4.75 IoU and 14.41 recall, and egocentric buffers nearly collapse to 0.01 IoU.

Without persistent scene memory, egocentric assistants cannot answer “where” and “when” questions such as “where did you last see my keys?”, limiting their usefulness as real AR assistants.

HOW IT WORKS

Episodic Memory Question Answering — allocentric spatiotemporal scene memory plus LingUNet grounding

Episodic Memory Question Answering builds an allocentric top-down semantic feature map, augments it with spatiotemporal memory, and uses a LingUNet-based question-answering model over this map.

Think of the allocentric map as a 2D RAM-like floorplan, while the spatiotemporal channels act like a hippocampal log of when each cell was observed.

This design lets Episodic Memory Question Answering reason over what, where, and when across entire tours, instead of being constrained by a short context window of egocentric frames.
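As a rough mental model, this memory can be sketched as a feature grid plus a stack of per-chunk “observed” masks. The numpy sketch below is illustrative only: the shapes (`H`, `W`, `F`, `T`) are made up, and the simple overwrite update stands in for the paper's GRU-based accumulation.

```python
import numpy as np

# Hypothetical sketch of the allocentric spatiotemporal scene memory.
H, W = 250, 250        # top-down map cells (each cell covers a small floor patch)
F, T = 256, 20         # semantic feature dim; number of temporal chunks of the tour

semantic_map = np.zeros((H, W, F), dtype=np.float32)   # "what" and "where"
observed_when = np.zeros((H, W, T), dtype=bool)        # "when": per-chunk visibility

def record_observation(cell_rc, features, t_chunk):
    """Write features into a map cell and log the tour chunk it was seen in."""
    r, c = cell_rc
    semantic_map[r, c] = features          # the real model uses a GRU update, not overwrite
    observed_when[r, c, t_chunk] = True

record_observation((120, 80), np.random.rand(F).astype(np.float32), t_chunk=3)

# Concatenating features with temporal masks yields the spatiotemporal memory.
memory = np.concatenate([semantic_map, observed_when.astype(np.float32)], axis=-1)
print(memory.shape)  # (250, 250, 276)
```

The key design point is that time becomes just another set of map channels, so a downstream grounding model can attend to “what”, “where”, and “when” in a single pass.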

DIAGRAM

Question-time inference flow for Episodic Memory Question Answering

This diagram shows how Episodic Memory Question Answering processes a stored tour and a new question to produce a localized answer.

DIAGRAM

EMQA dataset and training pipeline

This diagram shows how Episodic Memory Question Answering constructs EMQA data and trains RedNet, SMNet, and LingUNet in stages.

PROCESS

How Episodic Memory Question Answering Handles an EMQA episode

  1.

    Guided exploration tours

    Episodic Memory Question Answering takes a pre-recorded guided tour of roughly 2,500 steps of RGB-D frames with oracle pose as input.

  2.

    Scene memory representation

    Episodic Memory Question Answering uses RedNet features, projects them to a 2D floorplan, and accumulates them with a GRU into an allocentric semantic map.

  3.

    Spatiotemporal memory

    Episodic Memory Question Answering stacks per-step observed masks channel-wise so each map cell encodes when it was seen during the tour.

  4.

    Question answering with LingUNet

    Episodic Memory Question Answering encodes the question with an LSTM, conditions LingUNet on this embedding and the spatiotemporal map, and outputs an answer heatmap.
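The final step above can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: the real model encodes the question with an LSTM and runs a multi-scale LingUNet, whereas here a single language-derived 1×1 filter scores each map cell, and all names and shapes are assumptions.

```python
import numpy as np

# Toy sketch of LingUNet-style question conditioning over the scene memory.
rng = np.random.default_rng(0)
H, W, F = 64, 64, 32

map_feats = rng.standard_normal((H, W, F)).astype(np.float32)  # scene memory features
q_embed = rng.standard_normal(F).astype(np.float32)            # stand-in for the LSTM question encoding

# LingUNet derives conv filters from the language embedding and applies them to
# the visual features, so the map is filtered "through" the question.
lang_filter = q_embed / np.linalg.norm(q_embed)
heatmap_logits = map_feats @ lang_filter        # (H, W) language-conditioned scores

# Softmax over all cells -> answer heatmap; the argmax cell is the prediction.
probs = np.exp(heatmap_logits - heatmap_logits.max())
heatmap = probs / probs.sum()
r, c = np.unravel_index(heatmap.argmax(), heatmap.shape)
print(heatmap.shape)
```

Conditioning the filters on language (rather than concatenating the question once) is what lets the same map answer very different queries about the same tour.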

KEY CONTRIBUTIONS

Key Contributions

  •

    Episodic Memory Question Answering task

    Episodic Memory Question Answering defines EMQA, where an egocentric assistant receives a tour and must localize answers to spatial and spatio-temporal questions on the tour or the floorplan.

  •

    Spatiotemporal allocentric scene memory

    Episodic Memory Question Answering extends SMNet-style top-down semantic maps with temporal channels, encoding when each 2cm × 2cm cell was observed during the tour.

  •

    Robust EMQA model and analysis

    Episodic Memory Question Answering combines this spatiotemporal memory with LingUNet, surpassing SMNetDecoder by 2.19 IoU and 18.41 recall, and remains robust under noisy pose and real-world RGB-D inputs.

RESULTS

By the Numbers

IoU

29.11

+2.19 over SMNetDecoder

Recall

62.27

+18.41 over SMNetDecoder

Precision

33.39

-7.56 vs EgoSemSeg

IoU (egocentric pixels)

29.78

+2.65 over SMNetDecoder (egocentric)

These numbers are on the EMQA benchmark built from Matterport3D tours, measuring localization quality on top-down maps and egocentric pixels. The main result shows Episodic Memory Question Answering improves IoU from 26.92 to 29.11 and recall from 43.86 to 62.27 over SMNetDecoder while remaining robust across output spaces.


BENCHMARK

EMQA top-down IoU on the test split

Intersection-over-Union for answer localization in the top-down map output space.
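For concreteness, the IoU, recall, and precision figures above follow the standard definitions over binary answer masks. The helper below is a generic sketch of those definitions; how the paper thresholds predicted heatmaps into masks is an assumption not shown here.

```python
import numpy as np

def iou_recall_precision(pred, gt):
    """Standard localization metrics for binary answer masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 0.0
    recall = inter / gt.sum() if gt.sum() else 0.0
    precision = inter / pred.sum() if pred.sum() else 0.0
    return float(iou), float(recall), float(precision)

# Two 4x4 squares overlapping in a 2x2 corner (4 cells of overlap).
pred = np.zeros((10, 10), dtype=bool); pred[2:6, 2:6] = True   # 16 predicted cells
gt = np.zeros((10, 10), dtype=bool);   gt[4:8, 4:8] = True     # 16 ground-truth cells
print(iou_recall_precision(pred, gt))  # overlap = 4 -> IoU 4/28, recall 4/16, precision 4/16
```

Note how recall can be much higher than IoU when predictions are generous, which matches the result pattern above (62.27 recall vs 29.11 IoU).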

KEY INSIGHT

The Counterintuitive Finding

Episodic Memory Question Answering with temporal features boosts recall from 60.81 to 62.27, yet IoU for purely spatial questions barely changes.

This is surprising because one might expect temporal channels to dilute spatial localization, but Episodic Memory Question Answering preserves spatial quality while sharply improving spatio-temporal reasoning.

WHY IT MATTERS

What this unlocks for the field

Episodic Memory Question Answering enables egocentric assistants to answer “where first” and “where last” questions by grounding them in persistent spatiotemporal maps.

Builders can now design AR agents that remember entire home tours and localize objects on floorplans, instead of relying on short-term egocentric buffers or language-only biases.


