Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

Authors: Moonsu Han, Minki Kang, Hyunwoo Jung, Sung Ju Hwang

arXiv 2019

TL;DR

Episodic Memory Reader uses an RL-trained memory scheduler over streaming context to retain key entries, reaching 57.57 F1 on TriviaQA vs 50.10 for LIFO (+7.47).


THE PROBLEM

Streaming QA breaks when context dwarfs memory

TriviaQA documents average about 3K sentences each, far exceeding typical neural QA memory limits and forcing truncation or heuristic selection.

Without learned memory scheduling, span-prediction QA with naive truncation loses crucial evidence on long documents and video streams, degrading answer accuracy on realistic tasks.

HOW IT WORKS

Episodic Memory Reader — RL-based memory scheduling for streaming QA

Episodic Memory Reader uses a Data Encoder, Memory Encoder, Value Network, external memory, and QA solver to learn which streaming items to retain under tight memory.

You can think of Episodic Memory Reader as an OS page replacement policy for QA, learning which "pages" of context to keep in fast RAM-like external memory.

By learning replacement decisions over the whole memory, Episodic Memory Reader preserves globally important evidence that a fixed context window or simple LRU-style policy would discard.
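The page-replacement analogy can be made concrete. Below is a minimal Python sketch, not the paper's implementation, contrasting a rule-based LRU-style victim choice with a learned one; the `score_fn` callable is a hypothetical stand-in for the trained Memory Encoder and policy:

```python
from collections import OrderedDict

def lru_victim(memory: OrderedDict):
    """Rule-based baseline: evict the least-recently-used entry."""
    return next(iter(memory))  # first key = least recently used

def learned_victim(memory: OrderedDict, score_fn):
    """EMR-style: score every entry's usefulness for a *future* question
    and evict the entry the policy judges least worth keeping."""
    return min(memory, key=lambda k: score_fn(memory[k]))
```

The key difference is that `score_fn` can look at an entry's content, not just its age, which is what lets a learned policy keep globally important evidence that LRU or a fixed window would drop.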

DIAGRAM

Streaming interaction: Episodic Memory Reader managing memory over time

This diagram shows how Episodic Memory Reader interacts with streaming inputs, memory, and the QA solver from first context token to final answer.

DIAGRAM

Training pipeline for Episodic Memory Reader with RL

This diagram shows how Episodic Memory Reader is trained with A3C or REINFORCE on bAbI, TriviaQA, and TVQA.

PROCESS

How Episodic Memory Reader Handles a Streaming QA Episode

  1. Data Encoder

    Episodic Memory Reader uses the Data Encoder ψ(x(t)) to transform each incoming sentence, frame, or subtitle into a k-dimensional memory vector e(t).

  2. Memory Encoder

    Episodic Memory Reader applies EMR-Independent, EMR-biGRU, or EMR-Transformer as the Memory Encoder to score all memory entries plus e(t) for replacement.

  3. Value Network

    Episodic Memory Reader aggregates the encoded memory with Deep Sets and a GRU-based Value Network to estimate future reward, providing the baseline for A3C policy updates.

  4. QA Solver

    After the stream ends and the question arrives, Episodic Memory Reader passes the retained memory to a QA solver such as MemN2N, BERT, or the TVQA multi-stream model.
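The four steps above can be sketched as one episode loop. This is a minimal Python sketch under stated assumptions: `encode`, `score_memory`, and `solve` are stand-ins for the paper's learned Data Encoder, Memory Encoder plus policy, and QA solver, respectively.

```python
def run_episode(stream, question, *, slots, encode, score_memory, solve):
    """Sketch of one EMR streaming-QA episode.

    encode(x)            -> fixed-size representation e_t   (Data Encoder)
    score_memory(memory) -> one deletion score per entry    (Memory Encoder + policy)
    solve(memory, q)     -> answer                          (QA solver)
    All three callables are hypothetical stand-ins for learned components.
    """
    memory = []
    for x in stream:
        memory.append(encode(x))              # each new entry enters memory
        if len(memory) > slots:               # over capacity: choose one to delete
            scores = score_memory(memory)     # contextual score for every entry
            del memory[max(range(len(memory)), key=scores.__getitem__)]
    return solve(memory, question)            # the question arrives only afterwards
```

Note the ordering that defines the streaming setting: the policy must decide what to delete at every step without yet knowing the question.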

KEY CONTRIBUTIONS

Key Contributions

  • Learning what to remember from streaming data

    Episodic Memory Reader formulates memory scheduling as RL over an external memory with a Memory Encoder and Value Network, under T ≫ N streaming constraints.

  • Episodic Memory Reader architecture

    Episodic Memory Reader introduces EMR-Independent, EMR-biGRU, and EMR-Transformer Memory Encoders plus a Data Encoder and Value Network to learn replacement policies end-to-end.

  • Empirical gains on bAbI, TriviaQA, and TVQA

    Episodic Memory Reader reaches 52.20 ExactMatch and 57.57 F1 on TriviaQA and about 65% accuracy on TVQA with 60 memory entries, improving over rule-based baselines.
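To make the RL formulation concrete, here is one REINFORCE-style update on a categorical eviction policy, written as a minimal pure-Python sketch; the paper also trains with A3C, which adds a learned value baseline that this sketch omits. The learning-rate value and uniform initial logits are illustrative assumptions.

```python
import math

def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step on a softmax policy over memory entries (sketch).

    logits: raw scores over candidate entries to delete
    action: index of the entry that was actually deleted
    reward: downstream QA quality (e.g. F1) observed after the episode
    """
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # d/d(logit_i) of log pi(action) is (1[i == action] - probs[i]),
    # so REINFORCE ascends reward * that gradient.
    return [l + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
            for i, l in enumerate(logits)]
```

After a rewarding episode, the logit of the eviction that was taken rises relative to the alternatives, which is the mechanism by which a scalar QA reward shapes per-step deletion decisions.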

RESULTS

By the Numbers

ExactMatch

52.20

+5.97 over LIFO

F1

57.57

+7.47 over LIFO

TVQA accuracy (60 entries)

65.0

≈+10 points over LIFO (Figure 9)

bAbI memory size

5

Episodic Memory Reader retains supporting facts with only 5 slots

On TriviaQA Wikipedia, which has long documents averaging about 3K sentences, Episodic Memory Reader with EMR-biGRU achieves 52.20 ExactMatch and 57.57 F1 versus 46.23 ExactMatch and 50.10 F1 for LIFO. On TVQA, Episodic Memory Reader scales to 60 memory entries and maintains about 65% accuracy, showing that RL-based scheduling can handle large video streams under tight memory.


BENCHMARK

TriviaQA Wikipedia: ExactMatch and F1 with 400-word memory

F1 on TriviaQA Wikipedia human-verified subset with BERT-based solver and 20 memory cells (400 words).

KEY INSIGHT

The Counterintuitive Finding

On TriviaQA, the simple LIFO policy reaches 50.10 F1, beating FIFO at 27.22 F1 despite discarding most later context.

This is surprising because naive recency-based deletion seems crude, but dataset bias places many answers early in documents, making LIFO accidentally strong until Episodic Memory Reader surpasses it.
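A toy sketch (illustrative data only, not the paper's setup) shows why this bias favors LIFO: when the answer evidence sits at the head of the document, FIFO eviction discards it while LIFO eviction freezes it in memory.

```python
def fifo(stream, slots):
    """FIFO eviction: on overflow, delete the oldest entry (keeps the tail)."""
    mem = []
    for x in stream:
        mem.append(x)
        if len(mem) > slots:
            mem.pop(0)          # earliest entry is discarded first
    return mem

def lifo(stream, slots):
    """LIFO eviction: on overflow, delete the newest entry (freezes the head)."""
    mem = []
    for x in stream:
        mem.append(x)
        if len(mem) > slots:
            mem.pop()           # the just-arrived entry is discarded
    return mem

# Toy document whose answer evidence sits near the start (TriviaQA-like bias):
doc = ["EVIDENCE"] + [f"filler-{i}" for i in range(100)]
print("EVIDENCE" in fifo(doc, slots=5))  # False: only the recent tail survives
print("EVIDENCE" in lifo(doc, slots=5))  # True: the head is frozen in memory
```

Neither policy looks at content; Episodic Memory Reader's advantage is precisely that its learned scores can keep early evidence without freezing the entire head of the stream.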

WHY IT MATTERS

What this unlocks for the field

Episodic Memory Reader enables QA systems to read arbitrarily long text or video streams and still answer questions given only after the stream ends.

Builders can now design agents that watch full movies, long conversations, or multi-day logs while learning which moments to keep in a tiny external memory for future queries.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
