Memory-enhanced Retrieval Augmentation for Long Video Understanding

Authors: Huaying Yuan, Zheng Liu, Minghao Qin et al.

2025

TL;DR

MemVid uses a reasoning-oriented memory module with curriculum learning to guide retrieval, reaching 65.7% average accuracy on VideoMME (with subtitles) vs 61.0% for Video-XL (+4.7 points).


THE PROBLEM

Long-video LVLMs still lose information under brute-force downsampling

MemVid targets long-video understanding where LVLMs still suffer from information loss due to compression and brute-force downsampling on hour-long videos.

When only sparse frames are processed, even strong long-context VLMs miss dispersed evidence, causing incorrect answers on benchmarks like VideoMME and LVBench.

HOW IT WORKS

MemVid — Memory-enhanced retrieval augmentation for long videos

MemVid combines a memory model (the memorizer), a retriever, and a generator: the memorizer builds a reasoning-oriented KV-cache memory of the full video and generates retrieval clues from it.

You can think of MemVid as a human who watches a movie once to store it in long-term memory, then consults that memory to decide which scenes to rewatch.

This design lets MemVid reason over a holistic memory, generate task-oriented clues, and retrieve precise evidentiary moments that a plain context window or naive RAG cannot expose.
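
To make "task-oriented retrieval clues" concrete, here is a hypothetical illustration (the question, clues, and draft answer are invented, not from the paper) of what the memorizer's output might look like for a single query:

    # Hypothetical illustration of clue generation; all strings are invented.
    question = "Why does the detective return to the warehouse at night?"

    # After one pass over the video, the memorizer reasons over its holistic
    # memory and emits short, searchable descriptions of the moments likely
    # to contain the answer, plus a draft answer to be verified later.
    clues = [
        "detective finds a torn photograph in the warehouse office",
        "phone call warning the detective about missing evidence",
        "detective parking outside the warehouse after dark",
    ]
    draft_answer = "He suspects the evidence was tampered with and returns to check."

Each clue acts as an independent query for the retriever, so evidence dispersed across distant parts of the video can still be recovered.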

DIAGRAM

Four-step MemVid inference flow for long-video QA

This diagram shows how MemVid runs the Memorizing, Reasoning, Retrieving, and Focusing steps to answer a question about a long video.

DIAGRAM

MemVid training pipeline with SFT and DPO

This diagram shows how MemVid uses a supervised fine-tuning warmup followed by DPO-based reinforcement learning with generation feedback to optimize the memory model.

PROCESS

How MemVid Handles a Long-video Question Answering Session

  1. Memorizing

    MemVid uses the memory model with a pretrained visual encoder to scan uniformly sampled frames and build a holistic KV-cache memory over the video.

  2. Reasoning

    Given the question, MemVid feeds it, together with the cached memory, into the memorizer to generate multiple task-oriented retrieval clues and a draft answer.

  3. Retrieving

    MemVid sends each clue to the retriever, which ranks fixed-duration moments from the video database using LanguageBind-Large embeddings and cosine similarity.

  4. Focusing

    MemVid temporally reorders the retrieved moments, samples their frames plus some global context, and passes them with the question to the generator to produce the final answer; the code sketch after this list walks through all four steps.
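
The four steps condense into a short sketch. This is a minimal illustration, assuming hypothetical wrappers: memorizer, embed_text, moment_bank, and generator stand in for MemVid's actual components and the LanguageBind-Large encoder, and taking the max similarity across clues is one plausible aggregation, not necessarily the paper's.

    # Minimal sketch of MemVid's inference flow (hypothetical interfaces).
    import numpy as np

    def memvid_answer(video_frames, question, memorizer, embed_text,
                      moment_bank, generator, top_k=5):
        # 1) Memorizing: scan uniformly sampled frames once and cache a
        #    holistic KV memory of the whole video.
        step = max(1, len(video_frames) // 256)
        sampled = video_frames[::step]
        memory = memorizer.build_memory(sampled)

        # 2) Reasoning: condition on the question and the cached memory to
        #    emit task-oriented retrieval clues (and a draft answer).
        clues, _draft = memorizer.generate_clues(question, memory)

        # 3) Retrieving: rank fixed-duration moments by cosine similarity
        #    between clue embeddings and precomputed moment embeddings
        #    (all embeddings assumed L2-normalized).
        moment_embs = moment_bank["embeddings"]        # (num_moments, d)
        scores = np.zeros(len(moment_embs))
        for clue in clues:
            scores = np.maximum(scores, moment_embs @ embed_text(clue))
        top = np.argsort(-scores)[:top_k]

        # 4) Focusing: restore temporal order, add sparse global context,
        #    and let the generator answer from the focused frames.
        top = sorted(top, key=lambda i: moment_bank["start_times"][i])
        frames = [f for i in top for f in moment_bank["frames"][i]]
        frames += sampled[::64]                        # coarse global context
        return generator.answer(question, frames)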

KEY CONTRIBUTIONS

Key Contributions

  • MemVid memory-enhanced RAG framework

    MemVid introduces a four-step memorizing, reasoning, retrieving, and focusing pipeline where a memory model and memorizer create holistic KV-cache memory for long videos.

  • Curriculum learning for the memorizer

    MemVid first runs supervised fine-tuning on 10,000 synthetic clues distilled from a 72B VLM, then applies DPO-based reinforcement learning with generation feedback to refine the retrieval clues (see the DPO sketch after this list).

  • State-of-the-art long-video understanding

    MemVid achieves 62.5 M-avg on MLVU, 65.7 Avg on VideoMME with subtitles, and 44.4 overall on LVBench, surpassing simple RAG baselines and long-context VLMs at the same 7B scale.
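
To ground contribution 02, here is a minimal sketch of the DPO objective as it would apply to clue generation, assuming PyTorch and invented tensor names: preference pairs come from generation feedback, i.e., a clue whose retrieved evidence lets the generator answer correctly is "chosen" over one that leads to a wrong answer.

    # Standard DPO loss over clue preference pairs (assumes PyTorch).
    # Inputs are per-example sequence log-probs of the clue tokens under the
    # trainable policy and a frozen reference (the SFT-warmed memorizer).
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_logp_chosen, policy_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        chosen_margin = policy_logp_chosen - ref_logp_chosen
        rejected_margin = policy_logp_rejected - ref_logp_rejected
        # Push the policy to prefer clues that yielded correct answers.
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Toy usage with made-up sequence log-probabilities:
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                    torch.tensor([-13.0]), torch.tensor([-14.9]))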

RESULTS

By the Numbers

VideoMME (w/ subtitle) Avg

65.7%

+4.7 over Video-XL (61.0%)

VideoMME (w/o subtitle) Avg

63.7%

+5.6 over Video-XL (58.1%)

MLVU M-avg

62.5

+5.5 over Qwen2VL-7B (57.0)

LVBench Overall

44.4

+7.2 over Qwen2VL-7B (37.2)

On MLVU, VideoMME, and LVBench, which test diverse long-video QA and reasoning, MemVid consistently beats strong 7B baselines like Qwen2VL and Video-XL by 4.7–7.2 points.


BENCHMARK

Experimental results on VideoMME (w/ subtitle) Avg for 7B models

Average accuracy on VideoMME with subtitles for 7B-scale models.

BENCHMARK

Experimental results on LVBench Overall (7B models)

Overall score on LVBench for representative 7B or similar-scale models.

KEY INSIGHT

The Counterintuitive Finding

MemVid with only 64 frames achieves 53.5% on VideoMME (long, w/o subtitle), beating Video-XL (45.6%), which uses 1024 frames.

This is surprising because many assume simply feeding more frames to a long-context VLM should always help, yet MemVid shows smarter retrieval beats raw context length.

WHY IT MATTERS

What this unlocks for the field

MemVid unlocks long-video QA where a compact, reasoning-guided memory can drive retrieval, making hour-long videos tractable for standard 7B VLMs.

Builders can bolt MemVid onto existing VLMs for large gains on long-video understanding benchmarks without retraining giant long-context backbones or streaming thousands of frames.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa · 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang · 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
