MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

Author: Andreas Ottem

2025

TL;DR

MeVe uses modular memory verification with cross-encoder filtering and fallback retrieval to cut context size by 57.7% on Wikipedia and 75% on HotpotQA versus Standard RAG.



THE PROBLEM

Context pollution in top-k RAG wastes 57.7% of the context budget

Standard top-k RAG often injects irrelevant or redundant passages, polluting the context and degrading both answer quality and efficiency in LLMs.

On a Wikipedia subset, Standard RAG consumes 188.8 context tokens per query versus 79.8 for MeVe, wasting budget and increasing hallucination risk.

HOW IT WORKS

MeVe — Modular Memory Verification and Context Control

MeVe’s core mechanism chains Initial Retrieval, Relevance Verification, Fallback Retrieval, Context Prioritization, and Token Budgeting into a five-phase, auditable pipeline.

You can think of MeVe like a CPU with RAM and cache: fast Initial Retrieval fills a buffer, Relevance Verification acts as a gate, and Token Budgeting is the strict cache size.

This modular verification and prioritization lets MeVe curate a compact, high-utility context that a plain context window or monolithic RAG pipeline cannot manage or inspect.
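The analogy maps onto a few lines of code. Below is a toy, dependency-free sketch (all names, scores, and numbers are ours, not the paper's): a pre-scored buffer stands in for Initial Retrieval, a gate function plays Relevance Verification, and a greedy budget function enforces the cache-like token limit.

```python
# Toy illustration of the CPU/RAM/cache analogy: a fast retriever fills a
# buffer, a verification gate filters it, and a strict token budget caps it.
# All values here are illustrative stand-ins, not MeVe's actual outputs.

def gate(candidates, threshold):
    """Keep only candidates whose score clears the threshold (the 'gate')."""
    return [(text, score) for text, score in candidates if score >= threshold]

def budget(candidates, max_tokens):
    """Pack highest-scoring candidates first until the budget is spent."""
    packed, used = [], 0
    for text, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        cost = len(text.split())          # crude token count for the sketch
        if used + cost <= max_tokens:
            packed.append(text)
            used += cost
    return packed

buffer = [("Paris is the capital of France.", 0.92),
          ("Bananas are yellow.", 0.11),
          ("France is in Europe.", 0.67)]
print(budget(gate(buffer, threshold=0.5), max_tokens=12))
```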

DIAGRAM

Query Time Memory Verification Flow

This diagram shows how MeVe processes a single user query through retrieval, verification, fallback, prioritization, and token budgeting before sending context to the LLM.

DIAGRAM

Evaluation and Ablation Pipeline for MeVe

This diagram shows how MeVe is evaluated across Wikipedia and HotpotQA with different configurations and ablations.

PROCESS

How MeVe Handles a Query Lifecycle

  1. Phase 1: Initial Retrieval

    MeVe encodes the user query and runs kNN Initial Retrieval over the vector store to build C_init from high-recall candidates.

  2. Phase 2: Relevance Verification

    MeVe applies Relevance Verification with a cross-encoder, scoring each candidate and discarding those below threshold τ to form C_ver.

  3. Phase 3: Fallback Retrieval

    If |C_ver| falls below N_min, MeVe triggers Fallback Retrieval with BM25Okapi to add C_fallback and avoid empty or weak contexts.

  4. Phase 4: Context Prioritization

    MeVe sorts C_all (verified plus fallback chunks) by relevance score and removes redundant chunks.

  5. Phase 5: Token Budgeting

    MeVe packs the highest-priority chunks into C_final under the budget T_max; a runnable sketch of the full lifecycle follows this list.
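To make the lifecycle concrete, here is a minimal end-to-end sketch under stated assumptions: the ms-marco-MiniLM-L-6-v2 cross-encoder and BM25Okapi fallback follow the paper's description, while the bi-encoder choice (all-MiniLM-L6-v2), the toy corpus, the sigmoid normalization of verifier scores, and the whitespace token count are our illustrative stand-ins.

```python
# Hedged end-to-end sketch of MeVe's query lifecycle, not the reference
# implementation. Requires: sentence-transformers, rank_bm25, numpy.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from rank_bm25 import BM25Okapi

corpus = ["Paris is the capital of France.",
          "The Eiffel Tower is in Paris.",
          "Bananas are a good source of potassium."]

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed bi-encoder
verifier = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def meve_query(query, tau=0.5, n_min=2, t_max=64, k=10):
    # Phase 1: high-recall kNN retrieval over the embedded corpus.
    corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    c_init = [corpus[h["corpus_id"]] for h in hits]

    # Phase 2: cross-encoder verification against threshold tau. Mapping the
    # raw logits through a sigmoid before thresholding is our assumption.
    raw = verifier.predict([(query, doc) for doc in c_init])
    scores = 1 / (1 + np.exp(-raw))
    c_ver = [(doc, s) for doc, s in zip(c_init, scores) if s >= tau]

    # Phase 3: BM25 fallback if verification leaves too few chunks.
    if len(c_ver) < n_min:
        kept = {doc for doc, _ in c_ver}
        for doc in bm25.get_top_n(query.lower().split(), corpus, n=n_min):
            if doc not in kept:
                c_ver.append((doc, 0.0))   # fallback chunks ranked last

    # Phases 4-5: prioritize by score, then pack under the token budget.
    packed, used = [], 0
    for doc, _ in sorted(c_ver, key=lambda p: p[1], reverse=True):
        cost = len(doc.split())            # crude token proxy for the sketch
        if used + cost <= t_max:
            packed.append(doc)
            used += cost
    return packed

print(meve_query("What is the capital of France?"))
```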

KEY CONTRIBUTIONS

Key Contributions

  • Five-Phase MeVe Architecture

    MeVe introduces a five-phase pipeline of Initial Retrieval, Relevance Verification, Fallback Retrieval, Context Prioritization, and Token Budgeting, making retrieval auditable and tunable.

  • Memory Verification with a Cross-Encoder

    MeVe performs Relevance Verification with the ms-marco-MiniLM-L-6-v2 cross-encoder at τ = 0.5, aggressively filtering irrelevant context before composition (sketched below).

  • Context Efficiency on Wikipedia and HotpotQA

    MeVe achieves a 57.7% context reduction on the Wikipedia subset and roughly 75% on HotpotQA compared to Standard RAG, while keeping retrieval time near 1.22 seconds.
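As a focused illustration of the verification contribution, the fragment below scores candidates with the named cross-encoder and applies the τ = 0.5 gate in isolation; normalizing the raw logits with a sigmoid before thresholding is our assumption, as the paper's exact score normalization is not shown here.

```python
# Minimal sketch of the verification gate alone. The model name follows the
# paper; the query, candidates, and sigmoid normalization are illustrative.
import numpy as np
from sentence_transformers import CrossEncoder

verifier = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Who designed the Eiffel Tower?"
candidates = ["Gustave Eiffel's company designed and built the tower.",
              "The Louvre is the world's most-visited museum."]

raw = verifier.predict([(query, c) for c in candidates])
probs = 1 / (1 + np.exp(-raw))                  # map logits to (0, 1)
kept = [c for c, p in zip(candidates, probs) if p >= 0.5]
print(kept)  # expected: only the Eiffel Tower sentence survives the gate
```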

RESULTS

By the Numbers

  • Context Eff. (Tokens): 79.8 tokens with MeVe (−109.0 tokens vs Standard RAG on the Wikipedia subset)

  • Retrieval Time: 1.22 s (+0.10 s vs Standard RAG)

  • HotpotQA Context Tokens: 78.5 tokens (−230.1 tokens vs Standard RAG on HotpotQA)

  • Standard RAG Context: 188.8 tokens (baseline context size on the Wikipedia subset)

On a Wikipedia subset and HotpotQA, which test knowledge-heavy QA and multi-hop reasoning, MeVe shows large context reductions versus Standard RAG while keeping retrieval latency comparable, demonstrating that modular verification and budgeting can control context size without large time penalties.
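A quick arithmetic check ties these numbers together; note that the Standard RAG HotpotQA baseline (308.6 tokens) is inferred here from the reported delta rather than stated directly above.

```python
# Sanity-check the headline reductions from the token counts reported above.
wiki_reduction = (188.8 - 79.8) / 188.8        # -> 0.577, i.e. 57.7%
hotpot_standard = 78.5 + 230.1                  # 308.6 tokens (inferred)
hotpot_reduction = 230.1 / hotpot_standard      # -> 0.746, i.e. ~75%
print(f"{wiki_reduction:.1%}, {hotpot_reduction:.1%}")  # 57.7%, 74.6%
```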


BENCHMARK

Quantitative results demonstrating MeVe’s context efficiency and retrieval time

Context Eff. (Tokens) for different retrieval modes on the Wikipedia subset.

KEY INSIGHT

The Counterintuitive Finding

MeVe cuts context size by about 57.7% on the Wikipedia subset, yet the context-grounding proxy still often labels answers as derived from irrelevant context.

This is surprising because many assume that tighter Relevance Verification and Context Prioritization automatically improve semantic grounding; MeVe shows that efficiency gains do not guarantee factual alignment.

WHY IT MATTERS

What this unlocks for the field

MeVe unlocks fine-grained, phase-level control over retrieval, verification, fallback, and budgeting, enabling explicit tuning of each memory stage.

Builders can now debug and adapt retrieval pipelines like software modules, swapping Fallback Retrieval strategies or Token Budgeting policies without rewriting monolithic RAG systems.
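As a hypothetical sketch of what that modularity could look like in code (none of these names come from the paper), fallback and budgeting become plain callables that can be swapped independently of the rest of the pipeline:

```python
# Illustrative phase-level swappability: exchange the BM25 fallback for a
# web-search fallback, or greedy budgeting for another policy, without
# touching the surrounding pipeline. All names here are our inventions.
from typing import Callable, List

FallbackFn = Callable[[str], List[str]]
BudgetFn = Callable[[List[str], int], List[str]]

def greedy_budget(chunks: List[str], t_max: int) -> List[str]:
    packed, used = [], 0
    for chunk in chunks:                        # chunks arrive pre-prioritized
        cost = len(chunk.split())
        if used + cost <= t_max:
            packed.append(chunk)
            used += cost
    return packed

def compose_context(query: str, verified: List[str],
                    fallback: FallbackFn, budget: BudgetFn,
                    n_min: int = 2, t_max: int = 64) -> List[str]:
    if len(verified) < n_min:                   # Phase 3 hook
        verified = verified + fallback(query)
    return budget(verified, t_max)              # Phase 5 hook

# Swap in a trivial fallback without changing compose_context:
print(compose_context("capital of France",
                      ["Paris is the capital of France."],
                      fallback=lambda q: ["France is in Western Europe."],
                      budget=greedy_budget))
```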


