Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

Authors: Shijia Xu, Zhou Wu, Xiaolong Jia et al.

2026

TL;DR

Self-Correcting RAG uses an MMKP-based Context Selector plus NLI-guided MCTS to reach average EM 37.1 and F1 45.8 across six QA benchmarks.



THE PROBLEM

RAG wastes context and hallucinates on complex reasoning

Self-Correcting RAG targets low context utilization and frequent hallucinations in complex reasoning, where greedy top-k selection ignores redundancy and information density.

On multi-hop QA and fact verification, standard RAG pipelines miss key evidence, producing hallucinated answers that are not supported by the retrieved documents.

HOW IT WORKS

Self-Correcting RAG via MMKP and NLI-Guided MCTS

Self-Correcting RAG introduces an MMKP-based Context Selector, an NLI-Guided MCTS Generator, and a Self-Correcting RAG Optimizer to jointly optimize retrieval and inference under multidimensional constraints.

You can think of Self-Correcting RAG as smart RAM plus a planner: MMKP packs the most informative chunks into limited memory, while MCTS explores reasoning paths like a search tree.

This design lets Self-Correcting RAG backtrack, verify entailment with NLI, and choose globally better reasoning trajectories than a plain context window with greedy decoding.
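To make the search side concrete, here is a minimal, self-contained MCTS skeleton in Python. It is a sketch under assumptions, not the paper's implementation: `expand`, `rollout`, and `reward` are hypothetical callables standing in for the action proposer (the generate/augment actions), the answer-completion step, and the NLI-based reward described in the process section below.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # partial reasoning trajectory s_t
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(node, c=1.4):
    # Balance exploring new reasoning branches against exploiting
    # branches with high NLI-shaped reward.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def nli_guided_mcts(root_state, expand, rollout, reward, n_iters=64):
    """Illustrative inference-time search over reasoning states.
    expand(state) proposes next actions, rollout(state) completes a
    candidate answer, reward(answer) scores it with an NLI-based reward;
    all three are hypothetical callables, not the authors' interfaces."""
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # Selection: descend by UCT until a leaf is reached.
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: add children for each proposed next reasoning action.
        for nxt in expand(node.state):
            node.children.append(Node(nxt, parent=node))
        if node.children:
            node = random.choice(node.children)
        # Simulation: complete an answer, then score it against the context.
        score = reward(rollout(node.state))
        # Backpropagation: contradiction-penalized branches accumulate low
        # value and are effectively pruned by later selections.
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.state
```

The key departure from greedy decoding is that backpropagated NLI scores let the search revisit earlier states and abandon branches whose intermediate answers contradict the retrieved evidence.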

DIAGRAM

Inference Flow of Self-Correcting RAG

This diagram shows how Self-Correcting RAG processes a query through MMKP context selection and NLI-guided MCTS reasoning at test time.

DIAGRAM

Evaluation Pipeline and Ablation Design

This diagram shows how Self-Correcting RAG is evaluated across datasets and ablations for MMKP and MCTS components.

PROCESS

How Self-Correcting RAG Handles a Knowledge-Intensive Query

  1. Phase I: Optimal Context Selection via MMKP

    Self-Correcting RAG clusters retrieved chunks into semantic groups G_i and uses the MMKP-based Context Selector to pick one representative per group under token and redundancy budgets.

  2. The MMKP Optimization Objective

    Self-Correcting RAG maximizes the total utility Z(x) over the per-document utilities v_ij under a multidimensional capacity vector C, enforcing at most one document per group to reduce redundancy (see the sketch after this list).

  3. Phase II: Inference-Time Reasoning via NLI-Guided MCTS

    Self-Correcting RAG models reasoning as an MDP, where the NLI-Guided MCTS Generator explores actions a_gen and a_aug over states s_t built on the optimized context.

  4. The NLI Reward Function

    Self-Correcting RAG computes R(y, D_ctx) by aggregating NLI judgments Θ_NLI(e, u_l) over the answer's sentences, heavily penalizing contradictions with a weight w_con ≪ 0 to prune hallucinated branches (a minimal sketch follows this list).
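The two objectives above can be illustrated with a short Python sketch. This is illustrative only and not the authors' code: the group structure, the utility and cost values, the capacity vector, and the `nli_label` helper (any NLI classifier returning "entailment", "neutral", or "contradiction") are hypothetical stand-ins.

```python
from itertools import product

def mmkp_select(groups, utility, costs, capacity):
    """Pick at most one chunk per semantic group G_i, maximizing total
    utility Z(x) under a multidimensional budget C (e.g., tokens, redundancy).
    Brute force over small instances for clarity; the paper's selector solves
    the same constrained problem at scale."""
    best_z, best_pick = float("-inf"), []
    for choice in product(*[[None] + list(g) for g in groups]):
        picked = [doc for doc in choice if doc is not None]
        # Sum each cost dimension and check it against the budget C.
        used = [sum(costs[d][k] for d in picked) for k in range(len(capacity))]
        if any(u > c for u, c in zip(used, capacity)):
            continue  # violates a capacity dimension
        z = sum(utility[d] for d in picked)
        if z > best_z:
            best_z, best_pick = z, picked
    return best_pick, best_z

# w_con << 0: contradictions are penalized far more than neutral sentences.
W = {"entailment": 1.0, "neutral": 0.0, "contradiction": -5.0}

def nli_reward(answer_sentences, context, nli_label):
    """R(y, D_ctx): average NLI judgment Θ_NLI over the answer's sentences u_l,
    each checked against the selected context. nli_label(premise, hypothesis)
    is a hypothetical helper wrapping any NLI model."""
    if not answer_sentences:
        return 0.0
    scores = [W[nli_label(context, u)] for u in answer_sentences]
    return sum(scores) / len(scores)
```

Weighting contradictions far more heavily than entailment is rewarded is what lets the search prune hallucinated branches early, even when most sentences in a candidate answer are supported.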

KEY CONTRIBUTIONS

Key Contributions

  • MMKP-based Context Selector

    Self-Correcting RAG introduces the MMKP-based Context Selector, which maximizes information density under token and redundancy budgets, improving average Recall@5 to 72.0%, up from 63.3% for RAG + MMR.

  • NLI-Guided MCTS Generator

    Self-Correcting RAG develops the NLI-Guided MCTS Generator, which uses Θ_NLI and a contradiction penalty w_con ≪ 0 to guide search toward faithful reasoning paths.

  • Unified Self-Correcting RAG Framework

    Self-Correcting RAG unifies MMKP context optimization and NLI-guided MCTS, achieving average EM 37.1 and F1 45.8 across six datasets and surpassing CRAG by +2.8 EM and +2.5 F1.

RESULTS

By the Numbers

Average EM: 37.1 (+2.8 over CRAG)

Average F1: 45.8 (+2.5 over CRAG)

Average Recall@5: 72.0 (+3.3 over CRAG)

Attribution Precision (AP): 0.85 (+0.27 over Standard RAG)

On six QA benchmarks (NQ, PopQA, MuSiQue, 2Wiki, HotpotQA, MultiHop-RAG), Self-Correcting RAG improves both answer accuracy and retrieval quality, demonstrating that MMKP plus NLI-guided MCTS yields more faithful, evidence-grounded reasoning.

BENCHMARK

Average Performance Across All Datasets

Average Exact Match (EM) across six QA benchmarks.

BENCHMARK

Ablation Study on Self-Correcting RAG Components

Average Attribution Precision (AP) across all datasets for different ablation settings.

KEY INSIGHT

The Counterintuitive Finding

The MCTS-only variant of Self-Correcting RAG boosts Attribution Precision to 0.82, even though Recall@5 remains essentially unchanged at 50.1 versus 49.6.

This is surprising because many assume better retrieval is required for faithfulness, but Self-Correcting RAG shows that NLI-guided reasoning alone can sharply reduce contradictions.

WHY IT MATTERS

What this unlocks for the field

Self-Correcting RAG unlocks retrieval that is both token-budget aware and redundancy-aware, while enforcing NLI-based faithfulness during reasoning.

Builders can now deploy RAG systems that trade extra test-time compute for verifiable, evidence-grounded answers on multi-hop and noisy multi-document tasks.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
