SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

Authors: Jian Li, Yizhang Jin, Dongqi Liu et al.

2026

TL;DR

SE-Search uses Memory Purification, Atomic Query, and Dense Rewards to train a self-evolving search agent that reaches 0.420 average EM vs 0.312 for Search-R1-Base (+0.108).



THE PROBLEM

Search agents accumulate noisy documents and get only sparse rewards

Existing search agents often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals, which limits effective learning and reasoning.

On complex multi-hop question answering, this leads to redundant searches, overlong queries, and weak supervision, reducing answer accuracy and wasting retrieval calls and compute.

HOW IT WORKS

SE-Search with Memory Purification, Atomic Query, and Dense Rewards

SE-Search centers on Memory Purification, Atomic Query, and Dense Rewards, optimized with Group Relative Policy Optimization to refine trajectories of think, search, memorize, and answer actions.

You can think of SE-Search like a researcher with a scratchpad and a strict advisor, continually rewriting notes, asking focused questions, and getting graded on every move.

This design lets SE-Search selectively store key evidence, issue short diverse queries, and receive fine-grained feedback, going far beyond what a fixed context window or naive RAG pipeline can provide.
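The paper names Group Relative Policy Optimization (GRPO) as the optimizer. As a rough illustration of the group-relative idea, here is a minimal sketch of the advantage computation; the function name and simplifications are ours, not the paper's implementation:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: each of several rollouts sampled for the
    same question is scored against the group's own mean and std, so no
    learned value critic is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: dense (shaped) rewards for 4 rollouts of one question.
rewards = np.array([0.9, 0.4, 0.7, 0.2])
print(grpo_advantages(rewards))  # positive for above-average trajectories
```

Because advantages are normalized within each group of rollouts for the same question, trajectories only compete against their peers, which suits the dense per-action rewards SE-Search assigns.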

DIAGRAM

Think-Search-Memorize-Answer Interaction Flow

This diagram shows how SE-Search interleaves think, search, memorize, and answer actions while updating self-memory from retrieved documents.

DIAGRAM

Training and Ablation Pipeline for SE-Search

This diagram shows how SE-Search is trained on NQ and HotpotQA, then ablated by adding Memory Purification, Atomic Query, and Dense Rewards.

PROCESS

How SE-Search Handles a Think-Search-Memorize-Answer Session

  1. Think

     SE-Search uses the Think step to interpret the question and plan reasoning, preparing for Atomic Query generation and later Memory Purification.

  2. Search

     In the Search step, SE-Search issues multiple short Atomic Query calls to the retriever, gathering the top three documents per query from the external corpus.

  3. Memorize

     During Memorize, SE-Search applies Memory Purification to filter noisy passages and update the self-memory inside <memory> tags with distilled key evidence.

  4. Answer

     In the Answer step, SE-Search conditions on the purified memory and the question to generate the final answer, guided by Dense Rewards for accuracy and format. The full loop is sketched in code after this list.
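To make the four steps concrete, here is a hypothetical sketch of one session; `llm`, `retriever`, and all prompt strings are illustrative stand-ins under assumed interfaces, not the paper's actual API:

```python
def run_session(question, llm, retriever, max_rounds=4, top_k=3):
    """Hypothetical think-search-memorize-answer loop. `llm` (str -> str) and
    `retriever` (query -> list of document strings) are assumed callables,
    not the paper's API; prompts are stand-ins."""
    memory = []  # evolving self-memory, carried in <memory> tags in the prompt
    for _ in range(max_rounds):
        # Think: plan the next step from the question and purified memory.
        plan = llm(f"Question: {question}\n<memory>{' '.join(memory)}</memory>\nPlan:")
        if "ANSWER" in plan:  # the policy signals it has enough evidence
            break
        # Search: issue several short atomic queries instead of one long one.
        queries = [q.strip() for q in plan.split(";") if q.strip()]
        docs = [d for q in queries for d in retriever(q, top_k=top_k)]
        # Memorize: purify retrieved documents into distilled key evidence.
        evidence = llm(f"Keep only facts that answer '{question}':\n" + "\n".join(docs))
        memory.append(evidence)
    # Answer: condition on the question and the purified memory only.
    return llm(f"Question: {question}\n<memory>{' '.join(memory)}</memory>\nAnswer:")
```

The key design point the sketch preserves is that the answer step never sees raw retrieved documents, only the purified memory.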

KEY CONTRIBUTIONS

Key Contributions

  • Self-Evolving Search Agent

    SE-Search introduces a self-evolving search agent with Memory Purification, Atomic Query, and Dense Rewards, achieving 0.420 average EM across seven QA benchmarks.

  • Memory Purification

    SE-Search uses Memory Purification to distill retrieved documents into an evolving self-memory, improving EM on multi-hop datasets such as Musique by 0.050 when first added in the ablation.

  • Atomic Query and Dense Rewards

    SE-Search combines Atomic Query counting with Dense Rewards over queries, memory, outcome F1, and format, yielding a 0.108 EM gain over Search-R1-Base; a reward sketch follows this list.
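As a rough sketch of how such a composite dense reward could be assembled, assuming illustrative component names and weights that the summary above does not specify:

```python
def dense_reward(n_atomic, memory_kept, answer_f1, format_ok,
                 w_q=0.1, w_m=0.2, w_f1=0.6, w_fmt=0.1):
    """Illustrative composite dense reward over atomic-query count, memory
    quality, outcome F1, and format. Names and weights are our assumptions,
    not the paper's published formulation."""
    query_score = min(n_atomic, 3) / 3.0        # reward splitting a search into short atomic queries, capped
    return (w_q * query_score
            + w_m * float(memory_kept)          # purified memory retained the gold evidence?
            + w_f1 * answer_f1                  # token-level F1 of the final answer
            + w_fmt * float(format_ok))         # well-formed <think>/<memory>/<answer> structure

print(dense_reward(n_atomic=2, memory_kept=True, answer_f1=0.8, format_ok=True))
# -> 0.1*0.667 + 0.2*1.0 + 0.6*0.8 + 0.1*1.0 = 0.8467
```

Shaping every component this way gives the policy a gradient signal on intermediate actions, not just on the final answer.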

RESULTS

By the Numbers

  • Avg. EM: 0.420 (+0.108 over Search-R1-Base)

  • HotpotQA EM: 0.450 (+0.045 over AutoRefine-Base)

  • Bamboogle EM: 0.424 (+0.080 over AutoRefine-Base)

  • Search calls: 1.32 (14% fewer than the initial 1.53 during training)

On NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, and Bamboogle, SE-Search reaches 0.420 average EM, showing that dense rewards and atomic queries materially improve search agent QA performance.


BENCHMARK

Accuracy comparison of SE-Search-3B against baseline methods using Qwen2.5-3B

Average EM across seven QA benchmarks.

BENCHMARK

Ablation results showing the effect of individual SE-Search components

Average EM over seven QA benchmarks for ablated SE-Search variants.

KEY INSIGHT

The Counterintuitive Finding

SE-Search increases average EM from about 0.36 to 0.41 while reducing average search calls from 1.53 to 1.32, both by roughly 14%.

This is surprising because you might expect higher accuracy to require more retrieval, but SE-Search shows that Atomic Queries and Memory Purification make fewer search calls more effective.
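A quick sanity check of the two ~14% figures, using the numbers reported above:

```python
em_gain = (0.41 - 0.36) / 0.36    # ~0.139 -> ~14% relative EM improvement
call_drop = (1.53 - 1.32) / 1.53  # ~0.137 -> ~14% fewer search calls
print(f"{em_gain:.1%} EM gain, {call_drop:.1%} fewer calls")
```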

WHY IT MATTERS

What this unlocks for the field

SE-Search unlocks search agents that adapt their search frequency, refine memory, and receive dense feedback, rather than blindly stacking retrieved passages.

Builders can now design RAG agents that learn when and how to search, using Atomic Query and Dense Rewards to trade off retrieval cost against answer quality in a principled way.


