MobileRAG: A Fast, Memory-Efficient, and Energy-Efficient Method for On-Device RAG

Authors: Taehwan Park, Geonho Lee, Min-Soo Kim

2025

TL;DR

MobileRAG combines the EcoVector partitioned graph index with Selective Content Reduction to cut end-to-end RAG power consumption by up to 40.2% while preserving accuracy.



THE PROBLEM

On-device RAG hits memory, power, and latency walls on phones

MobileRAG targets mobile devices such as the Galaxy S24, where only 5–6 GB of RAM is left for apps, so large vector indices and rerankers run out of memory.

Under these constraints, RAG pipelines built on IVF or HNSW search and on-device small language models (sLMs) suffer high power consumption, long time-to-first-token (TTFT), and a degraded user experience when searching personal data.

HOW IT WORKS

MobileRAG — EcoVector indexing plus Selective Content Reduction

MobileRAG is built from four components: EcoVector, Selective Content Reduction (SCR), Index Build, and Index Update. Together they partition vectors, load graphs only partially, and shrink the sLM's input.

You can think of EcoVector as RAM holding a small centroid graph and disk holding many tiny per-cluster graphs, like a card catalog pointing to shelves.

This design lets MobileRAG search large personal collections and feed only the most relevant reduced chunks to the sLM, beyond what a plain context window can handle.
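As a rough illustration (not the authors' code), the two-level lookup behind EcoVector can be sketched as a small centroid table that stays resident in RAM plus per-cluster stores that are scanned only when probed. All class and parameter names below are invented for the sketch, and the in-memory dictionary stands in for the small per-cluster graphs EcoVector keeps on flash.

```python
import numpy as np

class TwoLevelIndex:
    """Sketch of a centroid-table-in-RAM, clusters-on-disk index."""

    def __init__(self, vectors, n_clusters=4, seed=0):
        rng = np.random.default_rng(seed)
        # Crude one-shot clustering: pick random centroids, assign each
        # vector to its nearest centroid. (EcoVector's real build is richer.)
        self.centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - self.centroids[None], axis=2), axis=1)
        # Each entry stands in for one small on-disk per-cluster graph.
        self.clusters = {c: (np.flatnonzero(assign == c), vectors[assign == c])
                         for c in range(n_clusters)}

    def search(self, q, k=3, nprobe=2):
        # Step 1: coarse search touches ONLY the in-RAM centroid table.
        order = np.argsort(np.linalg.norm(self.centroids - q, axis=1))[:nprobe]
        # Step 2: "load" just the probed clusters and scan their vectors.
        ids, vecs = zip(*(self.clusters[int(c)] for c in order))
        ids, vecs = np.concatenate(ids), np.concatenate(vecs)
        top = np.argsort(np.linalg.norm(vecs - q, axis=1))[:k]
        return ids[top]
```

Because only `nprobe` clusters are ever touched per query, peak memory tracks the centroid table plus a few small clusters rather than the whole collection, which is the intuition behind the card-catalog analogy above.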

DIAGRAM

MobileRAG query-time retrieval and reduction pipeline

This diagram shows how MobileRAG processes a user query through EcoVector search and the SCR method before on-device sLM inference.

DIAGRAM

MobileRAG evaluation setup on SIFT, NYTimes, and QA benchmarks

This diagram shows how MobileRAG is evaluated on ANNS datasets and QA benchmarks for memory, latency, power, and accuracy.

PROCESS

How MobileRAG Handles a Chat Application Session

  1. Index Build

    MobileRAG runs Index Build to perform Document Selection, DB Construction, and EcoVector Index creation over user files on-device.

  2. Index Update

    MobileRAG uses Index Update to insert or delete vectors via EcoVector Update and synchronize the Embedding, Document, and Metadata tables.

  3. Query Submission

    In Query Submission, MobileRAG embeds the user query and runs EcoVector Vector Search to retrieve the top-k document chunks from the local index.

  4. Selective Content Reduction

    MobileRAG applies Selective Content Reduction to re-chunk, score, select, and reorder content before Prompt Augmentation and on-device sLM inference.
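The bookkeeping in steps 01–02 can be sketched with a toy store (all names here are hypothetical, not the paper's schema): inserting a document writes matching rows to the Embedding, Document, and Metadata tables and registers the vector with its nearest cluster; deleting removes all three, so the tables never drift out of sync with the index.

```python
import numpy as np

class MiniStore:
    """Toy stand-in for the Embedding / Document / Metadata tables
    that Index Update keeps consistent with the vector index."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.embeddings, self.documents, self.metadata = {}, {}, {}
        self.clusters = {c: set() for c in range(len(centroids))}

    def insert(self, doc_id, vec, text, meta):
        # Route the vector to its nearest cluster, then write all three
        # tables in the same step so they stay synchronized.
        cid = int(np.argmin(np.linalg.norm(self.centroids - vec, axis=1)))
        self.embeddings[doc_id] = vec
        self.documents[doc_id] = text
        self.metadata[doc_id] = meta | {"cluster": cid}
        self.clusters[cid].add(doc_id)

    def delete(self, doc_id):
        # Deletion removes the cluster entry and all three table rows.
        cid = self.metadata.pop(doc_id)["cluster"]
        self.clusters[cid].discard(doc_id)
        del self.embeddings[doc_id], self.documents[doc_id]
```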

KEY CONTRIBUTIONS

Key Contributions

  • EcoVector indexing method

    MobileRAG introduces EcoVector, which partitions vectors into clusters, keeps a centroid graph in RAM, and stores per-cluster inverted-list graphs on disk to cut memory and power.

  • Selective Content Reduction method

    MobileRAG proposes Selective Content Reduction with similarity-based reordering, reducing SQuAD context from 155 to 90 tokens while preserving accuracy.

  • On-device MobileRAG Chat prototype

    MobileRAG delivers a fully offline chat prototype on a Galaxy S24, speeding up vector search by 1.72–8.89× and cutting power consumption by up to 40.2%.
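A minimal sketch of an SCR-style pass, under stated assumptions: the cosine scorer, whitespace token counting, and the function name are all simplifications invented here, and the paper's actual chunking, scoring, and reordering are richer. The sketch keeps the most query-relevant sentence chunks under a token budget.

```python
import numpy as np

def reduce_content(query_vec, sentences, sent_vecs, token_budget=90):
    """Greedy SCR-style pass: score re-chunked sentences against the
    query, then keep the best ones that fit under a token budget."""
    # Cosine similarity of each sentence chunk to the query embedding.
    sims = sent_vecs @ query_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(query_vec))
    kept, used = [], 0
    # Visiting chunks in descending similarity doubles as the reorder
    # step: the selected text comes out most-relevant-first.
    for i in np.argsort(-sims):
        n_tokens = len(sentences[i].split())  # crude whitespace token count
        if used + n_tokens <= token_budget:
            kept.append(int(i))
            used += n_tokens
    return [sentences[i] for i in kept]
```

Because selection happens after retrieval, the full chunks are still available at scoring time, which is the property the Key Insight section credits for avoiding the accuracy loss of naive compressors.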

RESULTS

By the Numbers

  • Search latency speedup: 1.72–8.89× vs IVF and HNSW at 0.93 recall@10 on SIFT

  • TTFT reduction: 1.18–1.41× vs Naive-RAG, EdgeRAG, and Advanced RAG across QA benchmarks

  • Memory reduction: 10.7–54.5% vs baseline RAG pipelines on mobile devices

  • Power reduction: 24.4–40.2% vs Naive-RAG, EdgeRAG, and Advanced RAG end-to-end

These numbers come from SIFT, NYTimes, SQuAD, HotpotQA, and TriviaQA on a Galaxy S24, showing that MobileRAG maintains accuracy while sharply reducing latency, memory, and power.


BENCHMARK

Qwen2.5 1.5B on SQuAD: Accuracy versus TTFT and Power

Accuracy on SQuAD with Qwen2.5 1.5B for MobileRAG and baselines.

KEY INSIGHT

The Counterintuitive Finding

MobileRAG’s SCR method cuts SQuAD context tokens from 155 to 90, a 42% reduction, without any loss of accuracy.

This is surprising because naive compressors and small initial chunks lose context and sharply reduce accuracy, yet MobileRAG’s post-retrieval reduction avoids that tradeoff.

WHY IT MATTERS

What this unlocks for the field

MobileRAG makes fully offline, privacy-preserving RAG practical on commodity phones by jointly optimizing vector search and sLM input size.

Builders can now deploy on-device assistants that search large personal collections with low latency and battery impact, without relying on server-side LLMs.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa · 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang · 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
