Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP

Authors: Martin Vogel, Falk Meyer-Eschenbach, Severin Kohler et al.

2026

TL;DR

Codebase-Memory uses a Tree-Sitter-based knowledge graph in SQLite exposed via MCP tools to answer code-structure questions with 10× fewer tokens at 83% quality vs. 92% for a file-exploration agent.



THE PROBLEM

LLM agents burn 10× tokens on unstructured code exploration

LLM coding agents repeatedly read files and run grep searches, consuming hundreds of thousands of tokens per task without building structural understanding.

This makes structural questions like impact analysis expensive and slow, because agents lack call graphs, dependency chains, and module boundaries in a queryable form.

HOW IT WORKS

Codebase-Memory — Tree-Sitter Knowledge Graphs via MCP

Codebase-Memory centers on a three-stage pipeline (Parse, Build, Serve), plus a FunctionRegistry and Louvain communities, all stored in SQLite and exposed via the MCP tool interface.

You can think of Codebase-Memory as turning a codebase into a card catalog: files become graph nodes, relationships become edges, and MCP tools are the librarian answering structural questions.

This design lets Codebase-Memory answer structural queries like impact analysis or hub detection directly from the graph, instead of re-reading raw files into a limited context window.
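The card-catalog idea can be sketched with a tiny in-memory call graph: build a reverse index once, and "who calls f?" becomes a single lookup instead of a repo-wide grep. This is a minimal illustration, not the paper's implementation; the function names and edges below are hypothetical.

```python
from collections import defaultdict

# Hypothetical call edges (caller -> callee); in Codebase-Memory these
# would come from Tree-Sitter parsing, not a hand-written list.
call_edges = [
    ("api.handle_request", "auth.check_token"),
    ("api.handle_request", "db.load_user"),
    ("jobs.refresh_cache", "db.load_user"),
]

# Build a reverse index once; afterwards impact analysis is one lookup.
callers_of = defaultdict(set)
for caller, callee in call_edges:
    callers_of[callee].add(caller)

def impact(fn: str) -> set[str]:
    """Return the direct callers affected by changing `fn`."""
    return callers_of[fn]

print(sorted(impact("db.load_user")))
# ['api.handle_request', 'jobs.refresh_cache']
```

The point of the sketch: the expensive work (parsing, edge extraction) happens once at index time, so each structural question at query time costs a handful of tokens rather than a fresh file crawl.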

DIAGRAM

Multi-pass pipeline from source files to knowledge graph

This diagram shows how Codebase-Memory runs the six pipeline phases to turn source files into a Tree-Sitter-based knowledge graph.

DIAGRAM

Head-to-head evaluation: MCP Agent vs Explorer Agent

This diagram shows how Codebase-Memory is evaluated by comparing an MCP Agent using MCP tools against an Explorer Agent using file reading and grep.

PROCESS

How Codebase-Memory Handles a Repository Query Session

  1. Parse stage

    In the Parse stage, Codebase-Memory walks Tree-Sitter ASTs across 66 languages, extracting definitions, call sites, imports, and traits.

  2. Build stage

    In the Build stage, Codebase-Memory runs the multi-pass pipeline, using parallel worker pools and the FunctionRegistry to assemble nodes and edges.

  3. Serve stage

    In the Serve stage, Codebase-Memory flushes the graph into SQLite, computes Louvain communities, and exposes everything via the MCP tool interface.

  4. Incremental synchronization

    Codebase-Memory watches files, uses XXH3 hashes to detect changes, and re-runs the relevant pipeline phases to keep the knowledge graph fresh.
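The incremental-sync step can be approximated with content hashing: re-run the pipeline only for files whose fingerprint changed since the last snapshot. A minimal sketch follows; it uses stdlib `blake2b` in place of XXH3 (a portability assumption, not the paper's choice), and the snapshot format is hypothetical.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    # The paper uses XXH3; blake2b is a stdlib stand-in playing the same
    # role here: a fast content fingerprint, not a security primitive.
    return hashlib.blake2b(path.read_bytes(), digest_size=8).hexdigest()

def changed_files(root: Path, snapshot: dict[str, str]) -> list[Path]:
    """Compare current hashes against the previous snapshot and return
    files whose content changed (these get re-run through the pipeline)."""
    dirty = []
    for path in sorted(root.rglob("*.py")):
        digest = file_hash(path)
        if snapshot.get(str(path)) != digest:
            dirty.append(path)
            snapshot[str(path)] = digest  # update the snapshot in place
    return dirty
```

Calling `changed_files` twice in a row returns an empty list the second time, which is the property that keeps re-indexing cost proportional to the edit, not the repository.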

KEY CONTRIBUTIONS

Key Contributions

  • Knowledge-graph architecture for code

    Codebase-Memory combines Tree-Sitter parsing across 66 languages, a multi-phase Build stage, FunctionRegistry, and Louvain communities into a single SQLite knowledge graph with zero external dependencies.

  • MCP-based structural tool interface

    Codebase-Memory exposes 14 typed MCP tools, including search_graph, trace_call_path, query_graph, and get_architecture, enabling sub-millisecond structural queries for any MCP-compatible agent.

  • Head-to-head evaluation across 31 languages

    Codebase-Memory achieves 0.83 answer quality versus 0.92 for a file-exploration agent, with ten times fewer tokens and 2.1 times fewer tool calls on 31 real-world repositories.
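A trace_call_path-style tool over a SQLite edge table can be expressed as a recursive CTE, which is one plausible way sub-millisecond structural queries get served. The schema and data below are a simplified guess for illustration, not the paper's actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified guess at an edge table; the real schema is richer.
    CREATE TABLE edges (caller TEXT, callee TEXT);
    INSERT INTO edges VALUES
        ('main', 'parse_args'),
        ('main', 'run'),
        ('run', 'load_config'),
        ('load_config', 'read_file');
""")

def trace_call_path(root: str) -> list[tuple[str, int]]:
    """Walk the call graph downward from `root`, returning (function, depth).
    Assumes an acyclic graph; real code would cap depth or track visited nodes."""
    return conn.execute("""
        WITH RECURSIVE reach(fn, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT e.callee, r.depth + 1
            FROM edges e JOIN reach r ON e.caller = r.fn
        )
        SELECT fn, depth FROM reach ORDER BY depth, fn
    """, (root,)).fetchall()

print(trace_call_path("run"))
# [('run', 0), ('load_config', 1), ('read_file', 2)]
```

Because the traversal runs inside SQLite over an indexed table, the agent receives a few result rows instead of the raw file contents, which is where the token savings come from.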

RESULTS

By the Numbers

  • Quality score: 0.83 (-0.09 vs Explorer Agent's 0.92)
  • Tool calls per question: 2.3 (-2.5 vs Explorer Agent's 4.8)
  • Tokens per question: ~1,000 (-9,000 vs Explorer Agent's ~10,000)
  • Query latency: <1 ms (vs Explorer Agent's 10–30 s)

On a benchmark of 12 question categories across 31 languages, Codebase-Memory delivers 0.83 quality versus 0.92 for the Explorer Agent while cutting tokens and tool calls dramatically. This shows that Codebase-Memory can trade a small quality gap for 10× token savings and >100× faster structural queries.


BENCHMARK

Head-to-head comparison: MCP Agent vs Explorer Agent

Average quality score for structural code questions across 31 languages.

KEY INSIGHT

The Counterintuitive Finding

Codebase-Memory reaches 0.83 quality versus 0.92 for an Explorer Agent while using ten times fewer tokens and 2.1 times fewer tool calls.

This is surprising because many assume structural indexing would require more complexity and overhead, yet Codebase-Memory actually simplifies agent behavior and still matches or exceeds the Explorer Agent on 19 of 31 languages for graph-native queries.

WHY IT MATTERS

What this unlocks for the field

Codebase-Memory makes structural code retrieval a first-class capability, letting agents query call graphs, communities, and impact analysis directly via MCP tools.

This enables builders to create coding agents that start from a rich structural map instead of blind file crawling, making large, polyglot repositories and even Linux-kernel-scale projects practical to explore interactively.


