AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Authors: Petr Anokhin, Nikita Semenov, Artyom Sorokin, et al.

2024

TL;DR

AriGraph links semantic and episodic memory in a unified knowledge graph world model, enabling the Ariadne agent to score 593.00 on NetHack using only local room observations, close to NetPlay’s 675.33 with full level observations.



THE PROBLEM

LLM agents with unstructured memory fail in complex text games

LLM agents typically rely on full history, summarization, or RAG, but these unstructured memories “do not facilitate the reasoning and planning essential for complex decision-making.”

In TextWorld and NetHack, this causes agents to miss scattered clues and outdated facts, so they cannot reliably navigate, plan, or complete long-horizon tasks.

HOW IT WORKS

AriGraph world model with semantic and episodic memory

AriGraph centers on a semantic memory graph, episodic memory vertices, semantic search, episodic search, and the Ariadne cognitive architecture for planning and decision making.

You can think of AriGraph like a hippocampus-backed card catalog: semantic memory is the catalog of facts, while episodic memory stores richly detailed pages linked to those facts.

This key design lets AriGraph retrieve structured world state and precise past observations, enabling reasoning and exploration that a plain context window or flat vector store cannot support.
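The coupling described above can be made concrete with a small sketch. This is a hypothetical rendering of the two memories (the class and method names are ours, not the paper’s API): semantic memory holds (subject, relation, object) triplets, and each episodic vertex stores a raw observation plus links back to the triplets extracted from it.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str

@dataclass
class EpisodicVertex:
    step: int
    observation: str
    linked_triplets: frozenset  # edges back into semantic memory

@dataclass
class AriGraph:
    semantic: set = field(default_factory=set)    # triplet facts (V_s, E_s)
    episodic: list = field(default_factory=list)  # observations in temporal order (V_e)

    def add_observation(self, step, text, triplets):
        """Learn new facts and remember the episode that produced them."""
        self.semantic.update(triplets)
        self.episodic.append(EpisodicVertex(step, text, frozenset(triplets)))

graph = AriGraph()
graph.add_observation(
    0, "You see a locked chest in the kitchen.",
    [Triplet("chest", "is_in", "kitchen"),
     Triplet("chest", "state", "locked")])
```

Because every episodic vertex points back to the triplets it produced, retrieving a fact can also surface the exact observation where that fact was learned, which a flat vector store does not support directly.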

DIAGRAM

AriGraph retrieval pipeline for decision making

This diagram shows how AriGraph runs semantic and episodic search to populate working memory before Ariadne chooses an action.
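The two-stage retrieval in the diagram can be sketched as follows. The scoring here is a deliberate simplification (plain word overlap instead of the embedding similarity the paper uses), but the control flow matches: semantic search pulls relevant triplets, then episodic search ranks past observations by how many of those triplets they are linked to.

```python
def semantic_search(graph_triplets, query, k=3):
    """Return the k triplets sharing the most words with the query (toy scorer)."""
    q = set(query.lower().split())
    def score(t):
        return len(q & {t[0], t[1], t[2]})
    return sorted(graph_triplets, key=score, reverse=True)[:k]

def episodic_search(episodes, relevant_triplets, k=1):
    """Rank past observations by how many retrieved triplets they touch."""
    rel = set(relevant_triplets)
    return sorted(episodes,
                  key=lambda e: len(rel & e["triplets"]),
                  reverse=True)[:k]

triplets = [("chest", "is_in", "kitchen"),
            ("key", "is_in", "garden"),
            ("chest", "state", "locked")]
episodes = [
    {"text": "A locked chest sits in the kitchen.",
     "triplets": {triplets[0], triplets[2]}},
    {"text": "A rusty key lies in the garden.",
     "triplets": {triplets[1]}},
]
facts = semantic_search(triplets, "where is the chest", k=2)
recall = episodic_search(episodes, facts, k=1)
```

The retrieved `facts` and `recall` together form the working memory handed to the planner, so the agent sees both the structured state and the verbatim observation behind it.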

DIAGRAM

Evaluation setup across TextWorld, NetHack, and Q&A

This diagram summarizes how AriGraph is evaluated on TextWorld games, NetHack variants, and multi-hop Q&A benchmarks against LLM and RL baselines.

PROCESS

How AriGraph Handles an Interactive TextWorld Episode

  1. AriGraph world model learning

    AriGraph receives the new observation and uses semantic memory and episodic memory to add vertices and edges capturing objects and relations.

  2. Memory graph search

    AriGraph runs semantic search and episodic search over the graph to retrieve relevant triplets and past observations for the current situation.

  3. Planning stage

    Within the Ariadne cognitive architecture, the planning module uses working memory plus retrieved knowledge to generate or update a structured plan of sub-goals.

  4. Decision making

    The ReAct-based decision-making module in Ariadne reads the plan and working memory, then selects an action aligned with AriGraph’s world model.
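The four steps above fit into one control loop per episode. This is a hedged sketch: `env`, `llm`, and `graph` are placeholder interfaces of our own design, not the paper’s actual API, and the stubs below exist only to make the loop runnable.

```python
def ariadne_episode(env, llm, graph, max_steps=50):
    """One TextWorld-style episode: learn, retrieve, plan, act (hypothetical loop)."""
    obs = env.reset()
    plan = []
    for step in range(max_steps):
        graph.learn(step, obs, llm.extract_triplets(obs))   # 1. world model learning
        facts = graph.semantic_search(obs)                  # 2a. semantic search
        episodes = graph.episodic_search(facts)             # 2b. episodic search
        plan = llm.update_plan(plan, obs, facts, episodes)  # 3. planning stage
        action = llm.choose_action(plan, obs, facts)        # 4. ReAct-style decision
        obs, reward, done = env.step(action)
        if done:
            return reward
    return 0.0

# Minimal stubs so the loop can run end to end.
class _StubEnv:
    def __init__(self): self.t = 0
    def reset(self): return "start"
    def step(self, action):
        self.t += 1
        return "obs", 1.0, self.t >= 2  # done after two steps

class _StubLLM:
    def extract_triplets(self, obs): return [(obs, "seen_at", "t")]
    def update_plan(self, plan, *args): return plan + ["subgoal"]
    def choose_action(self, *args): return "go north"

class _StubGraph:
    def __init__(self): self.facts = set()
    def learn(self, step, obs, triplets): self.facts.update(triplets)
    def semantic_search(self, obs): return list(self.facts)
    def episodic_search(self, facts): return []

reward = ariadne_episode(_StubEnv(), _StubLLM(), _StubGraph())
```

The key design point is that the graph, not the LLM context window, carries state across steps: each iteration reads from and writes to persistent memory before any planning happens.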

KEY CONTRIBUTIONS

Key Contributions

  • AriGraph world model

    AriGraph defines a unified semantic memory and episodic memory graph G = (V_s, E_s, V_e, E_e) that is incrementally learned from textual observations in interactive environments.

  • Ariadne cognitive architecture

    Ariadne integrates AriGraph with planning and decision making, enabling LLM agents to solve Treasure Hunt, Cleaning, and Cooking games, outperforming full-history, summarization, RAG, Simulacra, and Reflexion baselines.

  • Cross-domain evaluation

    AriGraph is evaluated in TextWorld, NetHack, and multi-hop Q&A, achieving a NetHack score of 593.00 with room observations and 68.0 EM on HotpotQA using GPT-4.
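Because the graph G = (V_s, E_s, V_e, E_e) is learned incrementally, semantic memory must also handle facts that become outdated as the world changes (the problem noted earlier with unstructured memories). The sketch below is our simplification of that idea, not the paper’s mechanism: when a new triplet shares subject and relation with a stored one, the old fact is treated as outdated and replaced.

```python
def update_semantic(memory, new_triplets):
    """Add new triplets, replacing any stored fact with the same subject and relation."""
    for subj, rel, obj in new_triplets:
        memory = {(s, r, o) for (s, r, o) in memory
                  if not (s == subj and r == rel)}  # drop the outdated fact
        memory.add((subj, rel, obj))
    return memory

mem = {("key", "is_in", "garden"), ("chest", "state", "locked")}
# The agent picks up the key: its location changes, so the old triplet is outdated.
mem = update_semantic(mem, [("key", "is_in", "inventory")])
```

Replacement rather than accumulation keeps the semantic graph a consistent snapshot of the current world state, while the superseded observation remains recoverable through episodic memory.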

RESULTS

By the Numbers

NetHack Score

593.00

+251.33 over NetPlay (Room obs)

NetHack Levels

6.33

+2.66 over NetPlay (Room obs)

HotpotQA EM

68.0

+13.0 over GPT-4 full context

MuSiQue EM

45.0

within 3.0 points of the HOLMES baseline (48.0 EM)

These metrics come from NetHack and the MuSiQue and HotpotQA benchmarks, which test exploration, long-horizon reasoning, and multi-hop question answering. The results show that AriGraph lets Ariadne nearly match a memory oracle on NetHack and reach 68.0 EM on HotpotQA while using a general-purpose world model.


BENCHMARK

NetHack performance with and without AriGraph world model

Average NetHack game score across three runs for different agents.

KEY INSIGHT

The Counterintuitive Finding

Ariadne with AriGraph and only room observations reaches a NetHack score of 593.00, close to NetPlay’s 675.33 with full level observations.

This is surprising because one would expect removing handcrafted level observations to cripple performance, yet AriGraph’s learned world model keeps the gap to just 82.33 points.

WHY IT MATTERS

What this unlocks for the field

AriGraph shows that LLM agents can learn structured world models online, combining semantic and episodic memory for long-horizon reasoning and exploration.

Builders can now design agents that operate in partially observable environments like NetHack or TextWorld without handcrafted state or massive context windows, while still supporting multi-hop Q&A over accumulated experience.


