ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Authors: Mofasshara Rafique, Laurent Bindschaedler

2026

TL;DR

ClawVM adds harness-enforced typed pages with minimum-fidelity invariants and validated writeback, cutting mean policy-controllable faults from 67.8 per configuration (Retrieval baseline) to 0.0 with a median policy overhead under 50 μs per turn.


THE PROBLEM

Stateful agents silently lose state: retrieval-only baselines average 67.8 explicit faults per configuration

Stateful tool-using agents repeatedly lose constraints and tool outputs, with retrieval-only baselines averaging 67.8 explicit faults per workload–budget configuration.

OpenClaw-based assistants and similar systems then re-run tools, forget bootstrap policies, and drop dirty state on reset. The result is duplicate calls, broken protocols, and lost progress mid-plan.

HOW IT WORKS

ClawVM — Harness-managed virtual memory for agent state

ClawVM manages agent state as typed pages under a strict token budget, using five components: a SessionPageTable, a RepresentationSelector, a FaultObserver, a WritebackJournal, and the ClawVMEngine.

You can think of ClawVM as an OS-style virtual memory layer where the context window is RAM and durable stores are disk, but enforced by the agent harness.

This lets ClawVM enforce minimum-fidelity invariants, multi-resolution residency, and lifecycle-complete, validated writeback that a plain context window and best-effort compaction cannot guarantee.
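
As a rough illustration of what a typed page with a minimum-fidelity invariant might look like, here is a minimal Python sketch. The page type names, the per-type floors, the `TypedPage` class, and the ordering of the four representation levels are all assumptions made for this example; the summary does not specify them.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fidelity(IntEnum):
    # Assumed ordering, lowest to highest fidelity.
    POINTER = 0
    STRUCTURED = 1
    COMPRESSED = 2
    FULL = 3


# Hypothetical page types and floors; the paper's actual types may differ.
MIN_FIDELITY = {
    "bootstrap_policy": Fidelity.FULL,   # effectively pinned at full text
    "constraint": Fidelity.STRUCTURED,   # may shrink, but must stay structured
    "tool_output": Fidelity.POINTER,     # may shrink down to a reference
}


@dataclass
class TypedPage:
    page_id: str
    page_type: str
    level: Fidelity = Fidelity.FULL

    def set_level(self, new_level: Fidelity) -> None:
        """Change representation, refusing anything below the type's floor."""
        floor = MIN_FIDELITY[self.page_type]
        if new_level < floor:
            raise ValueError(
                f"{self.page_id}: {new_level.name} is below floor {floor.name}"
            )
        self.level = new_level
```

In this sketch the invariant lives in the harness-side setter rather than in the prompt, so a degradation request that would violate a floor fails loudly instead of silently dropping state.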

DIAGRAM

Paging and writeback lifecycle for a single ClawVM turn

This diagram shows how ClawVM pages state, selects representations, detects faults, and performs validated writeback over a single agent turn.

DIAGRAM

Evaluation pipeline for ClawVM policies and oracle

This diagram shows how ClawVM replays workloads, compares policies to an offline oracle, and measures explicit faults and thrash.
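
To make "measuring explicit faults" concrete, the sketch below counts one fault class, duplicate-tool faults. It assumes a duplicate-tool fault means re-running a tool with identical arguments after its earlier result left the resident set; that definition, the `ToyFaultObserver` class, and its method names are illustrative assumptions, not the paper's FaultObserver.

```python
from collections import Counter


class ToyFaultObserver:
    """Counts duplicate-tool faults under the assumed definition:
    a tool re-run with identical arguments after its earlier
    result was evicted from the resident set."""

    def __init__(self):
        self.seen = set()        # (tool, args) pairs already executed
        self.resident = set()    # pairs whose result is still in context
        self.faults = Counter()  # fault class -> count

    def on_tool_call(self, tool, args):
        key = (tool, args)       # args must be hashable, e.g. a tuple
        if key in self.seen and key not in self.resident:
            self.faults["duplicate_tool"] += 1
        self.seen.add(key)
        self.resident.add(key)

    def on_evict(self, tool, args):
        self.resident.discard((tool, args))


obs = ToyFaultObserver()
obs.on_tool_call("search", ("clawvm",))
obs.on_evict("search", ("clawvm",))      # result dropped from context
obs.on_tool_call("search", ("clawvm",))  # same call again: counted as a fault
```

Replaying a recorded trace through an observer like this under two different policies would give comparable per-policy fault counts.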

PROCESS

How ClawVM Handles an Agent Session Lifecycle

01
Prompt assembly and page selection
ClawVM uses the SessionPageTable and RepresentationSelector to choose a resident set of typed pages and representations within the token budget for each call.
02
Multi-resolution residency and degradation
ClawVM maintains pages at full, compressed, structured, or pointer levels, degrading via the RepresentationSelector while respecting minimum-fidelity invariants.
03
Fault observation and replay logging
The FaultObserver records refetch, duplicate-tool, bootstrap, flush-miss, and silent-recall faults into the DecisionTrace for deterministic replay and oracle comparison.
04
Validated writeback at lifecycle boundaries
The WritebackJournal stages, validates, and commits non-destructive updates whenever compaction, pruning, or reset events occur, ensuring lifecycle-complete durability.
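
Steps 01 and 02 above can be sketched as a small budget-fitting loop. This is a toy under stated assumptions, not the paper's RepresentationSelector: the `Page` fields, the token costs, the level ordering, and the choice to degrade the least-recently-used page first are all hypothetical.

```python
from dataclasses import dataclass

# Assumed ordering, lowest to highest fidelity.
LEVELS = ["pointer", "structured", "compressed", "full"]


@dataclass
class Page:
    page_id: str
    min_level: int   # index into LEVELS; the minimum-fidelity floor
    cost: list       # token cost per level, indexed like LEVELS
    last_used: int   # turn number of the most recent access
    level: int = 3   # start fully resident


def fit_to_budget(pages, budget):
    """Degrade least-recently-used pages one level at a time until the
    resident set fits the budget, never going below a page's floor."""
    def total():
        return sum(p.cost[p.level] for p in pages)

    while total() > budget:
        candidates = [p for p in pages if p.level > p.min_level]
        if not candidates:
            raise ValueError("budget cannot satisfy minimum-fidelity floors")
        victim = min(candidates, key=lambda p: p.last_used)
        victim.level -= 1
    return {p.page_id: LEVELS[p.level] for p in pages}


pages = [
    Page("policy", min_level=3, cost=[5, 20, 40, 100], last_used=0),
    Page("tool_out", min_level=0, cost=[5, 20, 40, 100], last_used=1),
    Page("notes", min_level=1, cost=[5, 20, 40, 100], last_used=2),
]
plan = fit_to_budget(pages, budget=200)
# plan == {"policy": "full", "tool_out": "pointer", "notes": "compressed"}
```

The pinned `policy` page is never a degradation candidate even though it is the least recently used, which is the behavior the minimum-fidelity invariant is meant to guarantee.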

KEY CONTRIBUTIONS

01
Virtual memory contract for agent state
ClawVM defines typed pages with minimum-fidelity invariants and multi-resolution representations, enforced via the SessionPageTable and RepresentationSelector under a fixed token budget.
02
Observable fault model and replay oracle
ClawVM’s FaultObserver and DecisionTrace introduce explicit refetch, duplicate-tool, bootstrap, flush-miss, and silent-recall faults plus an offline oracle with horizon h=3 for policy analysis.
03
Lifecycle-complete validated writeback
ClawVM’s WritebackJournal enforces deterministic, non-destructive commit at compaction and reset, cutting mean explicit faults from 67.8 (Retrieval) and 1.5 (Comp-Hybrid) to 0.0 across 24 configurations.
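
The stage, validate, commit pattern behind lifecycle-complete writeback can be sketched as follows. `ToyWritebackJournal`, its validator callback, and the append-only version list are assumptions for illustration; the summary does not describe the real WritebackJournal's interface.

```python
class ToyWritebackJournal:
    """Stages dirty updates, validates them, then commits by appending
    a new version, so earlier versions are never overwritten."""

    def __init__(self, validator):
        self.validator = validator  # callable(key, value) -> bool
        self.staged = {}            # key -> pending value
        self.store = {}             # key -> list of committed versions

    def stage(self, key, value):
        self.staged[key] = value

    def flush(self):
        """Call at a lifecycle boundary (compaction, pruning, reset).
        Returns the keys that failed validation and remain staged."""
        rejected = []
        for key, value in list(self.staged.items()):
            if self.validator(key, value):
                self.store.setdefault(key, []).append(value)
                del self.staged[key]
            else:
                rejected.append(key)
        return rejected


journal = ToyWritebackJournal(validator=lambda key, value: bool(value))
journal.stage("plan", "step 1 done")
journal.flush()                      # commits "plan"
journal.stage("plan", "step 2 done")
journal.stage("notes", "")           # empty update fails validation
rejected = journal.flush()           # "notes" stays staged, "plan" gains a version
```

Because rejected updates stay staged rather than being discarded, a harness built this way could surface them before a reset instead of losing them.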

RESULTS

By the Numbers

Explicit faults per config: 0.0 (-67.8 vs Retrieval)

Explicit faults per config: 0.0 (-1.5 vs Comp-Hybrid)

Thrash index: 0.901 (-77.4% vs Retrieval's 3.993)

Policy overhead p50: <50 μs (median per turn across all workloads)

On four OpenClaw-derived workload families across six token budgets (24 configurations), ClawVM matches the oracle’s 0.0 explicit faults, while the Retrieval baseline averages 67.8 and Comp-Hybrid averages 1.5. The thrash index drops from 3.993 to 0.901, and the policy engine adds a median overhead under 50 μs per turn.
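
The configuration count and the relative thrash reduction follow directly from the figures quoted above. The short check below only restates that arithmetic; the variable names are ours.

```python
# Figures quoted in the results summary.
workload_families = 4
token_budgets = 6
retrieval_thrash = 3.993
clawvm_thrash = 0.901

configurations = workload_families * token_budgets
thrash_reduction_pct = (retrieval_thrash - clawvm_thrash) / retrieval_thrash * 100

print(configurations)                   # 24
print(round(thrash_reduction_pct, 1))   # 77.4
```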

BENCHMARK

Aggregate explicit faults across 4 workloads × 6 budgets

Mean explicit policy-controllable faults per workload–budget configuration.

KEY INSIGHT

The Counterintuitive Finding

ClawVM with a simple LRU upgrade heuristic achieves the same 0.0 explicit faults and 0.901 thrash index as both the full utility-based policy and the oracle.

This is surprising because it suggests that structural features such as auto-pinning, pointer resolution, and lifecycle writeback matter more for safety than sophisticated scoring heuristics.

WHY IT MATTERS

What this unlocks for the field

ClawVM gives builders a deterministic, auditable virtual memory layer where critical pages survive compaction and reset with zero policy-controllable faults across tested budgets.

With ClawVM, developers can treat stateful tool-using agents like systems with real virtual memory, tuning retrieval and compression while the harness guards against silent state loss.
