ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Authors: Mofasshara Rafique, Laurent Bindschaedler

2026

TL;DR

ClawVM adds harness-enforced typed pages with minimum-fidelity invariants and validated writeback, eliminating the 67.8 mean policy-controllable faults per configuration seen with a retrieval-only baseline, at a median policy overhead under 50 μs per turn.

THE PROBLEM

Stateful agents silently lose state: retrieval-only baselines average 67.8 explicit faults per configuration

Stateful tool-using agents repeatedly lose constraints and tool outputs, with retrieval-only baselines averaging 67.8 explicit faults per workload–budget configuration.

Systems like OpenClaw-based assistants then re-run tools, forget bootstrap policies, and drop dirty state on reset, causing duplicate calls, broken protocols, and lost progress mid-plan.

HOW IT WORKS

ClawVM — Harness-managed virtual memory for agent state

ClawVM manages agent state as typed pages under a strict token budget, using five components: a SessionPageTable, RepresentationSelector, FaultObserver, WritebackJournal, and ClawVMEngine.

You can think of ClawVM as an OS-style virtual memory layer where the context window is RAM and durable stores are disk, but enforced by the agent harness.

This lets ClawVM enforce minimum-fidelity invariants, multi-resolution residency, and lifecycle-complete, validated writeback that a plain context window and best-effort compaction cannot guarantee.
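As a rough illustration of this contract, the sketch below models typed pages with a minimum-fidelity floor and checks a proposed resident set against a token budget. The `Fidelity` ordering, the `Page` fields, the page-type strings, and `check_resident_set` are hypothetical names chosen for this example; this summary does not show the actual interfaces of the SessionPageTable or RepresentationSelector.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fidelity(IntEnum):
    # Ordering is an assumption: a higher value keeps more detail in context.
    POINTER = 0
    STRUCTURED = 1
    COMPRESSED = 2
    FULL = 3


@dataclass(frozen=True)
class Page:
    page_id: str
    page_type: str          # e.g. "bootstrap", "tool_output" (illustrative)
    min_fidelity: Fidelity  # the invariant floor for this page
    token_cost: dict        # Fidelity -> tokens at that representation


def check_resident_set(pages, chosen, budget):
    """Return a list of violations for a proposed resident set.

    `chosen` maps page_id -> Fidelity. An empty result means every page
    meets its floor and the total token cost fits the budget.
    """
    violations = []
    total = 0
    for page in pages:
        level = chosen.get(page.page_id)
        if level is None or level < page.min_fidelity:
            violations.append(f"{page.page_id}: below minimum fidelity")
            continue
        total += page.token_cost[level]
    if total > budget:
        violations.append(f"budget exceeded: {total} > {budget}")
    return violations


pages = [
    Page("policy", "bootstrap", Fidelity.FULL, {Fidelity.FULL: 300}),
    Page("search-1", "tool_output", Fidelity.POINTER,
         {Fidelity.POINTER: 10, Fidelity.STRUCTURED: 80,
          Fidelity.COMPRESSED: 200, Fidelity.FULL: 900}),
]

# Fits: policy stays full, the tool output is degraded to a structured form.
ok = check_resident_set(
    pages, {"policy": Fidelity.FULL, "search-1": Fidelity.STRUCTURED}, budget=500)

# Fails twice: the policy page is dropped and the full tool output is too large.
bad = check_resident_set(pages, {"search-1": Fidelity.FULL}, budget=500)
```

The point of the sketch is that the floor is checked by the harness, not left to the model: a resident set that drops a pinned page is rejected even when it would fit the budget.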

DIAGRAM

Paging and writeback lifecycle for a single ClawVM turn

This diagram shows how ClawVM pages state, selects representations, detects faults, and performs validated writeback over a single agent turn.

DIAGRAM

Evaluation pipeline for ClawVM policies and oracle

This diagram shows how ClawVM replays workloads, compares policies to an offline oracle, and measures explicit faults and thrash.

PROCESS

How ClawVM Handles an Agent Session Lifecycle

  1. Prompt assembly and page selection

    ClawVM uses the SessionPageTable and RepresentationSelector to choose a resident set of typed pages and representations within the token budget for each call.

  2. Multi-resolution residency and degradation

    ClawVM maintains pages at full, compressed, structured, or pointer levels, degrading via the RepresentationSelector while respecting minimum-fidelity invariants.

  3. Fault observation and replay logging

    The FaultObserver records refetch, duplicate-tool, bootstrap, flush-miss, and silent-recall faults into the DecisionTrace for deterministic replay and oracle comparison.

  4. Validated writeback at lifecycle boundaries

    The WritebackJournal stages, validates, and commits non-destructive updates whenever compaction, pruning, or reset events occur, ensuring lifecycle-complete durability.
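The stage, validate, and commit cycle in the last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class body, the validation rule (non-empty string keys, non-`None` values), the versioned dictionary store, and `on_lifecycle_event` are all hypothetical.

```python
import copy


class WritebackJournal:
    """Stage updates, validate them, then commit without overwriting history."""

    def __init__(self, store):
        self.store = store   # durable store: key -> list of versions
        self.staged = []     # pending (key, value) updates

    def stage(self, key, value):
        # Copy so later mutation of the caller's object cannot alter the entry.
        self.staged.append((key, copy.deepcopy(value)))

    def validate(self):
        # Assumed rule for this sketch: keys are non-empty strings and
        # values are not None.
        return [
            f"invalid entry at index {i}"
            for i, (key, value) in enumerate(self.staged)
            if not isinstance(key, str) or not key or value is None
        ]

    def commit(self):
        """Append all staged entries if every one validates; else commit none."""
        errors = self.validate()
        if errors:
            return errors
        for key, value in self.staged:
            # Non-destructive: append a new version and keep earlier ones.
            self.store.setdefault(key, []).append(value)
        self.staged.clear()
        return []


def on_lifecycle_event(event, journal):
    # Lifecycle boundaries named in the text: compaction, pruning, reset.
    if event in {"compaction", "pruning", "reset"}:
        return journal.commit()
    return []


store = {"plan": [{"step": 1}]}
journal = WritebackJournal(store)
journal.stage("plan", {"step": 2})
errors = on_lifecycle_event("reset", journal)
```

Committing all-or-nothing and appending versions instead of overwriting are design choices made for this sketch; they are one simple way to get the non-destructive behavior the step describes.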

KEY CONTRIBUTIONS

  • Virtual memory contract for agent state

    ClawVM defines typed pages with minimum-fidelity invariants and multi-resolution representations, enforced via the SessionPageTable and RepresentationSelector under a fixed token budget.

  • Observable fault model and replay oracle

    ClawVM’s FaultObserver and DecisionTrace introduce explicit refetch, duplicate-tool, bootstrap, flush-miss, and silent-recall faults, plus an offline oracle with horizon h=3 for policy analysis.

  • Lifecycle-complete validated writeback

    ClawVM’s WritebackJournal enforces deterministic, non-destructive commit at compaction and reset, cutting mean explicit faults from 67.8 (Retrieval) and 1.5 (Comp-Hybrid) to 0.0 across 24 configurations.
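To make the idea of observable faults concrete, the sketch below counts two of the named fault kinds from a recorded event list. The fault definitions used here are assumptions, since this summary does not spell them out: a duplicate-tool fault is taken to be a repeat of an identical tool call, and a refetch fault is taken to be a fetch of a page that was previously evicted. The event shapes and `count_faults` are hypothetical.

```python
from collections import Counter


def count_faults(trace):
    """Count duplicate-tool and refetch faults in a list of event dicts.

    Assumed event shapes (hypothetical, for this sketch only):
      {"kind": "tool_call", "tool": str, "args": tuple}
      {"kind": "evict", "page": str}
      {"kind": "fetch", "page": str}
    """
    faults = Counter()
    seen_calls = set()
    evicted = set()
    for event in trace:
        if event["kind"] == "tool_call":
            key = (event["tool"], event["args"])
            if key in seen_calls:
                faults["duplicate-tool"] += 1  # same call issued again
            seen_calls.add(key)
        elif event["kind"] == "evict":
            evicted.add(event["page"])
        elif event["kind"] == "fetch":
            if event["page"] in evicted:
                faults["refetch"] += 1         # page brought back after eviction
                evicted.discard(event["page"])
    return faults


trace = [
    {"kind": "tool_call", "tool": "search", "args": ("q",)},
    {"kind": "evict", "page": "search-1"},
    {"kind": "fetch", "page": "search-1"},
    {"kind": "tool_call", "tool": "search", "args": ("q",)},
    {"kind": "fetch", "page": "notes"},
]
faults = count_faults(trace)
```

Because the count depends only on the recorded events, replaying the same trace yields the same fault totals, which is the property that makes comparison against an offline oracle possible.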

RESULTS

By the Numbers

Explicit faults per config: 0.0 (-67.8 vs Retrieval)

Explicit faults per config: 0.0 (-1.5 vs Comp-Hybrid)

Thrash index: 0.901 (-77.4% vs Retrieval thrash index of 3.993)

Policy overhead p50: <50 μs (median per turn across all workloads)

On four OpenClaw-derived workload families across six token budgets, ClawVM matches an oracle’s 0.0 explicit faults while the Retrieval baseline averages 67.8 and Comp-Hybrid 1.5. The thrash index drops from 3.993 to 0.901, and the policy engine adds median <50 μs overhead per turn.
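The configuration count and the relative thrash reduction follow directly from the figures quoted here. The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
workload_families = 4
token_budgets = 6
configurations = workload_families * token_budgets

retrieval_thrash = 3.993
clawvm_thrash = 0.901
reduction_pct = 100 * (retrieval_thrash - clawvm_thrash) / retrieval_thrash

print(configurations)           # 24
print(f"{reduction_pct:.1f}%")  # 77.4%
```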

BENCHMARK

Aggregate explicit faults across 4 workloads × 6 budgets

Mean explicit policy-controllable faults per workload–budget configuration.

KEY INSIGHT

The Counterintuitive Finding

ClawVM with a simple LRU upgrade heuristic achieves the same 0.0 explicit faults and the same thrash index of 0.901 as the full utility-based policy and the oracle.

This is surprising: it suggests that structural features such as auto-pinning, pointer resolution, and lifecycle writeback matter more for safety than sophisticated scoring heuristics.
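This summary does not define the LRU upgrade heuristic, so the sketch below is one plausible reading and not the authors' algorithm: every page is first admitted at its minimum-fidelity floor, and leftover budget is then spent upgrading pages in most-recently-used order. The page dictionaries, integer fidelity levels, and `lru_upgrade` are hypothetical, and token costs are assumed to be non-decreasing with level.

```python
def lru_upgrade(pages, budget):
    """Pick a fidelity level per page and return (levels, tokens_used).

    `pages` is a list of dicts with keys (hypothetical structure):
      "id", "floor" (minimum level), "last_used" (turn number),
      "cost" (token costs indexed by level, lowest detail first).
    """
    chosen = {p["id"]: p["floor"] for p in pages}
    used = sum(p["cost"][p["floor"]] for p in pages)
    if used > budget:
        raise ValueError("minimum-fidelity floors do not fit the budget")

    # Most recently used pages get first claim on the leftover tokens.
    for p in sorted(pages, key=lambda q: q["last_used"], reverse=True):
        level = chosen[p["id"]]
        while level + 1 < len(p["cost"]):
            extra = p["cost"][level + 1] - p["cost"][level]
            if used + extra > budget:
                break
            used += extra
            level += 1
        chosen[p["id"]] = level
    return chosen, used


pages = [
    {"id": "policy", "floor": 3, "last_used": 0, "cost": [5, 40, 120, 300]},
    {"id": "search-1", "floor": 0, "last_used": 7, "cost": [10, 80, 200, 900]},
    {"id": "notes", "floor": 0, "last_used": 9, "cost": [10, 60, 150, 400]},
]
levels, used = lru_upgrade(pages, budget=600)
```

In this reading the recency ordering only decides who gets spare tokens. The floors are satisfied before any upgrade happens, which is consistent with the observation that the structural guarantees, not the scoring rule, carry the safety result.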

WHY IT MATTERS

What this unlocks for the field

ClawVM gives builders a deterministic, auditable virtual memory layer where critical pages survive compaction and reset with zero policy-controllable faults across tested budgets.

With ClawVM, developers can treat stateful tool-using agents like systems with real virtual memory, confidently tuning retrieval and compression without risking silent state loss.
