WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation

Authors: Jiali Cheng, Anjishnu Kumar, Roshan Lal et al.

2025

TL;DR

WebATLAS uses experience-driven cognitive maps plus look-ahead action simulation to reach 63.0% success on WebArena-Lite, +9.1 points over Plan-and-Act.



THE PROBLEM

LLM web agents stall on long-horizon navigation, topping out at 53.9% success

LLM web agents often struggle with long-horizon web navigation and task completion on new websites, producing inefficient action sequences unless fine-tuned on environment-specific data.

Systems like Plan-and-Act reach only 53.9% success on WebArena-Lite, leaving many realistic multi-step tasks unsolved and risking irreversible actions and dead-end states on complex web interfaces.

HOW IT WORKS

WebATLAS — Actor-Critic Task Completion with Look-ahead Action Simulation

WebATLAS centers on a Planner, Actor, Critic, and a multi-layered memory comprising Working Memory, a Cognitive Map, and Semantic Memory that together guide web actions.

You can think of WebATLAS as a brain with short-term RAM, a long-term map of the site, and a mental simulator that plays out moves before acting.

This design lets WebATLAS reuse past experiences, simulate multi-step futures, and avoid risky actions that a plain context window or a purely reactive agent would miss.
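The memory layers described above can be pictured as simple data structures. This is a minimal sketch under our own naming assumptions (`CognitiveMap`, `Memory`, `record`, `predict` are all illustrative); the paper does not publish code, and its actual memory is built from LLM-summarized experience rather than exact string keys.

```python
from dataclasses import dataclass, field

@dataclass
class CognitiveMap:
    """Graph of observed page transitions: (state, action) -> next state.

    Names and structure are assumptions for illustration only.
    """
    transitions: dict = field(default_factory=dict)

    def record(self, state: str, action: str, next_state: str) -> None:
        self.transitions[(state, action)] = next_state

    def predict(self, state: str, action: str):
        # Return the remembered outcome, or None if this move was never tried.
        return self.transitions.get((state, action))

@dataclass
class Memory:
    working: list = field(default_factory=list)       # recent observations/actions
    cognitive_map: CognitiveMap = field(default_factory=CognitiveMap)
    semantic: list = field(default_factory=list)      # summarized environment rules

mem = Memory()
mem.cognitive_map.record("home", "click:login", "login_page")
assert mem.cognitive_map.predict("home", "click:login") == "login_page"
```

The key design point is that the cognitive map stores transitions, not just pages, so a simulator can chain them into multi-step rollouts.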

DIAGRAM

Look-ahead Action Simulation loop in WebATLAS

This diagram shows how WebATLAS simulates candidate actions using the cognitive map before choosing a real web action.

DIAGRAM

WebATLAS evaluation and ablation pipeline

This diagram shows how WebATLAS is evaluated on WebArena-Lite and how ablations modify components like the cognitive map and planner.

PROCESS

How WebATLAS Handles a WebArena-Lite Task

  1. Planner

     The Planner analyzes the natural-language goal and initial observation to produce a structured plan P0 with subtasks and success predicates.

  2. Actor

     The Actor conditions on the current plan, observation, and multi-layered memory to propose a small set of executable next-step candidates Ct.

  3. Critic

     The Critic uses the cognitive map and semantic memory to simulate outcomes, compute V(a), and select the safest goal-advancing action.

  4. Look-ahead Action Simulation

     Look-ahead Action Simulation rolls out candidate actions to depth D over the cognitive map, updates plans via dynamic replanning, and writes new experience back into memory.
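The four steps above can be sketched as one control loop. Everything here is a hypothetical stand-in: `planner`, `actor`, and `critic` represent LLM calls, `env` the browser environment, and the function signatures are our assumptions rather than the paper's implementation.

```python
# Illustrative sketch of the WebATLAS-style loop, assuming callable stand-ins
# for the LLM components and a gym-like browser environment.

def run_episode(goal, env, memory, planner, actor, critic, max_steps=20):
    obs = env.reset()
    plan = planner(goal, obs)                       # structured plan P0
    for _ in range(max_steps):
        candidates = actor(plan, obs, memory)       # candidate next actions C_t
        # Critic scores each candidate by simulating it on the cognitive map.
        best = max(candidates, key=lambda a: critic(a, plan, memory))
        obs, done = env.step(best)
        memory.record(obs, best)                    # write new experience
        if done:
            return True
        plan = planner(goal, obs)                   # dynamic replanning
    return False
```

The separation matters: the Actor only has to propose a few plausible moves, while the Critic's simulated rollouts decide which one is safe, which is how the agent avoids irreversible actions.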

KEY CONTRIBUTIONS

Key Contributions

  • Actor-critic planner with LLM-based look-ahead

    The Actor proposes candidate actions and the Critic uses look-ahead simulation with a value function V(a) to choose safe ones, reaching 63.0% success on WebArena-Lite.

  • Multi-layered memory with a cognitive map

    WebATLAS builds a multi-layered memory, comprising Working Memory, a Cognitive Map of page transitions, and Semantic Memory of environment rules, populated via curiosity-driven exploration and agentic summarization.

  • Modular architecture without fine-tuning

    WebATLAS integrates planning, memory, and simulation in a modular architecture that handles long-horizon web tasks without any website-specific LLM fine-tuning.
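As a rough sketch of the look-ahead in the first contribution, a depth-D rollout over the cognitive map might look like this. The `reward` function, the `actions_from` method, and the greedy backup are all assumptions for illustration; in the paper V(a) comes from LLM-based simulation, not this explicit recursion.

```python
# Hypothetical depth-D look-ahead over a cognitive map. The map API and
# reward signal are assumptions, not the paper's actual V(a) computation.

def lookahead_value(state, action, cmap, reward, depth):
    """Estimate V(a): predict where `action` leads from `state` using the
    cognitive map, then greedily roll forward up to depth-1 further steps."""
    nxt = cmap.predict(state, action)
    if nxt is None:                  # unexplored transition: neutral value
        return 0.0
    value = reward(nxt)
    if depth > 1:
        futures = [lookahead_value(nxt, a, cmap, reward, depth - 1)
                   for a in cmap.actions_from(nxt)]
        if futures:
            value += max(futures)    # assume the agent keeps acting greedily
    return value
```

A rollout that ends at a remembered dead end scores low, so the Critic can rank it below a slower but safer candidate before anything touches the real page.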

RESULTS

By the Numbers

Avg w/ multi-site: 63.0% (+9.1 over Plan-and-Act)

Avg w/o multi-site: 67.1% (+9.6 over Plan-and-Act)

GitLab: 73.3% (+20.0 over AgentOccam)

Shopping Admin: 77.1% (+22.8 over Plan-and-Act)

These metrics come from WebArena-Lite, a 165-task benchmark for realistic multi-step web navigation across GitLab, Reddit, shopping, admin, maps, and multi-site tasks. The 63.0% average success shows that WebATLAS converts experience-driven memory and look-ahead simulation into a +9.1-point gain over Plan-and-Act without environment-specific training.

BENCHMARK

Evaluation Results for WebATLAS versus Other Methods on WebArena-Lite

Average success rate with multi-site tasks on WebArena-Lite.

BENCHMARK

Ablation Study Results for Individual Components of WebATLAS

Average success rate with multi-site tasks for WebATLAS ablations on WebArena-Lite.

KEY INSIGHT

The Counterintuitive Finding

Adding a raw-HTML cognitive map (Base + CM-Raw) to WebATLAS actually reduced average success from 47.9% to 44.8% on WebArena-Lite.

This is surprising because richer state storage seems like it should help; instead, it shows that unfiltered raw HTML can overload LLM reasoning, and that agentic summarization of stored experience is crucial.

WHY IT MATTERS

What this unlocks for the field

WebATLAS unlocks web agents that can build and reuse cognitive maps, simulate future action sequences, and adapt to new sites without retraining.

Builders can now deploy long-horizon web agents that safely explore, remember hazards, and refine plans online, making realistic multi-step automation practical across diverse websites.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al. · 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.
