OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu et al.

arXiv 2026

TL;DR

OS-SYMPHONY uses a Reflection-Memory Agent plus a Multimodal Searcher to reach 65.84% on OSWorld, +3.21 points over Agent S3 w/ GPT-5.

THE PROBLEM

CUA frameworks lose visual context and lack visual-aware tutorials, capping OSWorld at 62.63%

Existing CUA frameworks offer no fine-grained control over which historical screenshots are kept and which are pruned, so long-horizon tasks either lose important visual context or accumulate redundant frames.

RAG modules often rely on text-only retrieval or local knowledge bases, which leads to noisy text, costly updates, and poor generalization to unseen GUI workflows such as those in OSWorld and WindowsAgentArena.

HOW IT WORKS

OS-SYMPHONY — Orchestrator plus Reflection-Memory and Multimodal Search

OS-SYMPHONY centers an Orchestrator that coordinates a Reflection-Memory Agent, Multimodal Searcher, Grounders, and Coder to decide each GUI action.

You can think of OS-SYMPHONY like a brain with RAM and long-term memory plus a web-savvy assistant that reads visual tutorials instead of plain manuals.

This design lets OS-SYMPHONY retain milestone screenshots, audit trajectories, and pull live multimodal tutorials, going far beyond what a plain context window or text-only RAG can handle.
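The idea of keeping milestone screenshots alongside a short recent window can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Step` and `MilestoneMemory` names, the `window` parameter, and the use of strings as screenshot references are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Step:
    index: int
    summary: str            # text summary of what the step did
    screenshot: str         # stand-in for an image reference (path or id)
    is_milestone: bool = False


class MilestoneMemory:
    """Hypothetical memory: a bounded recent window plus retained milestones."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)  # short-term history, oldest frames drop out
        self.milestones = []                # long-term memory, kept for the whole task
        self.summaries = []                 # text summaries are cheap, so keep all of them

    def add(self, step: Step) -> None:
        self.summaries.append(step.summary)
        self.recent.append(step)
        if step.is_milestone:
            self.milestones.append(step)

    def visual_context(self) -> list:
        """Screenshots to show the model: milestones first, then recent frames, no duplicates."""
        seen, frames = set(), []
        for step in [*self.milestones, *self.recent]:
            if step.index not in seen:
                seen.add(step.index)
                frames.append(step.screenshot)
        return frames
```

With `window=2` and milestones at steps 1 and 4 of a six-step run, the visual context holds three frames (steps 1, 4, and 5) instead of six, while every step's text summary is still available.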

DIAGRAM

Step-by-step trajectory with Reflection-Memory and Multimodal Search

This diagram shows how OS-SYMPHONY processes each step, updates milestone-driven memory, and invokes the Multimodal Searcher when the Reflection-Memory Agent flags errors.

DIAGRAM

Evaluation and Ablation Design for OS-SYMPHONY

This diagram shows how OS-SYMPHONY is evaluated across OSWorld, WindowsAgentArena, and MacOSArena and how ablations remove Searcher and Reflection-Memory components.

PROCESS

How OS-SYMPHONY Handles a Long-Horizon OSWorld Task

01
Orchestrator
The Orchestrator reads the instruction and recent short-term history, then decides whether to call Grounders, Coder, Multimodal Searcher, or act directly.
02
Reflection-Memory Agent
The Reflection-Memory Agent summarizes each step, marks milestone screenshots, updates long-term memory, and outputs trajectory-level reflections and error types.
03
Multimodal Searcher
When the reflection reports a "Lack of Tutorial" error, the Multimodal Searcher runs a See Act loop in a sandbox, then returns a structured visual tutorial to the Orchestrator.
04
Trajectory-Level Reflection
Using milestone-driven long-term memory, OS-SYMPHONY classifies the state as On-track, Completed, Infeasible, or Off-track and guides future planning until task completion.
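The four steps above can be read as a control loop. The sketch below is a hedged illustration only: the four status values and the "Lack of Tutorial" error type come from the summary, while `run_episode`, the callback signatures, and the choice to search at most once per task are assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    ON_TRACK = "On-track"
    COMPLETED = "Completed"
    INFEASIBLE = "Infeasible"
    OFF_TRACK = "Off-track"


LACK_OF_TUTORIAL = "Lack of Tutorial"


def run_episode(act, reflect, search, max_steps=100):
    """Hypothetical loop tying the Orchestrator, reflection, and search together.

    act(tutorial)    -> observation for one GUI step (Orchestrator plus tools)
    reflect(history) -> (Status, error_type or None) from the Reflection-Memory Agent
    search()         -> tutorial from the Multimodal Searcher
    """
    history, tutorial = [], None
    status = Status.ON_TRACK
    for _ in range(max_steps):
        history.append(act(tutorial))
        status, error = reflect(history)
        if status in (Status.COMPLETED, Status.INFEASIBLE):
            break  # terminal states end the episode
        if error == LACK_OF_TUTORIAL and tutorial is None:
            tutorial = search()  # assumption: fetch a tutorial at most once
    return status, len(history)
```

Passing the agents in as plain callables keeps the loop testable with stubs; a real system would also feed the reflection text back into planning rather than only the tutorial.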

KEY CONTRIBUTIONS

01
OS-SYMPHONY holistic CUA framework
OS-SYMPHONY introduces a modular framework where an Orchestrator coordinates a Reflection-Memory Agent and Versatile Tool Agents to solve complex GUI tasks across operating systems.
02
Reflection-Memory Agent for long-term milestones
OS-SYMPHONY’s Reflection-Memory Agent builds milestone-driven long-term memory and structured auditing, enabling trajectory-level reflection for robust long-horizon planning.
03
Multimodal Searcher for visual-aware tutorials
OS-SYMPHONY’s Multimodal Searcher performs VLM-driven See Act browsing to synthesize step-by-step multimodal tutorials, boosting Daily domain success by 22.1% over the variant without Search.
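One way to picture the Multimodal Searcher's output is a tutorial whose steps pair an instruction with the screenshot it was read from. The sketch below assumes this shape; `Tutorial`, `TutorialStep`, `see_act_search`, the `"record"` and `"stop"` action names, and the `max_turns` budget are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TutorialStep:
    instruction: str   # what to do at this point in the workflow
    screenshot: str    # stand-in for the image the instruction was taken from


@dataclass
class Tutorial:
    query: str
    steps: list = field(default_factory=list)


def see_act_search(query, see, decide, max_turns=5):
    """Hypothetical See Act loop run inside a sandboxed browser.

    see()                     -> screenshot reference for the current page
    decide(query, screenshot) -> (action, payload) chosen by a VLM
    """
    tutorial = Tutorial(query)
    for _ in range(max_turns):
        shot = see()
        action, payload = decide(query, shot)
        if action == "record":
            # keep the text together with the frame it came from
            tutorial.steps.append(TutorialStep(payload, shot))
        elif action == "stop":
            break
        # any other action (click, scroll, ...) would be executed in the sandbox here
    return tutorial
```

Keeping the screenshot next to each instruction is what distinguishes this from text-only retrieval: the Orchestrator can compare the tutorial's frames with what is currently on screen.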

RESULTS

By the Numbers

Avg. success rate

65.84%

+3.21 over Agent S3 w/ GPT-5 (62.63%) on OSWorld at 100 steps

Workflow success

69.23%

+7.86 over Agent S3 w/ GPT-5 (61.37%) on OSWorld Workflow domain

WindowsArena avg.

63.5%

+6.9 over Agent S3 w/ GPT-5 (56.6%) at 100 steps

MacOSArena avg.

46.0%

+38.0 over previous best baseline reported for MacOSArena

OS-SYMPHONY is evaluated on OSWorld-Verified, WindowsAgentArena, and MacOSArena, which test multi-application, long-horizon GUI workflows. The 65.84% OSWorld score puts OS-SYMPHONY 3.21 points ahead of Agent S3 w/ GPT-5 (62.63%), and the gains on WindowsAgentArena and MacOSArena indicate that its reflection and multimodal search hold up across operating systems.
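The reported gains can be checked against the stated baselines with simple subtraction. The snippet below uses only the scores and deltas quoted above; the dictionary layout and the `implied_baseline` helper are illustrative. The MacOSArena baseline is not stated in this summary, so the value computed for it is only what the +38.0 gain implies.

```python
# (score, reported gain over the baseline), both in percentage points
results = {
    "OSWorld avg. (100 steps)": (65.84, 3.21),
    "OSWorld Workflow": (69.23, 7.86),
    "WindowsAgentArena avg.": (63.5, 6.9),
    "MacOSArena avg.": (46.0, 38.0),
}


def implied_baseline(score: float, gain: float) -> float:
    """Baseline implied by a score and its reported gain, rounded to 2 decimals."""
    return round(score - gain, 2)


for name, (score, gain) in results.items():
    print(f"{name}: {score} - {gain} = {implied_baseline(score, gain)}")
```

The first three results reproduce the Agent S3 w/ GPT-5 baselines quoted in the cards (62.63, 61.37, and 56.6).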

BENCHMARK

Main results of OS-SYMPHONY on OSWorld

Average Step Success Rate (%) on OSWorld-Verified at 100-step or 50-step limits.

BENCHMARK

Ablation study on OSWorld with GPT-5-Mini

Average success rate (%) on OSWorld Daily and Workflow domains under different OS-SYMPHONY ablations.

KEY INSIGHT

The Counterintuitive Finding

OS-SYMPHONY with GPT-5-Mini reaches 58.05% on OSWorld, just 3.56 points behind the 61.61% lower end of the 61.61% to 65.84% range reported for OS-SYMPHONY with GPT-5 (and 7.79 points behind the upper end).

This is surprising because GPT-5-Mini is much weaker and cheaper, yet OS-SYMPHONY’s Reflection-Memory Agent and Multimodal Searcher nearly erase the capability gap using external multimodal knowledge.

WHY IT MATTERS

What this unlocks for the field

OS-SYMPHONY shows that careful reflection plus multimodal search can turn mid-tier VLMs into strong, generalist computer-using agents across OSes.

Builders can now deploy affordable agents that survive long-horizon workflows, detect loops, and fetch visual tutorials on the fly instead of relying on massive proprietary models or brittle static RAG.
