OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie^1, Danyang Zhang^1, Jixuan Chen^1, Xiaochuan Li^1, Siheng Zhao^1, Ruisheng Cao^1, Toh Jing Hua^1, Zhoujun Cheng^1, Dongchan Shin^1, Fangyu Lei^1, Yitao Liu^1, Yiheng Xu^1, Shuyan Zhou^3, Silvio Savarese^2, Caiming Xiong^2, Victor Zhong^4, Tao Yu^1
^1The University of Hong Kong, ^2Salesforce Research, ^3Carnegie Mellon University, ^4University of Waterloo

Paper | Code | Data | Data Viewer | Slides | Twitter | Discord

[task_demonstration_fig]

**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems. It serves as a unified environment for evaluating open-ended computer tasks involving arbitrary apps (e.g., the task examples in the figure above). We also create a benchmark of 369 real-world computer tasks in **OSWorld**, each with a reliable, reproducible setup and evaluation script.

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use and thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce **OSWorld**, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems such as Ubuntu, Windows, and macOS. **OSWorld** can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon **OSWorld**, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on **OSWorld** reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using **OSWorld** provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks.

OSWorld Environment Infrastructure

[environment_infrastructure_fig]

The **OSWorld** environment uses a configuration file for initializing tasks *(highlighted in red)*, agent interaction, post-processing upon agent completion *(highlighted in orange)*, retrieving files and information *(highlighted in yellow)*, and executing the evaluation function *(highlighted in green)*. The corresponding configuration items are highlighted in colors that match their respective components within the environment.
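For concreteness, here is a minimal interaction sketch in the style of the quickstart in the OSWorld code repository; the module path, configuration fields, and `step` signature are assumptions based on that pattern and may differ from the released version.

```python
# Minimal interaction sketch (assumed API following the OSWorld repo's quickstart;
# check the released code for the exact module path and signatures).
from desktop_env.desktop_env import DesktopEnv

# A task configuration bundles the natural-language instruction, the initial-state
# setup steps, and the execution-based evaluator (contents elided here for brevity).
example_task = {
    "id": "example-task",
    "instruction": "I want to install Spotify on my current system. Could you please help me?",
    "config": [],      # setup steps: files to download, apps to open, commands to run
    "evaluator": {},   # execution-based evaluation function and its arguments
}

env = DesktopEnv(action_space="pyautogui")   # agent actions are pyautogui code strings
obs = env.reset(task_config=example_task)    # restore the task's initial VM state
obs, reward, done, info = env.step("pyautogui.rightClick()")  # execute one action
```

The same reset/step loop drives the agent baselines reported below, with the task's evaluation script scoring the final state of the machine.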
Environments can run in parallel on a single host machine for learning or evaluation purposes, and headless operation is supported.

Data Statistics and Comparison

Below we present the main statistics of **OSWorld**, showcasing the broad spectrum of tasks it covers. **OSWorld** contains a total of 369 tasks (plus an additional 43 Windows-based tasks used for analysis).

[data_overview_fig] Key statistics of OSWorld. "Supp. tasks" refers to the Windows-based tasks, which can only be used after activation due to copyright restrictions.

[data_composition_fig] Distribution of task instructions in OSWorld by app domain and operation type.

We compare **OSWorld** against other benchmarks for digital agents below. **The columns indicate:** whether they provide a controllable executable environment *(Control. Exec. Env.)*, the ease of adding new tasks involving arbitrary applications in open domains *(Environment Scalability)*, support for multimodal agent evaluation *(Multimodal Support)*, support for and inclusion of cross-app tasks *(Cross-App)*, the capability to start tasks from an intermediate initial state *(Intermediate Init. State)*, and the number of execution-based evaluation functions *(# Exec.-based Eval. Func.)*. The benchmark sizes, executable environment types, and evaluation function counts are:

| Benchmark | Size | Exec. Env. | # Exec.-based Eval. Func. |
| --- | --- | --- | --- |
| GAIA | 466 | - | 0 |
| Mind2Web | 2350 | - | 0 |
| WebLINX | 2337 | - | 0 |
| PixelHelp | 187 | - | 0 |
| MetaGUI | 1125 | - | 0 |
| AitW | 30k | - | 0 |
| OmniAct | 9802 | - | 0 |
| AgentBench | 1091 | Multi-isolated | 7 |
| InterCode | 1350 | Code | 3 |
| MiniWoB++ | 104 | Web | 104 |
| WebShop | 12k | Web | 1 |
| WebArena | 812 | Web | 5 |
| VisualWebArena | 910 | Web | 6 |
| WikiHow | 150 | Mobile | 16 |
| AssistGUI | 100 |  | 2 |
| **OSWorld** | 369 | Computer | 134 |

Benchmark

We adopt state-of-the-art LLMs and VLMs as agent baselines on **OSWorld**, including open-source representatives such as Mixtral and CogAgent and closed-source models from the GPT, Gemini, and Claude families. We also explore methods such as the Set-of-Mark aided approach, which has been demonstrated to improve spatial capabilities for visual reasoning (a sketch of the idea is given below). **We are actively updating the benchmark with new LLMs, VLMs, and methods. Pull requests are welcome!**
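As a rough sketch of the Set-of-Mark idea (the general technique, not OSWorld's exact implementation; the accessibility-tree node format and helper name below are illustrative), interactive elements from the accessibility tree are drawn onto the screenshot as numbered marks, so the model can refer to an element by its index rather than by raw pixel coordinates:

```python
# Sketch of Set-of-Mark annotation: draw numbered boxes for interactive a11y-tree
# nodes onto a screenshot so the model can answer with an element index.
# This mirrors the general technique, not OSWorld's exact code.
from PIL import Image, ImageDraw

def draw_som_marks(screenshot_path, nodes, out_path="som_screenshot.png"):
    """`nodes` is a list of dicts like {"name": "OK button", "bbox": (x1, y1, x2, y2)}
    extracted from the accessibility tree (format assumed for illustration)."""
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    index_to_node = {}
    for idx, node in enumerate(nodes, start=1):
        x1, y1, x2, y2 = node["bbox"]
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)  # element box
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")         # numeric mark
        index_to_node[idx] = node
    image.save(out_path)
    # The mapping lets an action like "click mark 7" be resolved back to coordinates.
    return index_to_node
```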
Results are reported under four input settings: A11y tree, Screenshot, Screenshot + A11y tree, and Set-of-Mark.

Notice: t = temperature, top-p = top-p cutoff, len = max context length

A11y tree

| Rank | Model | Organization | Details | Score | Date |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=128k | 12.24 | Mar 20, 2024 |
| 2 | GPT-4 Vision (0409) (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 10.82 | April 23, 2024 |
| 3 | Mixtral-8x7B (Jiang et al., '24) | MistralAI | t=1.0, top-p=0.9, len=32k | 2.98 | Mar 20, 2024 |
| 4 | GPT-3.5 (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=16,385 | 2.69 | Mar 20, 2024 |
| 5 | Gemini-Pro (Gemini Team, Google, '23) | Google | t=1.0, top-p=0.9, len=32k | 2.37 | Mar 20, 2024 |

Screenshot

| Rank | Model | Organization | Details | Score | Date |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini-Pro Vision (Gemini Team, Google, '23) | Google | t=1.0, top-p=0.9, len=32k | 5.80 | Mar 20, 2024 |
| 2 | GPT-4 Vision (0409) (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 5.40 | April 23, 2024 |
| 3 | GPT-4 Vision (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 5.26 | Mar 20, 2024 |
| 4 | Claude-3-Opus (Anthropic, '24) | AnthropicAI | t=1.0, top-p=0.9, len=200k | 2.42 | Mar 20, 2024 |
| 5 | CogAgent (Hong et al., '23) | Tsinghua University & Zhipu AI | t=1.0, top-p=0.9 | 1.11 | Mar 20, 2024 |

Screenshot + A11y tree

| Rank | Model | Organization | Details | Score | Date |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Vision (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 12.17 | Mar 20, 2024 |
| 2 | GPT-4 Vision (0409) (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 9.04 | April 23, 2024 |
| 3 | Claude-3-Opus (Anthropic, '24) | AnthropicAI | t=1.0, top-p=0.9, len=200k | 4.41 | Mar 20, 2024 |
| 4 | Gemini-Pro Vision (Gemini Team, Google, '23) | Google | t=1.0, top-p=0.9, len=32k | 3.48 | Mar 20, 2024 |
| 5 | CogAgent (Hong et al., '23) | Tsinghua University & Zhipu AI | t=1.0, top-p=0.9 | 1.32 | Mar 20, 2024 |

Set-of-Mark

| Rank | Model | Organization | Details | Score | Date |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Vision (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 11.77 | Mar 20, 2024 |
| 2 | GPT-4 Vision (0409) (OpenAI, '23) | OpenAI | t=1.0, top-p=0.9, len=32k | 8.40 | April 23, 2024 |
| 3 | Claude-3-Opus (Anthropic, '24) | AnthropicAI | t=1.0, top-p=0.9, len=200k | 6.72 | Mar 20, 2024 |
| 4 | Gemini-Pro Vision (Gemini Team, Google, '23) | Google | t=1.0, top-p=0.9, len=32k | 1.06 | Mar 20, 2024 |
| 5 | CogAgent (Hong et al., '23) | Tsinghua University & Zhipu AI | t=1.0, top-p=0.9 | 0.99 | Mar 20, 2024 |

Analysis

We conduct a qualitative analysis of models, methods, and humans to identify the factors that influence the performance of VLMs on digital agent tasks and their underlying behavioral logic. We investigate the impact of task attributes *(such as difficulty, feasibility, visual requirement, and GUI complexity)* and input measurements *(such as screenshot resolution, the influence of trajectory history, and the effect of UI layout)*, and explore whether there are patterns in the agent's performance across different operating systems. An overview of our findings:

[down_sampling_exp_fig] Higher screenshot resolution typically leads to improved performance (a downsampling sketch follows this list).

[traj_length_effect_fig] Longer text-based trajectory history improves performance, unlike screenshot-only history, but poses efficiency challenges.

[disturb_effect_fig] Current VLM agents are not robust to UI layout changes and noise.

[cross_os_effect_fig] The performance of VLM agents is strongly correlated across different operating systems, implying that insights and methodologies developed within the **OSWorld** framework can be transferred to Windows environments with a high degree of reliability.
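To make the screenshot-resolution experiment concrete, here is a sketch of the kind of downsampling involved; the target resolution and resampling filter are illustrative, not the exact settings used in the OSWorld analysis.

```python
# Sketch: downsample a full-resolution screenshot before sending it to a VLM,
# trading visual detail (and GUI-grounding accuracy) for fewer image tokens.
from PIL import Image

def downsample_screenshot(path, target_size=(1280, 720), out_path="downsampled.png"):
    """Resize a full-resolution screenshot (e.g., 1920x1080) to `target_size`."""
    image = Image.open(path)
    resized = image.resize(target_size, Image.LANCZOS)  # high-quality downsampling
    resized.save(out_path)
    return out_path
```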
[success_case_fig] A success case of the LLM/VLM agent baselines.

Acknowledgement

We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.

FAQ

Q: What are the running times and costs under different settings?
A:

| Setting | Expected Time* | Budget Cost (Full Test Set / Small Test Set) |
| --- | --- | --- |
| GPT-4V (screenshot) | 10h | $100 ($10) |
| Gemini-ProV (screenshot) | 15h | 0 (0) |
| Claude-3 Opus (screenshot) | 15h | $150 ($15) |
| GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |

*No environment parallelism. Calculated in April 2024.

BibTeX

@misc{OSWorld,
  title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
  author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
  year={2024},
  eprint={2404.07972},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. You are free to borrow the source code of this website; we just ask that you link back to this page in the footer.