Benchmark · agentic

OSWorld

Real OS desktop tasks across Linux/macOS/Windows.

1 result · 1 model
[Chart: score by date, y-axis 0 to 54; single data point: three base models · 50 · 2026-06-18]
Timeline
  1. 2026-06-18 · three base models · 50.0% · Skill-Guided Continuation Distillation for GUI Agents