VISUALSKILL: Multimodal Skills for Computer-Use Agents

Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction.

We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic’s text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration.
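To make the index-plus-topic-files layout concrete, here is a minimal Python sketch of on-demand topic loading. It is an illustration only, not the released implementation: the `index.json` file, its schema, the `topics/` directory, the `TopicContent` type, and the plain-function form of `load_topic` are all assumptions made for this example, and the MCP transport layer is omitted.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TopicContent:
    """Text and figure paths returned for one topic (hypothetical shape)."""
    text: str
    figures: list[Path]


def load_topic(skill_root: Path, topic: str) -> TopicContent:
    """Look up `topic` in the central index and fetch its text and figures."""
    # Assumed layout: index.json maps topic name -> {"text": ..., "figures": [...]}.
    index = json.loads((skill_root / "index.json").read_text(encoding="utf-8"))
    if topic not in index:
        raise KeyError(f"unknown topic {topic!r}; known: {sorted(index)}")
    entry = index[topic]
    text = (skill_root / entry["text"]).read_text(encoding="utf-8")
    figures = [skill_root / name for name in entry["figures"]]
    return TopicContent(text=text, figures=figures)


def build_demo_skill(root: Path) -> None:
    """Write a tiny illustrative skill: one index plus one topic file."""
    topic_dir = root / "topics"
    topic_dir.mkdir(parents=True)
    (topic_dir / "export.md").write_text(
        "Open File > Export, then pick a format.", encoding="utf-8"
    )
    # Empty placeholder standing in for a real screenshot.
    (topic_dir / "export_dialog.png").write_bytes(b"")
    index = {
        "export": {
            "text": "topics/export.md",
            "figures": ["topics/export_dialog.png"],
        }
    }
    (root / "index.json").write_text(json.dumps(index), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "demo_skill"
    build_demo_skill(demo_root)
    demo = load_topic(demo_root, "export")
    demo_figure_names = [p.name for p in demo.figures]
```

Only the index needs to be read up front; a topic's text and figures are read when that topic is requested, which is the on-demand behaviour the abstract describes.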

On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303).

Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.456 vs. 0.373). This provides direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at this https URL.
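The reported lifts follow directly from the three average scores stated above. As a quick arithmetic check, the short Python sketch below recomputes them; the dictionary keys and the `point_lift` helper are names chosen for this example only.

```python
# Average scores as reported in the abstract.
scores = {"no_skill": 0.303, "text_only": 0.373, "visualskill": 0.456}


def point_lift(new: float, base: float) -> float:
    """Absolute difference between two scores, in points (score x 100)."""
    return round((new - base) * 100, 1)


lift_vs_baseline = point_lift(scores["visualskill"], scores["no_skill"])
lift_vs_text_only = point_lift(scores["visualskill"], scores["text_only"])
text_only_vs_baseline = point_lift(scores["text_only"], scores["no_skill"])
```

The first two values reproduce the +15.3 and +8.3 point figures. The third, 7.0 points, is the text-only skill's gain over the no-skill baseline implied by the same numbers.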
