SentinelBench: A Benchmark for Long-Running Monitoring Agents

AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention.

Instead, agents should monitor an environment without wasting resources while they wait, notice when an external event makes progress possible, and then respond promptly. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks.
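One way to picture "waiting without wasting resources" is a polling loop whose interval grows while nothing changes. The sketch below is purely illustrative: the function name `monitor`, its parameters, and the exponential backoff policy are assumptions made for this example, not a strategy prescribed by SentinelBench or used by the evaluated agents.

```python
from typing import Callable


def monitor(check: Callable[[], bool],
            sleep: Callable[[float], None],
            initial_delay: float = 1.0,
            max_delay: float = 60.0,
            max_checks: int = 100) -> int:
    """Poll `check` until it returns True.

    Returns the number of checks performed, or -1 if `max_checks`
    was exhausted. `sleep` is injected so the loop can be driven by a
    simulated clock (hypothetical design choice for testability).
    """
    delay = initial_delay
    for n in range(1, max_checks + 1):
        if check():
            return n  # the awaited event has occurred; act now
        sleep(delay)  # stay idle instead of issuing more actions
        delay = min(delay * 2, max_delay)  # back off, up to a cap
    return -1
```

With a simulated clock and an event at t = 10 s, this loop checks at t = 0, 1, 3, 7, and 15 s: five checks in total and a 5-second reaction delay. A shorter cap on the delay would react faster at the price of more checks, which is the kind of tradeoff the benchmark is meant to expose.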

SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot.

SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically affect key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.
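To make the three measured quantities concrete, here is a minimal sketch of how they could be aggregated over a set of runs. The text above does not give the benchmark's exact metric definitions, so everything here is an assumption: the `Episode` record, its field names, the use of action counts as a proxy for resource use, and reaction time taken as completion time minus event time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Episode:
    """Hypothetical per-task record (not the benchmark's actual schema)."""
    event_time: float                 # when the scripted event occurred (s)
    completion_time: Optional[float]  # when the agent finished, None if never
    actions: int                      # agent actions taken (resource proxy)


def summarize(episodes: List[Episode]) -> Dict[str, Optional[float]]:
    """Aggregate completion rate, mean reaction time, and mean actions."""
    if not episodes:
        return {"completion_rate": 0.0, "mean_reaction_s": None,
                "mean_actions": 0.0}
    done = [e for e in episodes if e.completion_time is not None]
    # Reaction time is only defined for tasks the agent completed.
    mean_reaction = (
        sum(e.completion_time - e.event_time for e in done) / len(done)
        if done else None
    )
    return {
        "completion_rate": len(done) / len(episodes),
        "mean_reaction_s": mean_reaction,
        "mean_actions": sum(e.actions for e in episodes) / len(episodes),
    }
```

Under these assumptions, an agent that completes tasks within a second of the event but takes 40 actions per task and one that reacts in 5 seconds with 5 actions sit at different points on the responsiveness-versus-cost curve, even when their completion rates match.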