Your Agent Just Handled That SEV2. Now What?

The outage isn’t the hard part. What’s hard is the minute after the alert fires. Who’s leading? Which Slack channel? What do we tell support? The fix is usually quick. The coordination is what kills you.

What agents actually do today

Teams are starting to route coordination work to agents. An engineer gets a Datadog alert; the agent checks who’s on call, acknowledges the alert, pulls recent deploys, pages the right person, and logs everything. The engineer never leaves the editor. The agent didn’t fix anything. It removed the 20 minutes of friction before anyone starts debugging.

One engineer told me about their first agent-handled SEV2: “I woke up to a full picture. Agent had pulled the deploy from 20 minutes ago, logged the spike, paged me. I fixed it in ten minutes instead of spending thirty figuring out what was happening.”
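As a rough illustration of that coordination sequence, here is a minimal Python sketch. Everything in it is hypothetical: the `coordinate` function, the `IncidentContext` type, and the four callables stand in for whatever on-call, alerting, and deploy systems a team actually uses, and none of it is Runframe’s or Datadog’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentContext:
    """What the on-call engineer should see when they open the page."""
    alert_id: str
    on_call: str = ""
    recent_deploys: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)


def coordinate(
    alert_id: str,
    lookup_on_call: Callable[[], str],            # hypothetical integration
    acknowledge: Callable[[str], None],           # hypothetical integration
    list_recent_deploys: Callable[[], List[str]],  # hypothetical integration
    page: Callable[[str, str], None],             # hypothetical integration
) -> IncidentContext:
    """Run the pre-debugging steps in order and log each one."""
    ctx = IncidentContext(alert_id=alert_id)

    ctx.on_call = lookup_on_call()
    ctx.timeline.append(f"on-call is {ctx.on_call}")

    acknowledge(alert_id)
    ctx.timeline.append("alert acknowledged")

    ctx.recent_deploys = list_recent_deploys()
    ctx.timeline.append(f"found {len(ctx.recent_deploys)} recent deploy(s)")

    # Page with context attached, so the responder does not start from zero.
    deploys_text = ", ".join(ctx.recent_deploys) or "none"
    page(ctx.on_call, f"{alert_id}: recent deploys: {deploys_text}")
    ctx.timeline.append(f"paged {ctx.on_call}")
    return ctx
```

Note that nothing in the sketch attempts a fix. It only gathers context and records what it did.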

Deciding what to hand off

Before the next incident, you need to answer this: what can the agent do without asking? Some things are obvious: acknowledging incidents, looking up who’s on call, logging to the timeline, gathering context from deploys and logs, paging with full context. Let the agent do those.

Some things should stay with humans: rollback decisions, customer comms, postmortem conclusions, declaring “we’re resolved.” These need judgment that agents don’t have yet. Then there’s the gray zone: escalation timing and severity classification. For escalation, pick a number of minutes and write it down. For severity, let agents triage but have humans confirm anything customer-facing.
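One way to make that split explicit is a small lookup table the agent consults before acting. The sketch below is an assumption-heavy illustration: the action names, the `Autonomy` levels, and the choice to treat unlisted actions as human-only are mine, not something the post prescribes.

```python
from enum import Enum


class Autonomy(Enum):
    AGENT = "agent may act without asking"
    CONFIRM = "agent proposes, human confirms"
    HUMAN = "human only"


# Hypothetical action names, grouped the way the text groups them.
POLICY = {
    "acknowledge_incident": Autonomy.AGENT,
    "lookup_on_call": Autonomy.AGENT,
    "log_to_timeline": Autonomy.AGENT,
    "gather_context": Autonomy.AGENT,
    "page_responder": Autonomy.AGENT,
    "rollback": Autonomy.HUMAN,
    "customer_comms": Autonomy.HUMAN,
    "postmortem_conclusion": Autonomy.HUMAN,
    "declare_resolved": Autonomy.HUMAN,
}


def autonomy_for(action: str, customer_facing: bool = False) -> Autonomy:
    """Return how much freedom the agent has for a given action."""
    if action == "classify_severity":
        # Gray zone: the agent triages, but a human confirms
        # anything customer-facing.
        return Autonomy.CONFIRM if customer_facing else Autonomy.AGENT
    # Assumption: anything not written down is treated as human-only.
    return POLICY.get(action, Autonomy.HUMAN)
```

Defaulting unknown actions to human-only is a conservative design choice: a gap in the table then shows up as a blocked action rather than a surprise.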

When the agent is better than your process

One team told me the agent didn’t improve their incidents. It exposed how broken the process already was. Nobody remembers ten steps at 2 AM. The process was asking humans to do machine work, and the agent made that visible.

Fewer tools, not more

More tools means more ambiguity. With seventy tools, the agent ends up picking between list_incidents, get_incidents, search_incidents, and query_incidents. Four tools, same thing, different names. Now the agent is guessing. Narrower surfaces make agents more dependable. We kept Runframe’s MCP server to sixteen tools around incident workflows. If it doesn’t help run an incident, it’s out.
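If you want a quick way to spot this kind of overlap in your own tool list, something like the following could help. It is a naive heuristic sketch, not how Runframe chose its tools: the `READ_VERBS` set and the `verb_noun` naming assumption are mine, and it will miss overlaps that don’t follow that naming pattern.

```python
from collections import defaultdict

# Assumption: verbs that usually mean "read some records".
READ_VERBS = {"list", "get", "search", "query", "find", "fetch"}


def overlapping_read_tools(tool_names):
    """Group tools named verb_noun whose verbs are near-synonyms for reading."""
    by_noun = defaultdict(list)
    for name in tool_names:
        verb, _, noun = name.partition("_")
        if verb in READ_VERBS and noun:
            by_noun[noun].append(name)
    # Only nouns with more than one read-style tool are ambiguous.
    return {noun: sorted(names) for noun, names in by_noun.items() if len(names) > 1}


tools = [
    "list_incidents",
    "get_incidents",
    "search_incidents",
    "query_incidents",
    "page_responder",
]
print(overlapping_read_tools(tools))
```

Each group it reports is a place where the agent has to guess, and a candidate for collapsing into one tool.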

What I’d do

Start with SEV3s. Low severity, cheap mistakes, good place to watch how the agent behaves. Don’t let agents near SEV0 until you’ve seen them work on dozens of lower-severity incidents.

Write down delegation boundaries. Not runbooks that say “do X then Y,” but guardrails: never do X without human approval, escalate after Y minutes. Test those before you need them. And pay attention when the agent does something unexpected. That’s the agent telling you what you forgot to define.
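Guardrails of that shape are small enough to write as code and test ahead of time. Here is a minimal sketch; the 15-minute threshold, the `NEEDS_APPROVAL` set, and the function names are placeholder assumptions, so substitute whatever X and Y your team writes down.

```python
from datetime import datetime, timedelta
from typing import Optional

ESCALATE_AFTER = timedelta(minutes=15)  # placeholder for "Y minutes"
NEEDS_APPROVAL = {"rollback"}           # placeholder for "X"


def should_escalate(paged_at: datetime, acked_at: Optional[datetime], now: datetime) -> bool:
    """Escalate when a page has gone unacknowledged for ESCALATE_AFTER."""
    if acked_at is not None:
        return False
    return now - paged_at >= ESCALATE_AFTER


def may_execute(action: str, approved_by: Optional[str] = None) -> bool:
    """Block guarded actions unless a human has signed off."""
    return action not in NEEDS_APPROVAL or approved_by is not None


def test_guardrails() -> None:
    paged = datetime(2024, 1, 1, 2, 0)
    assert not should_escalate(paged, None, paged + timedelta(minutes=14))
    assert should_escalate(paged, None, paged + timedelta(minutes=15))
    # An acknowledged page never escalates, however late it is.
    assert not should_escalate(paged, paged + timedelta(minutes=3), paged + timedelta(hours=1))
    assert not may_execute("rollback")
    assert may_execute("rollback", approved_by="on-call engineer")
    assert may_execute("log_to_timeline")


test_guardrails()
```

Running checks like these in CI is one way to “test those before you need them”: a changed threshold or a newly guarded action then fails a test instead of surfacing mid-incident.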

This is the short version. The full post is at runframe.io/blog/your-agent-just-handled-that-sev2. We built Runframe around these ideas. If you’re thinking about agent-driven incident management, the MCP server is open source.