I built a production risk scanner in one day. Here's what it caught

If you’re an SRE or DevOps engineer, try blastradar.vercel.app and tell me what you actually think. BlastRadar scores any code diff for production risk: paste a diff and get a 1-10 score, a plain-English explanation of what could break, and a blast radius showing which systems are affected.
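
To make the three outputs concrete, here is a minimal sketch of what one result could look like as a data structure. The `RiskReport` class and its field names are hypothetical, chosen only for illustration; the post does not describe BlastRadar's actual response format.

```python
from dataclasses import dataclass, field


@dataclass
class RiskReport:
    """Hypothetical shape for one scan result (not BlastRadar's real schema)."""

    score: int                      # 1 (harmless) to 10 (likely outage)
    explanation: str                # plain-English summary of what could break
    blast_radius: list[str] = field(default_factory=list)  # affected systems

    def __post_init__(self) -> None:
        # The post states scores run from 1 to 10, so reject anything else.
        if not 1 <= self.score <= 10:
            raise ValueError(f"score must be between 1 and 10, got {self.score}")


# The documentation-only PR from the test run, expressed in this shape.
docs_report = RiskReport(
    score=1,
    explanation="zero production impact, documentation only",
    blast_radius=[],
)
```

Keeping the blast radius as a plain list of system names is an assumption that keeps the sketch small; a real tool might attach severity or ownership to each entry.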

Last month a Cursor agent deleted a company’s entire production database in 9 seconds. Amazon’s AI coding mandate caused a 13-hour AWS Cost Explorer outage. A developer using Claude Code wiped 2.5 years of course submissions in one command. AI coding agents are merging code into production faster than any human can review it. SREs are the ones getting paged at 2am when it breaks.

I wanted to see if I could build something useful for this problem in a single day. Here’s what I built and what I learned.

What it caught: I tested it on two diffs.

The first was a documentation PR, which scored 1/10. Output: “zero production impact, documentation only.” Correct.

The second was a database config change that pointed production at a read replica with a connection pool of 2. It scored 9/10, and the tool flagged three things: all writes would fail immediately against a read-only replica, a pool of 2 causes connection starvation under any real load, and dropping the timeout from 5000ms to 500ms would trigger a retry storm on an already broken connection pool. That last one, the retry storm, is second-order thinking. The tool didn’t just list what changed; it reasoned about the cascade.
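
For readers who want to see the cascade spelled out, here is a toy, rule-based sketch of the three findings. It is not how BlastRadar works (its internals aren't described in this post). The `DbConfig` fields, the `min_pool` threshold, and the "before" pool size of 20 are all assumptions made for illustration; only the replica target, the pool of 2, and the 5000ms to 500ms timeout change come from the test diff.

```python
from dataclasses import dataclass


@dataclass
class DbConfig:
    # Field names are hypothetical; the real config format is not shown.
    targets_read_replica: bool
    pool_size: int
    timeout_ms: int


def find_risks(old: DbConfig, new: DbConfig, min_pool: int = 5) -> list[str]:
    """Return plain-English risks introduced by moving from old to new."""
    risks: list[str] = []

    # First-order: writes cannot succeed against a read-only replica.
    if new.targets_read_replica and not old.targets_read_replica:
        risks.append("all writes fail: production now targets a read-only replica")

    # First-order: a very small pool starves under real load.
    # min_pool is an arbitrary illustrative threshold.
    if new.pool_size < min_pool:
        risks.append(f"connection starvation: pool size is only {new.pool_size}")

    # Second-order: a shorter timeout is only flagged here when something
    # else is already failing, because that is when clients time out sooner
    # and retry more often against a pool that cannot serve them.
    if new.timeout_ms < old.timeout_ms and risks:
        risks.append(
            f"retry storm: timeout cut from {old.timeout_ms}ms to "
            f"{new.timeout_ms}ms on an already failing pool"
        )

    return risks


before = DbConfig(targets_read_replica=False, pool_size=20, timeout_ms=5000)
after = DbConfig(targets_read_replica=True, pool_size=2, timeout_ms=500)

for risk in find_risks(before, after):
    print(risk)
```

The third rule only fires when an earlier rule has already fired. That ordering is the point: the timeout change is far more dangerous in combination with the other two than it would be alone.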

What I don’t know yet: whether this is actually useful in a real SRE workflow, whether the risk scoring is calibrated correctly for real production diffs, and whether existing tools like CodeRabbit already solve this well enough. That’s why I’m posting here. If you’re an SRE or DevOps engineer, try it at blastradar.vercel.app and tell me what you actually think. What does it get wrong? What would make it useful in your daily workflow?
