I built an open source SDK to catch AI agent regressions before they ship.
You fix a bug in your agent. A week later you change the prompt or swap the model. The same bug comes back. Nobody notices until a user does. Regular software has regression tests for this. AI agents mostly do not. So I built replayd.
When your agent fails, you capture that run and save it as a test. Before you ship a new version, you replay the saved failures against it. If the same failure comes back, you catch it before your users do. Install it with: pip install replayd
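To make the capture-and-replay loop concrete, here is a minimal standalone sketch in plain Python. It does not use replayd's actual API: `CapturedFailure`, `save_failure`, `replay_failures`, the JSON layout, and the two stand-in agents are all hypothetical names invented for illustration. It assumes an agent can be treated as a function that takes an input string and returns the names of the tools it called.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List, Tuple


@dataclass
class CapturedFailure:
    """One failed run, saved so it can be replayed later (hypothetical format)."""
    name: str
    agent_input: str
    forbidden_tool: str  # failure signature: a tool the agent must not call


def save_failure(failure: CapturedFailure, directory: Path) -> Path:
    """Persist a captured failure as a JSON file named after the failure."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{failure.name}.json"
    path.write_text(json.dumps(asdict(failure)))
    return path


def replay_failures(agent: Callable[[str], List[str]], directory: Path) -> List[str]:
    """Replay every saved failure against `agent`.

    Returns the names of the failures that came back.
    """
    regressed = []
    for path in sorted(directory.glob("*.json")):
        failure = CapturedFailure(**json.loads(path.read_text()))
        tools_called = agent(failure.agent_input)
        if failure.forbidden_tool in tools_called:
            regressed.append(failure.name)
    return regressed


# Stand-in agents: each returns the list of tool names it "called".
def fixed_agent(agent_input: str) -> List[str]:
    return ["lookup_order"]


def regressed_agent(agent_input: str) -> List[str]:
    return ["lookup_order", "issue_refund"]


def demo() -> Tuple[List[str], List[str]]:
    """Capture one failure, then replay it against both agent versions."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "failures"
        save_failure(
            CapturedFailure("refund_without_approval", "Where is my order?", "issue_refund"),
            store,
        )
        return replay_failures(fixed_agent, store), replay_failures(regressed_agent, store)
```

Calling `demo()` replays the one saved failure against both stand-ins: the fixed agent produces an empty regression list, and the regressed agent is flagged by the name of the saved failure. In a real setup the replay step would presumably run in CI before a release.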
The interesting part was grading. You cannot use exact output matching because LLMs are non-deterministic. So replayd does not check the text. It checks whether the specific failure came back. Structural failures get deterministic assertions. Semantic ones get an LLM as judge. You assert on what the agent did, not what it said.
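The split between structural and semantic grading could look roughly like the sketch below. This is an illustration under assumptions, not replayd's real interface: `Run`, `ToolCall`, `called_tool_more_than`, `judged_failure`, and `stub_judge` are hypothetical names, and the judge is a keyword stub standing in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, object] = field(default_factory=dict)


@dataclass
class Run:
    tool_calls: List[ToolCall]
    final_text: str


def called_tool_more_than(run: Run, tool: str, limit: int) -> bool:
    """Structural check: deterministic, looks only at what the agent did."""
    return sum(1 for call in run.tool_calls if call.name == tool) > limit


def judged_failure(run: Run, failure_description: str, judge: Callable[[str], bool]) -> bool:
    """Semantic check: ask a judge whether the described failure occurred."""
    prompt = (
        f"Failure to look for: {failure_description}\n"
        f"Agent output: {run.final_text}\n"
        "Did this failure occur? Answer yes or no."
    )
    return judge(prompt)


def stub_judge(prompt: str) -> bool:
    # Stand-in for an LLM call (assumption): flags any prompt mentioning "by friday".
    return "by friday" in prompt.lower()


# A run that loops on the search tool and promises a delivery date.
bad_run = Run(
    tool_calls=[ToolCall("search"), ToolCall("search"), ToolCall("search")],
    final_text="Your package will arrive by Friday.",
)

# A run with different wording and no failure.
good_run = Run(
    tool_calls=[ToolCall("search")],
    final_text="I cannot confirm a delivery date yet.",
)

structural_bad = called_tool_more_than(bad_run, "search", 2)
structural_good = called_tool_more_than(good_run, "search", 2)
semantic_bad = judged_failure(bad_run, "agent promised a delivery date", stub_judge)
semantic_good = judged_failure(good_run, "agent promised a delivery date", stub_judge)
```

Neither check compares the output against a golden string. The structural check ignores the text entirely, and the semantic check asks only whether one named failure occurred, so harmless rewording between runs does not trip it.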
It is v0.1.1: early, with rough edges, but the core loop works. The core has zero runtime dependencies and is framework agnostic. GitHub: github.com/TaimoorKhan10/replayd
If you are running agents in production, I would love your feedback on the grading approach. What are you catching manually right now that you wish were automated?