I Build ML Infrastructure for a Living — Here's Why Hermes Agent Changes the Game for Platform Engineers

I’ve spent the past year building NeuroScale — an open-source AI inference platform on Kubernetes. 108 commits. 21 automated smoke checks across 6 milestones. The kind of platform where a developer fills in a Backstage form and gets a production-grade inference endpoint with drift control, policy guardrails, and cost attribution — no kubectl required. I’m telling you this because I need you to understand where I’m coming from when I say: Hermes Agent isn’t just another AI coding assistant. It’s the first agent framework that actually thinks like a platform engineer. I don’t say that lightly.

The Problem Nobody Talks About: AI Agents Are Stateless in a Stateful World

Building ML infrastructure teaches you one thing fast: everything is state. Your ArgoCD sync status is state. Your Kyverno policy violations are state. The drift between what’s in Git and what’s running in the cluster — state. The fact that someone ran kubectl apply directly at 2am and broke the GitOps contract — that’s state too. Every AI agent I’ve used before Hermes treats each conversation like a blank canvas. You explain your architecture. You describe the problem. You get a plausible answer. Then you close the tab and do it all over again tomorrow. Groundhog Day for infrastructure debugging. Hermes Agent is architecturally different, and the difference matters specifically for the kind of work platform engineers do.

Three-Layer Memory: What It Actually Means for Infrastructure

Most people writing about Hermes focus on the memory system as a convenience feature. “It remembers your preferences.” “It knows your name.” That’s not what makes it interesting. Hermes runs a three-layer memory architecture:

Short-term: current conversation context (same as every other agent).

Medium-term: session summaries that persist between conversations, built through periodic “memory nudges”.

Long-term: Skill Documents that capture how it solved specific types of problems, stored as reusable procedures.

For a platform engineer, this maps directly to something we already understand: runbooks. When I troubleshoot an ArgoCD sync failure, I don’t start from first principles. I check the runbook. Token expiry? Webhook misconfiguration? Sync wave ordering? The runbook encodes prior incident resolution as a procedure.

Hermes does this automatically. After roughly 15 tasks, its GEPA loop kicks in: it reviews its own performance, identifies patterns, and writes new Skill Documents. (GEPA, short for Genetic-Pareto, is a reflective prompt-evolution method published at ICLR 2026 as an Oral; here it plays out as goal, execute, self-prompted introspection, adapt.) Agents with 20+ self-generated skills reportedly complete similar future tasks 40% faster than fresh instances.

That’s not “remembering your name.” That’s an agent building its own runbook library. It’s the difference between a junior on-call engineer and a senior who’s seen every failure mode before.
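To make the runbook analogy concrete, here is a minimal Python sketch of how the three layers could be modelled. This is a toy under stated assumptions, not Hermes's actual data model: the class names, fields, and the keyword-matching lookup are all hypothetical and exist only to illustrate the idea of skills as retrievable procedures.

```python
from dataclasses import dataclass, field


@dataclass
class SkillDocument:
    """A reusable procedure, analogous to a runbook entry (hypothetical shape)."""
    name: str
    when_to_use: set[str]   # trigger keywords, e.g. {"argocd", "sync"}
    procedure: list[str]    # ordered troubleshooting steps


@dataclass
class AgentMemory:
    """Toy model of the three memory layers; not Hermes's real implementation."""
    short_term: list[str] = field(default_factory=list)            # current conversation
    medium_term: list[str] = field(default_factory=list)           # session summaries
    long_term: list[SkillDocument] = field(default_factory=list)   # Skill Documents

    def end_session(self, summary: str) -> None:
        """Persist a session summary and drop the conversation context."""
        self.medium_term.append(summary)
        self.short_term.clear()

    def find_skills(self, task: str) -> list[SkillDocument]:
        """Return skills whose trigger keywords all appear in the task text."""
        words = set(task.lower().split())
        return [s for s in self.long_term if s.when_to_use <= words]


memory = AgentMemory()
memory.long_term.append(
    SkillDocument(
        name="argocd-sync-failure",
        when_to_use={"argocd", "sync"},
        procedure=[
            "Check token expiry",
            "Check webhook configuration",
            "Check sync wave ordering",
        ],
    )
)
matches = memory.find_skills("ArgoCD sync failed again")
```

The point of the sketch is the lookup at the end: a new task pulls in a stored procedure the way an on-call engineer pulls a runbook, instead of starting from first principles.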

Where Hermes Creates Real Value in an ML Platform Stack

Abstract possibilities are cheap. Let me be specific about where this matters in a stack like NeuroScale.

1. Configuration Drift Diagnosis

NeuroScale uses ArgoCD with selfHeal: true, so drift is auto-corrected. But detecting drift before ArgoCD catches it, and understanding why it happened, is a different problem. Here’s what a Hermes scheduled audit looks like in practice:

hermes task add --cron "0 */6 * * *" \
"Check the diff between Git-declared state in infrastructure/apps/ \
and live cluster state. If they diverge, summarize what changed, \
correlate with recent kubectl audit logs, and flag whether the \
change was human-initiated or a controller reconciliation. \
Send results to Telegram."

Most agents can run a diff. Hermes does the part that matters: building a pattern library over time. After a month of audits, it knows that drift in the serving-stack namespace is almost always a Knative autoscaler update (harmless), while drift in kyverno/policies/ is almost always someone bypassing admission control (critical). That context accumulates in Skill Documents. I haven’t seen another agent framework that does this out of the box.
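As a rough illustration of the two halves of that audit, the Python sketch below diffs declared against live state and then classifies the drift against a small pattern table. The two patterns come from the examples above; the `DriftPattern` structure, the function names, and the `needs-review` fallback are hypothetical and say nothing about how Hermes stores this internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftPattern:
    scope: str          # namespace or path prefix where the drift shows up
    likely_cause: str
    severity: str       # "harmless", "critical", or "needs-review"


# Patterns taken from the examples in the text; the table itself is hypothetical.
LEARNED_PATTERNS = [
    DriftPattern("serving-stack", "Knative autoscaler update", "harmless"),
    DriftPattern("kyverno/policies", "admission control bypass", "critical"),
]


def diff_state(declared: dict, live: dict) -> dict:
    """Return each key whose value differs, mapped to (declared, live)."""
    keys = declared.keys() | live.keys()
    return {
        k: (declared.get(k), live.get(k))
        for k in keys
        if declared.get(k) != live.get(k)
    }


def classify(scope: str, patterns=LEARNED_PATTERNS) -> DriftPattern:
    """Match a drifted scope to a known pattern; unknown drift goes to a human."""
    for pattern in patterns:
        if scope.startswith(pattern.scope):
            return pattern
    return DriftPattern(scope, "unknown", "needs-review")


drift = diff_state(
    {"replicas": 2, "image": "model:1.0"},
    {"replicas": 5, "image": "model:1.0"},
)
verdict = classify("serving-stack")
```

The diff is the easy part. The value sits in the pattern table, which in this sketch is hard-coded but in the scenario described above would grow out of a month of audits.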

Here’s what a drift report from Hermes actually looks like after a few weeks of accumulated context:

📋 Drift Audit - 2026-05-23 12:00 UTC
Cluster: neuroscale-prod
Namespaces scanned: 4

✅ serving-stack: 2 diffs detected
→ Both are Knative autoscaler reconciliations (harmless)
→ Matches pattern from Skill: “knative-autoscaler-drift”
→ No action required.

⚠️ kyverno/policies: 1 diff detected
→ ClusterPolicy “require-resource-limits” modified in-cluster
→ Not present in Git (infrastructure/policies/)
→ kubectl audit: manual apply by user “ops-admin” at 03:12 UTC
→ FLAGGED: Possible admission control bypass.
→ Recommend: Revert in-cluster change or commit to Git.

📎 Context: This is the 3rd manual policy edit in 14 days.
Previous incidents resolved by reverting.
See Skill: “kyverno-drift-response” for standard procedure.

Notice the last three lines. That’s not a generic diff. That’s an agent referencing its own operational history — correlating today’s anomaly with patterns it learned from previous audits. A fresh agent instance can’t do that. One with a month of Skill Documents can.
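The "3rd manual policy edit in 14 days" line is, at bottom, a rolling-window count over remembered incidents. Here is a minimal sketch of that arithmetic, assuming the agent keeps a timestamp per past incident. The function name and the incident dates are made up for illustration; only the 2026-05-23 03:12 edit and the 12:00 audit time come from the report.

```python
from datetime import datetime, timedelta


def count_recent(incidents: list[datetime], now: datetime, window_days: int = 14) -> int:
    """Count incidents that fall inside the trailing window ending at `now`."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for ts in incidents if cutoff <= ts <= now)


audit_time = datetime(2026, 5, 23, 12, 0)
manual_policy_edits = [
    datetime(2026, 5, 1, 9, 30),    # hypothetical, outside the 14-day window
    datetime(2026, 5, 12, 22, 5),   # hypothetical earlier incident
    datetime(2026, 5, 18, 1, 40),   # hypothetical earlier incident
    datetime(2026, 5, 23, 3, 12),   # the edit flagged in the report
]
recent = count_recent(manual_policy_edits, audit_time)
```

A stateless agent has no `manual_policy_edits` list to count over, which is why it cannot produce that context line.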

2. Policy Validation Before Merge

NeuroScale enforces 5 Kyverno ClusterPolicies, including required resource limits, standard labels, non-root containers, and no :latest tags. But violations caught at admission mean the deploy already failed. The earlier you catch them, the cheaper the fix. This is where Skill Documents become genuinely powerful. You write one that encodes your specific policies:

# Skill: NeuroScale Policy Pre-Check
## When to Use
When reviewing PRs that modify files under `apps/` or `infrastructure/`.
## Procedure
1. Check for `owner` and `cost-center` labels on all InferenceService manifests
2. Verify `resources.requests` and `resources.limits` are set
3. Flag any image tag that is `latest` or missing
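
The three checks listed so far are simple enough to sketch as a plain Python function, which may help when deciding what the agent should look for. This is an illustration under assumptions, not part of NeuroScale or Hermes: it assumes manifests are already parsed into dicts and use the custom-container form with containers under `spec.predictor.containers`, and the helper names are hypothetical.

```python
from typing import Optional

REQUIRED_LABELS = ("owner", "cost-center")


def image_tag(image: str) -> Optional[str]:
    """Return the tag (or digest) of an image reference, or None if absent."""
    last = image.rsplit("/", 1)[-1]        # skip any registry host:port prefix
    if "@" in last:
        return last.split("@", 1)[1]       # digest-pinned counts as explicit
    if ":" in last:
        return last.split(":", 1)[1]
    return None


def precheck(manifest: dict) -> list[str]:
    """Run the three procedure steps and return human-readable findings."""
    findings = []

    # Step 1: required labels.
    labels = manifest.get("metadata", {}).get("labels", {})
    for label in REQUIRED_LABELS:
        if label not in labels:
            findings.append(f"missing label: {label}")

    # Assumed layout: containers listed under spec.predictor.containers.
    containers = manifest.get("spec", {}).get("predictor", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")

        # Step 2: resource requests and limits.
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                findings.append(f"{name}: resources.{key} not set")

        # Step 3: image tag must be present and not "latest".
        tag = image_tag(container.get("image", ""))
        if tag is None or tag == "latest":
            findings.append(f"{name}: image tag is latest or missing")

    return findings
```

Calling `precheck` on each changed manifest in a PR gives a list of findings, and an empty list means these three checks passed. It does not replace Kyverno; it only moves the same questions earlier in the pipeline.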