Why I chose MCP over RAG for live infrastructure auditing

I’ve been working on a project to audit distributed hardware infrastructure: devices spread across multiple sites, each running firmware that needs to stay compliant with a central policy. It’s a pretty standard enterprise ops problem. My first instinct was RAG. Everyone reaches for RAG: you embed your documents, stand up a vector store, and your agent can reason over your data. I’ve built RAG pipelines before and they work well, so I started there.

Three days in, I switched direction.

The moment I realized RAG wasn’t the right fit

I was testing the agent against a scenario where a device had failed a firmware check at 2am. The agent reported it as compliant. The problem wasn’t the model. The agent was reasoning over an embedded snapshot I’d generated two days earlier, and the device had drifted since then. The vector store didn’t know, and it can’t know: it’s a snapshot by design. That works fine for a documentation assistant. For infrastructure audit it’s a problem, because you need to know what’s happening now, not what was true when you last ran the embedding pipeline.

What I needed wasn’t retrieval. It was access. Here’s the reframe that changed how I thought about this. RAG answers the question: what documents are relevant to this query? What I actually needed to answer was: what is the current state of device X right now? Those are different questions. One is a search problem. The other is a database query. I was using the wrong tool.

The inventory (firmware versions, device health, site assignments) lives in a SQLite database. The compliance policy lives in a structured text file. Neither of these is a document in any meaningful sense. Chunking them and embedding them into a vector store was me forcing a square peg into a round hole because that’s what I knew how to do.

I built an MCP (Model Context Protocol) server that exposes that data as tools the agent can call:

• get_inventory(): returns live device state, current to the second
• query_policy(): reads the policy file and returns the requirements
• flag_violation(): marks a device non-compliant with structured metadata

The agent calls these the same way your application code calls an API. No embedding pipeline. No staleness problem. No guessing at similarity scores for what is fundamentally a structured query.
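To make the three tools concrete, here is a minimal sketch using only Python’s standard library. The table schema, the `category: minimum_firmware` policy format, and every parameter name are assumptions for illustration, since the post doesn’t specify them. Registering the functions with an MCP server is left out because it depends on the SDK in use.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE devices (
    device_id TEXT PRIMARY KEY,
    site      TEXT NOT NULL,
    category  TEXT NOT NULL,
    firmware  TEXT NOT NULL,
    violation TEXT
);
"""


def parse_policy(policy_text: str) -> dict:
    """Parse assumed 'category: minimum_firmware' lines into a dict."""
    policy = {}
    for line in policy_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            category, minimum = line.split(":", 1)
            policy[category.strip()] = minimum.strip()
    return policy


def get_inventory(conn: sqlite3.Connection, site: str = None) -> list:
    """Read device state from the database at call time, not from a snapshot."""
    sql = "SELECT device_id, site, category, firmware, violation FROM devices"
    params = ()
    if site is not None:
        sql += " WHERE site = ?"
        params = (site,)
    sql += " ORDER BY device_id"
    cur = conn.execute(sql, params)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def query_policy(policy_text: str, category: str) -> dict:
    """Return the requirement for one device category (None if not listed)."""
    return {"category": category, "min_firmware": parse_policy(policy_text).get(category)}


def flag_violation(conn: sqlite3.Connection, device_id: str, reason: str) -> dict:
    """Mark a device non-compliant with a timestamped reason."""
    record = f"{datetime.now(timezone.utc).isoformat()} {reason}"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE devices SET violation = ? WHERE device_id = ?",
            (record, device_id),
        )
    return {"device_id": device_id, "flagged": cur.rowcount == 1}
```

Because get_inventory() runs a query on every call, a firmware change written to the database a second ago shows up in the next tool call, which is exactly what the embedded snapshot could not do.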

The gateway nobody talks about

One thing I’d push back on in most agent tutorials: they wire the LLM directly to the frontend and call it done. I put a FastAPI gateway in between, and I’d do it again every time. The practical reason: NVIDIA NIM credits aren’t free. A misconfigured client or a runaway loop can drain your quota in minutes if there’s nothing between the UI and the model. The gateway enforces rate limits per IP before a single token is generated. That saved me actual money during development.
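The per-IP limit can be sketched as a sliding-window counter. This is a stand-in written with the standard library only, not the gateway’s actual code; the class name, the limits, and the injectable clock are assumptions. In a FastAPI gateway, a check like this would typically run in a middleware or dependency and answer with HTTP 429 when it returns False.

```python
import time
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                 # injectable so tests can control time
        self._hits = defaultdict(deque)     # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                    # reject before any model call happens
        hits.append(now)
        return True


limiter = PerIPRateLimiter(max_requests=30, window_seconds=60)
if limiter.allow("203.0.113.10"):
    pass  # forward the request to the model here
```

Rejected requests are not recorded, so a client stuck in a retry loop cannot extend its own lockout, and it also never reaches the model.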

The better reason: not every query needs the full audit agent. Simple questions (how many nodes are in Bellevue?) don’t need a multi-step LangGraph agent burning Gemini 2.5 tokens. The gateway classifies intent and routes accordingly. Simple queries go to a lighter NIM worker. Full compliance audits go to the Gemini agent. It also centralises auth and logging in one place, which matters when you need to show a security team exactly what the agent did and when.
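The routing step might look like the sketch below. The post doesn’t say how intent is classified, so the keyword heuristic here is only a placeholder for whatever classifier the gateway really uses, and the backend names are made up.

```python
# Hypothetical keyword stems that suggest a full compliance audit is needed.
AUDIT_STEMS = ("audit", "complian", "violation", "remediat", "policy", "drift")

# Hypothetical backend identifiers.
ROUTES = {"simple": "nim_worker", "audit": "gemini_agent"}


def classify_intent(query: str) -> str:
    """Return 'audit' if the query mentions audit-related terms, else 'simple'."""
    text = query.lower()
    if any(stem in text for stem in AUDIT_STEMS):
        return "audit"
    return "simple"


def route(query: str) -> str:
    """Pick the backend that should handle this query."""
    return ROUTES[classify_intent(query)]


print(route("How many nodes are in Bellevue?"))
print(route("Run a full compliance audit on the Bellevue site"))
```

Defaulting to the cheap path keeps token spend low, though a real classifier would probably want an explicit fallback to the full agent for queries it cannot place.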

The Judge

This is the piece I’m most glad I built, and the one I almost skipped. Every response, whether it came from the NIM worker or the Gemini agent, passes through a secondary LLM before it reaches the user. I call it the Judge. Its only job is to read the agent’s output, check it independently against the policy file, and decide whether the reasoning holds up.

During testing, the Judge caught something the main agent missed. The agent had correctly identified a non-compliant firmware version, but applied a remediation rule that belonged to a different device category. The reasoning was internally consistent; it just used the wrong rule. The Judge caught it because it reads the policy independently, without inheriting whatever context the main agent had accumulated during its reasoning loop. That independence is the point. If the Judge just re-reads the agent’s own context, it’s not really checking anything. You want it reading from the source, fresh.
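The independence property is easier to see in code. In the sketch below a deterministic check stands in for the secondary LLM, purely to show the data flow: the judge receives the agent’s final finding and a loader that re-reads the policy from its source, and nothing from the agent’s working context. The `category: rule_id` policy format, the rule IDs, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AgentFinding:
    """Only the agent's final output; its reasoning context is deliberately absent."""
    device_id: str
    category: str
    compliant: bool
    remediation_rule: Optional[str]


def parse_rules(policy_text: str) -> dict:
    """Parse assumed 'category: rule_id' lines into a dict."""
    rules = {}
    for line in policy_text.splitlines():
        if ":" in line:
            category, rule_id = line.split(":", 1)
            rules[category.strip()] = rule_id.strip()
    return rules


def judge(finding: AgentFinding, load_policy: Callable[[], str]) -> dict:
    """Check a finding against a fresh read of the policy source."""
    rules = parse_rules(load_policy())      # re-read from source on every call
    if finding.compliant:
        return {"verdict": "pass", "reason": "no remediation proposed"}
    expected = rules.get(finding.category)
    if expected is None:
        return {"verdict": "fail", "reason": f"no rule for category {finding.category!r}"}
    if finding.remediation_rule != expected:
        return {
            "verdict": "fail",
            "reason": f"rule {finding.remediation_rule!r} does not belong to "
                      f"category {finding.category!r}; policy says {expected!r}",
        }
    return {"verdict": "pass", "reason": "remediation rule matches policy"}


POLICY = "edge-router: R-12\nsensor-hub: R-30\n"
finding = AgentFinding("node-7", "edge-router", compliant=False, remediation_rule="R-30")
print(judge(finding, lambda: POLICY)["verdict"])
```

With an LLM in place of the rule comparison, the same signature still applies: the prompt is built from the finding and the freshly loaded policy text, never from the main agent’s message history.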

Humans stay in the loop

The agent can suggest remediation: here’s the CLI command to fix the firmware drift on node 7. It cannot run it. There’s a hard gate in the LangGraph state machine. Suggest remediation and execute remediation are separate nodes, and the only path between them runs through a human decision in the UI. An architect clicks Approve. Then and only then does the write operation touch the database. For infrastructure this felt like the right call. The cost of a false positive (a remediation that runs when it shouldn’t) is much higher than the cost of an extra approval click.
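The gate itself is a small state machine. The sketch below is plain Python rather than LangGraph, so it shows the shape of the rule and not the actual graph; the class names, stage names, and the example command are all hypothetical.

```python
from enum import Enum


class Stage(Enum):
    SUGGESTED = "suggested"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


class ApprovalRequired(Exception):
    """Raised when execution is attempted without a human approval."""


class Remediation:
    def __init__(self, device_id: str, command: str):
        self.device_id = device_id
        self.command = command
        self.stage = Stage.SUGGESTED        # the agent can only create suggestions
        self.approved_by = None

    def decide(self, reviewer: str, approve: bool) -> None:
        """Record the human decision; only valid once, from SUGGESTED."""
        if self.stage is not Stage.SUGGESTED:
            raise ValueError(f"cannot decide from stage {self.stage.value}")
        self.stage = Stage.APPROVED if approve else Stage.REJECTED
        self.approved_by = reviewer if approve else None

    def execute(self, apply_write) -> None:
        """Run the write, but only if a human approved it first."""
        if self.stage is not Stage.APPROVED:
            raise ApprovalRequired(f"remediation for {self.device_id} is {self.stage.value}")
        apply_write(self.device_id, self.command)
        self.stage = Stage.EXECUTED         # prevents a second execution


# The command string is a placeholder, not a real CLI.
fix = Remediation("node-7", "fw-update --target 1.2.0")
fix.decide(reviewer="architect", approve=True)
fix.execute(lambda device, cmd: print(f"writing for {device}: {cmd}"))
```

The only transition into APPROVED is decide(), which the agent never calls, so there is no code path from suggestion to write that skips the human.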

What I’d do differently

Two things. First, I’d instrument RAGAS metrics from day one. I ended up retrofitting evaluation on the agent’s audit outputs and found gaps I’d been manually poking at for weeks. Faithfulness and context relevancy scores would have surfaced those faster. Second, I’d write the red-team report in parallel, not after. I know what failure modes the Judge catches now, but I reconstructed most of that knowledge from memory rather than documenting it as I found it. A live failure log from the start would’ve made that report much sharper.
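A live failure log doesn’t need much. One possible shape is an append-only JSON Lines file, sketched here with the standard library; the field names and the file name are assumptions, not a format the project uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_failure(path, *, component: str, failure_mode: str, detail: str) -> dict:
    """Append one failure record as a single JSON line and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component": component,          # e.g. "judge" or "agent"
        "failure_mode": failure_mode,    # short, stable label for grouping later
        "detail": detail,
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_failures(path) -> list:
    """Load every record back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


log_failure(
    "failures.jsonl",
    component="judge",
    failure_mode="wrong_rule_category",
    detail="remediation rule taken from a different device category",
)
```

Grouping the records by failure_mode later gives the outline of a red-team report without relying on memory.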

The short version

RAG is the right tool for knowledge retrieval over static content. It’s a less natural fit when your agent needs to query live structured data and act on what it finds. MCP let me give the agent real database access through a typed tool interface: no embedding pipeline, no staleness, no similarity search on what is fundamentally a relational query. For infrastructure audit, that was the right call.