Building an AI Agent Harness from Scratch: The Architecture Between LLM and Agent
Everyone talks about the model. Nobody talks about the harness. Give Claude Sonnet or GPT-4o a chat interface and you get a conversational AI. Wrap it in a loop that can call external tools, maintain state across turns, enforce budget limits, and validate its own outputs — and you get an agent. The difference isn’t the LLM. It’s everything around the LLM.
The AWS team published a guide on “agent harnesses” this week, and it got me thinking: most tutorials show you how to call an LLM or how to register a tool. Almost none show you the orchestration layer that makes those individual pieces behave as a coherent system. I’ve built agents that run autonomously on production infrastructure 24/7. The mistakes I made early on weren’t about picking the wrong model. They were about skipping the harness — assuming the model would “just figure it out.” It won’t. The harness is what makes an agent reliable, and reliability is the only metric that matters once you move past the demo phase. Here’s how to build one from scratch.
What Is an Agent Harness, Really?
An agent harness is the execution environment that sits between the user and the LLM. It’s not the prompt. It’s not the model. It’s the infrastructure that:
- Manages the conversation loop — receiving input, calling the model, routing tool calls, feeding results back, repeating until termination.
- Registers and dispatches tools — maintaining a catalog of callable functions, validating arguments, executing them safely, and returning structured results.
- Maintains memory — storing conversation history, injecting relevant context, compressing old messages to stay within context limits.
- Enforces guardrails — limiting token budgets, capping tool call counts, preventing infinite loops, blocking dangerous actions.
- Handles failures — retrying on transient errors, degrading gracefully when a tool is unavailable, escalating to human review when confidence is low.
Without a harness, you have a stateless API call. With a harness, you have a system.
The Minimal Agent Harness
Let’s start with the smallest useful version. A harness needs three things: a model interface, a tool registry, and a loop.
```python
import json
from typing import Callable
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    parameters: dict  # JSON Schema
    fn: Callable


class AgentHarness:
    def __init__(self, model, system_prompt: str = ""):
        self.model = model
        self.system_prompt = system_prompt
        self.tools: dict[str, Tool] = {}
        self.max_iterations = 10

    def register_tool(self, tool: Tool):
        self.tools[tool.name] = tool

    def tool_list(self) -> list[dict]:
        return [
            {"type": "function", "function": {
                "name": t.name,
                "description": t.description,
                "parameters": t.parameters,
            }} for t in self.tools.values()
        ]

    def run(self, user_input: str) -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input},
        ]
        for _ in range(self.max_iterations):
            response = self.model.chat(
                messages=messages,
                tools=self.tool_list() if self.tools else None,
            )
            # No tool calls means the model produced a final answer.
            if not response.tool_calls:
                return response.content
            messages.append(response.message)
            for call in response.tool_calls:
                tool = self.tools.get(call.function.name)
                if not tool:
                    result = f"Error: Unknown tool '{call.function.name}'"
                else:
                    try:
                        args = json.loads(call.function.arguments)
                        result = tool.fn(**args)
                    except Exception as e:
                        result = f"Error: {type(e).__name__}: {e}"
                messages.append({"role": "tool", "content": str(result), "tool_call_id": call.id})
        return "Max iterations reached."
```
That’s the skeleton. It loops: call model, check for tool calls, execute, feed back. Seven lines of core logic. It works for demos. It breaks in production. Let’s see why.
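To see the loop end to end without a live API, you can drive it with a scripted stub model. Everything here (`ScriptedModel`, the `add` tool, the response classes) is a hypothetical stand-in for illustration, not part of any real SDK; the loop itself is the same one the harness above runs, inlined so the snippet is self-contained:

```python
import json
from dataclasses import dataclass

# Hypothetical stand-ins for an LLM client's response objects.
@dataclass
class ToolCallFn:
    name: str
    arguments: str  # JSON-encoded, as most chat APIs return it

@dataclass
class ToolCall:
    id: str
    function: ToolCallFn

class Response:
    def __init__(self, content=None, tool_calls=None):
        self.content = content
        self.tool_calls = tool_calls or []
        self.message = {"role": "assistant", "content": content}

class ScriptedModel:
    """Returns a canned tool call on turn one, then a final answer."""
    def __init__(self):
        self.turn = 0

    def chat(self, messages, tools=None):
        self.turn += 1
        if self.turn == 1:
            call = ToolCall("call_1", ToolCallFn("add", '{"a": 2, "b": 3}'))
            return Response(tool_calls=[call])
        # Read the tool result back out of the transcript.
        return Response(content=f"The sum is {messages[-1]['content']}.")

# The harness loop, inlined: call model, dispatch tools, feed back, repeat.
tools = {"add": lambda a, b: a + b}
model = ScriptedModel()
messages = [{"role": "user", "content": "What is 2 + 3?"}]
answer = "Max iterations reached."
for _ in range(10):
    resp = model.chat(messages)
    if not resp.tool_calls:
        answer = resp.content
        break
    messages.append(resp.message)
    for call in resp.tool_calls:
        args = json.loads(call.function.arguments)
        result = tools[call.function.name](**args)
        messages.append({"role": "tool", "content": str(result), "tool_call_id": call.id})

print(answer)  # → The sum is 5.
```

Swapping `ScriptedModel` for a real client only changes `chat`; the loop stays the same, which is the point of the harness abstraction.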
Problem 1: The Tool Registry Lies
You register a tool, the agent calls it, and the call crashes on bad arguments. The tool description promised certain parameters, the model complied, but the underlying function has tighter requirements. This isn’t the model’s fault — it’s a harness problem: the tool registry should validate before dispatch.
```python
class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, Tool] = {}
        self.call_counts: dict[str, int] = {}

    def register(self, tool: Tool):
        self.tools[tool.name] = tool
        self.call_counts[tool.name] = 0

    def validate_call(self, tool_name: str, arguments: dict) -> tuple[bool, str]:
        if tool_name not in self.tools:
            return False, f"Unknown tool: {tool_name}"
        schema = self.tools[tool_name].parameters
        for field in schema.get("required", []):
            if field not in arguments:
                return False, f"Missing required parameter: {field}"
        for arg_name in arguments:
            if arg_name not in schema.get("properties", {}):
                return False, f"Unexpected parameter: {arg_name}"
        return True, "OK"

    def execute(self, tool_name: str, arguments: dict):
        self.call_counts[tool_name] += 1
        return self.tools[tool_name].fn(**arguments)
```
The registry acts as a gatekeeper, not just a dispatcher. Before any tool fires, the harness verifies that the tool exists, that every required field is present, and that the model hasn’t hallucinated extra parameters; type checks can be layered onto the same schema. In my experience this catches 60-70% of tool-call errors before they reach application code.
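The checks above stop at field names; a lightweight type check can reuse the same JSON Schema by mapping scalar types onto Python types. A minimal sketch (a full implementation would reach for a library like `jsonschema`, which handles nesting, enums, and formats):

```python
# Map JSON Schema scalar types to Python types. A simplification:
# real JSON Schema validation covers far more than this.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

def check_types(schema: dict, arguments: dict) -> list[str]:
    """Return a list of type-mismatch messages (empty if all args conform)."""
    errors = []
    props = schema.get("properties", {})
    for name, value in arguments.items():
        expected = props.get(name, {}).get("type")
        if expected is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly
        # for integer/number fields.
        if expected in ("integer", "number") and isinstance(value, bool):
            errors.append(f"{name}: expected {expected}, got bool")
        elif not isinstance(value, JSON_TYPES[expected]):
            errors.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return errors

# Hypothetical weather-tool schema, for illustration only.
schema = {"properties": {"city": {"type": "string"}, "days": {"type": "integer"}}}
print(check_types(schema, {"city": "Berlin", "days": "3"}))
# → ['days: expected integer, got str']
```

Returning all mismatches at once, rather than failing on the first, lets the harness hand the model a complete correction in a single retry.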
Problem 2: Memory Bloat Kills Context
Ten turns in, the conversation contains the original prompt, four tool call/response pairs, and a partial draft. The context window is filling up. By turn 20, the model starts forgetting the system prompt. The solution is intelligent context management: compress what you don’t need, preserve what you do.
(The code snippet for this section is truncated in the original post.)
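Since the original snippet is cut off, here is a sketch of one common strategy, written under my own assumptions rather than taken from the article: a sliding window that keeps the system prompt and the most recent turns verbatim and collapses everything older into a one-line marker. A production harness would summarize the dropped turns instead of merely counting them:

```python
def compress_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Keep the system prompt and the last `keep_recent` messages;
    collapse the middle of the transcript into a one-line marker."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing worth compressing yet
    dropped = len(messages) - 1 - keep_recent
    marker = {"role": "system",
              "content": f"[{dropped} earlier messages compressed]"}
    return [messages[0], marker] + messages[-keep_recent:]

# Illustrative transcript: one system prompt plus eleven turns.
history = [{"role": "system", "content": "You are a helpful agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(11)]
print(len(compress_history(history)))  # → 8
```

The key invariant is that the system prompt always survives compression, which directly addresses the turn-20 failure mode described above.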