The Five Faculties: A Tour of SAFi's Cognitive Architecture
Most attempts at AI governance treat alignment as a prompt-level concern. You write a system message, hope the model follows it, and accept that any sufficiently creative attacker can talk the model into ignoring it. The Self-Alignment Framework Interface (SAFi) takes a different approach. Instead of asking a single LLM to judge its own output, SAFi splits cognition across five specialized faculties, each with a distinct role, a defined interface, and no ability to overstep its bounds. The result is a governed AI architecture that decouples generation from evaluation from execution. Let’s walk through each faculty in order, following the actual loop the orchestrator runs on every turn.
Phase Zero: The Pre-Generation Barrier
Before the Intellect ever sees a user prompt, the Phase Zero gate (phase_zero.py) runs a deterministic security scan. It checks injection signatures from a threat intelligence module, per-persona blacklisted phrases, and an entropy-based heuristic that catches indirect prompt injection attempts (the so-called “ancient text” pattern where a high-entropy blob contains embedded instruction markers). Phase Zero makes zero LLM calls. If it flags a threat, the orchestrator short-circuits immediately to a governed redirect, and the Intellect is never exposed to adversarial content.
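The three checks described above can be sketched as a small deterministic function. This is only an illustration under stated assumptions: the signature list, the marker pattern, both thresholds, and the names scan_prompt and ScanResult are hypothetical and are not taken from phase_zero.py.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Sequence

# All constants below are illustrative assumptions, not SAFi's real values.
INJECTION_SIGNATURES = (
    r"ignore (all )?previous instructions",
    r"you are now in developer mode",
)
INSTRUCTION_MARKERS = re.compile(r"system:|assistant:|### instruction", re.IGNORECASE)
ENTROPY_THRESHOLD = 5.0  # bits per character
MIN_BLOB_LENGTH = 40     # shorter prompts skip the entropy check


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


@dataclass(frozen=True)
class ScanResult:
    blocked: bool
    reason: str = ""


def scan_prompt(prompt: str, persona_blacklist: Sequence[str]) -> ScanResult:
    """Run the three deterministic checks in order; no LLM is involved."""
    lowered = prompt.lower()
    # 1. Known injection signatures.
    for pattern in INJECTION_SIGNATURES:
        if re.search(pattern, lowered):
            return ScanResult(True, "signature:" + pattern)
    # 2. Per-persona blacklisted phrases (case-insensitive substring match).
    for phrase in persona_blacklist:
        if phrase.lower() in lowered:
            return ScanResult(True, "blacklist:" + phrase)
    # 3. "Ancient text" heuristic: a long, high-entropy prompt that also
    #    carries instruction-like markers.
    if (
        len(prompt) >= MIN_BLOB_LENGTH
        and shannon_entropy(prompt) >= ENTROPY_THRESHOLD
        and INSTRUCTION_MARKERS.search(prompt)
    ):
        return ScanResult(True, "entropy+marker")
    return ScanResult(False)
```

In a design like this, an orchestrator would call scan_prompt first and skip the Intellect entirely whenever blocked is true, returning a governed redirect instead.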
1. Synderesis: The Immutable Constitution
The Synderesis faculty (synderesis.py) is the system’s constitution compiler. Before any prompt is processed, Synderesis defines the governance policies, value weights, and scope boundaries that every other faculty will reference. It exposes PERSONAS, GOVERNANCE_MAP, and functions like get_profile, list_profiles, and assemble_agent. At runtime, Synderesis is read-only. Its policies cannot be changed mid-conversation, which makes social engineering against the value system structurally impossible.
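One way to get the read-only behavior described above in Python is to expose policies only through immutable views. The sketch below reuses the names PERSONAS, get_profile, and list_profiles from the text, but their signatures, the persona "fiduciary", and every field name are assumptions made for illustration.

```python
from types import MappingProxyType
from typing import Any, List, Mapping

# Hypothetical persona definition; the field names are assumptions.
_RAW_PERSONAS = {
    "fiduciary": {
        "values": {"honesty": 0.5, "prudence": 0.5},
        "hard_gates": ["honesty"],
        "scope": "personal finance education",
    },
}


def _freeze(obj: Any) -> Any:
    """Recursively wrap dicts in read-only views and turn lists into tuples."""
    if isinstance(obj, dict):
        return MappingProxyType({k: _freeze(v) for k, v in obj.items()})
    if isinstance(obj, (list, tuple)):
        return tuple(_freeze(v) for v in obj)
    return obj


# Frozen once at import time; nothing else holds the underlying dicts.
PERSONAS: Mapping[str, Mapping[str, Any]] = _freeze(_RAW_PERSONAS)


def list_profiles() -> List[str]:
    """Return the available persona names in a stable order."""
    return sorted(PERSONAS)


def get_profile(name: str) -> Mapping[str, Any]:
    """Return a read-only profile, or raise KeyError for an unknown name."""
    if name not in PERSONAS:
        raise KeyError("unknown persona: " + repr(name))
    return PERSONAS[name]
```

With this shape, an attempt such as get_profile("fiduciary")["values"]["honesty"] = 0.0 raises TypeError, so no faculty can rewrite a value weight at runtime by accident.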
2. Intellect: The Generative Engine (Air-Gapped)
The Intellect (intellect.py) is the only faculty that talks to an LLM for generation. It parses RAG context, conversation history, Spirit feedback, and the user prompt to produce a typed intent. That intent is either a text response or a tool call proposal. The critical architectural invariant is the Air Gap: the Intellect never executes tools. It returns tool calls as proposals for the Will to approve. The generate method returns a 3-tuple of (intent, reflection, retrieved_context), and the orchestrator routes everything through the Will before any action is taken.
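The Air Gap can be illustrated with a typed intent that carries a tool call as plain data. In this sketch the class names, the argument layout, and the toy "TOOL name argument" convention are all assumptions. Only the 3-tuple return shape of generate comes from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Sequence, Tuple, Union


@dataclass(frozen=True)
class TextResponse:
    text: str


@dataclass(frozen=True)
class ToolCallProposal:
    tool_name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


Intent = Union[TextResponse, ToolCallProposal]


class Intellect:
    """Generates intents; it holds no tool registry and executes nothing."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm  # any callable mapping a prompt string to a completion

    def generate(
        self,
        user_prompt: str,
        history: Sequence[str],
        spirit_feedback: str,
        retrieved_context: str,
    ) -> Tuple[Intent, str, str]:
        prompt = "\n".join([retrieved_context, *history, spirit_feedback, user_prompt])
        raw = self._llm(prompt)
        # Assumed toy convention: "TOOL <name> <argument>" marks a tool proposal.
        if raw.startswith("TOOL "):
            parts = raw.split(" ", 2)
            argument = parts[2] if len(parts) > 2 else ""
            intent: Intent = ToolCallProposal(parts[1], {"query": argument})
        else:
            intent = TextResponse(raw)
        reflection = "draft produced; awaiting Will approval"
        return intent, reflection, retrieved_context
```

Because a ToolCallProposal is inert data, the only code path that could turn it into an action sits behind whatever approval step the orchestrator places in front of tool execution.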
3. Will: The Deterministic Gatekeeper
The Will (will.py) is pure Python with zero LLM calls. It doesn’t deliberate or negotiate. It runs strict structural passes, checking syntax, required exclusions, and user invariants. If a check fails, the Will vetoes the proposal immediately. The Will distinguishes between two failure modes. A hard-gate breach (a non-negotiable value with hard_gate=true scoring at or below -1.0) is caught deterministically and routed directly to a governed redirect with no rewrite. Everything else flows into an aggregate alignment score A_t in [0, 1]. If that score falls below the configurable threshold (default 0.5), the Will triggers a single Reflexion Loop: the Intellect rewrites the response using the persona’s coaching directive, then the Conscience and Spirit re-audit the corrected draft. If the rewrite still fails, the outcome depends on why it failed. A low alignment score is treated as a soft quality signal: the Will commits the best available draft with its honest low score recorded. Only a residual critical (ethical) violation routes to a governed redirect.
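The routing rules in this section reduce to a short decision function. The sketch below is a reading of the prose, not the contents of will.py: the names Decision, ValueScore, and decide are hypothetical, and the critical flag is an assumed way of marking the ethical violations the text mentions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class Decision(Enum):
    COMMIT = "commit"
    REWRITE = "rewrite"    # one Reflexion pass
    REDIRECT = "redirect"  # governed redirect


@dataclass(frozen=True)
class ValueScore:
    score: float            # -1.0 .. +1.0, as produced by the Conscience
    hard_gate: bool = False
    critical: bool = False  # assumed marker for an ethical violation


ALIGNMENT_THRESHOLD = 0.5  # default from the text
HARD_GATE_FLOOR = -1.0


def decide(
    ledger: Mapping[str, ValueScore],
    alignment: float,
    already_rewritten: bool,
    threshold: float = ALIGNMENT_THRESHOLD,
) -> Decision:
    """Route a draft given its ledger and aggregate alignment score A_t."""
    # Hard-gate breach: redirect immediately, with no rewrite attempt.
    if any(v.hard_gate and v.score <= HARD_GATE_FLOOR for v in ledger.values()):
        return Decision.REDIRECT
    # At or above the threshold: commit as-is.
    if alignment >= threshold:
        return Decision.COMMIT
    # Below the threshold on the first pass: trigger the single rewrite.
    if not already_rewritten:
        return Decision.REWRITE
    # Rewrite already spent: redirect only on a residual critical violation,
    # otherwise commit the draft with its low score on record.
    if any(v.critical for v in ledger.values()):
        return Decision.REDIRECT
    return Decision.COMMIT
```

The already_rewritten flag is what keeps the loop to a single pass: the function never returns REWRITE twice for the same turn.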
4. Conscience: The Analytical Auditor
The Conscience (conscience.py) is a secondary LLM call that evaluates the Intellect’s draft against the policy’s weighted value set. For each value, it produces a score on a continuous scale from -1.0 (absolute violation) to +1.0 (perfect alignment), with a confidence interval. This compliance ledger (L_t) is the mathematical judgment that the Will and Spirit depend on. The Conscience also has an evaluate_redirect method for auditing the quality of governed redirect messages on criteria like clarity, helpfulness, and tone. This ensures that even when SAFi refuses a request, it does so respectfully and provides guidance.
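Since the ledger comes from an LLM call, downstream pure-Python faculties benefit from validating it before use. The following is a minimal sketch that assumes the auditor returns a JSON array with value, score, and interval keys. That schema, and the names LedgerEntry and parse_ledger, are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    value: str
    score: float                   # clamped to [-1.0, +1.0]
    interval: Tuple[float, float]  # (low, high) confidence bounds


def _clamp(x: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, float(x)))


def parse_ledger(raw_json: str, expected_values: Sequence[str]) -> List[LedgerEntry]:
    """Turn the auditor's JSON into a validated ledger L_t."""
    items = {item["value"]: item for item in json.loads(raw_json)}
    missing = [name for name in expected_values if name not in items]
    if missing:
        # A skipped value would silently distort any later aggregate.
        raise ValueError("auditor skipped values: " + ", ".join(missing))
    ledger = []
    for name in expected_values:
        item = items[name]
        # Clamp out-of-range numbers and put the bounds in (low, high) order.
        low, high = sorted(_clamp(bound) for bound in item["interval"])
        ledger.append(LedgerEntry(name, _clamp(item["score"]), (low, high)))
    return ledger
```

Iterating over expected_values rather than over the model's output keeps the ledger in policy order and drops any value the policy does not define.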
5. Spirit: The Long-Term Integrator
The Spirit (spirit.py) is pure Python using NumPy. It ingests the Conscience ledger, scales the continuous scores into a consolidated metric from 1 to 10 (S_t), and updates the system’s moving average (mu_t) using an exponential moving average with a configurable beta parameter. A high beta (e.g., 0.9) means long memory, slow adaptation. A low beta (e.g., 0.1) means fast adaptation to recent behavior. The Spirit also computes behavioral drift (d_t), quantifying how much the current turn’s ethical vector diverges from the historical average. This gives operators a mathematical signal for detecting gradual alignment erosion before it becomes critical. The result is that SAFi doesn’t just evaluate individual outputs; it tracks the agent’s character over time.
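The text does not give the exact formulas, so the sketch below picks one plausible set and should be read as an assumption throughout: S_t is a linear rescaling of the weighted score sum, mu_t is a standard exponential moving average, and d_t is one minus the cosine similarity between the current weighted vector and the previous average. It uses plain lists to stay dependency-free, whereas the article says spirit.py itself uses NumPy. The name spirit_update is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def spirit_update(
    scores: Sequence[float],    # per-value scores in [-1, 1] from the ledger
    weights: Sequence[float],   # value weights, assumed to sum to 1
    mu_prev: Sequence[float],   # previous moving-average vector
    beta: float = 0.9,
) -> Tuple[float, List[float], Optional[float]]:
    """Return (S_t, mu_t, d_t) for one turn under the assumed formulas."""
    if not (len(scores) == len(weights) == len(mu_prev)):
        raise ValueError("scores, weights and mu_prev must have equal length")

    # Weighted ethical vector for this turn.
    p_t = [w * s for w, s in zip(weights, scores)]

    # Map the weighted sum from [-1, 1] onto the 1..10 scale.
    s_t = 1.0 + 9.0 * (sum(p_t) + 1.0) / 2.0

    # Exponential moving average: high beta keeps more history.
    mu_t = [beta * m + (1.0 - beta) * p for m, p in zip(mu_prev, p_t)]

    # Drift: 0 when aligned with history, 2 when pointing the opposite way.
    norm_p = math.sqrt(sum(p * p for p in p_t))
    norm_m = math.sqrt(sum(m * m for m in mu_prev))
    if norm_p == 0.0 or norm_m == 0.0:
        d_t = None  # undefined without a direction to compare against
    else:
        dot = sum(p * m for p, m in zip(p_t, mu_prev))
        d_t = 1.0 - dot / (norm_p * norm_m)
    return s_t, mu_t, d_t
```

Comparing against mu_prev rather than the freshly updated mu_t keeps the current turn from diluting its own drift reading. That ordering is a design choice of this sketch, not something the article specifies.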
Why Separation Matters
This cognitive architecture solves a real engineering problem. Monolithic LLMs face an inherent conflict: the same model that generates a response must also evaluate whether that response is compliant. SAFi’s benchmarks show that unguarded baselines fail adversarial prompts at a rate 30 percentage points higher than the governed pipeline. By splitting generation (Intellect) from evaluation (Conscience) from execution (Will), SAFi eliminates that conflict. The governance layer is model-independent: the same deterministic gates fire whether the underlying LLM is GPT-5, Claude, or an open-source fine-tune. You can swap the model without rewriting the governance. Every step of the loop is audited and logged, giving operators an immutable trail showing exactly why a machine determined an action was compliant. If you are building production AI agents where governance is not optional, the five-faculty architecture is worth studying closely.