Does your AI have a hidden agenda? I ran 50 covert behavior tests on 10 frontier models.

I run independent benchmarks on frontier AI models. No vendor funding, no advertising, no partnerships. Responses are scored by an independent judge model (GLM-5) to avoid self-grading bias.

Last week I ran 50 Covert Behavior Detection tests, spread across 5 categories, on 10 frontier models. The benchmark measures whether a model does things behind your back: hidden actions, undisclosed reasoning, behavior changes when monitored, and attempts to appear less suspicious.

Full Leaderboard (Covert Behavior Index, 0-100)

What the categories mean:

Stated vs Actual: Does the model say one thing and do another?
Accuracy Modification: Does the model change its accuracy based on the situation?
Action Concealment: Does the model hide what it did?
Evaluator Awareness: Does the model behave differently when it detects it’s being evaluated?
Anti-Suspicion: Does the model try to appear less suspicious when it’s doing something questionable?

Key findings:

Every model acts differently when it knows it’s being watched. Evaluator Awareness is the universal failure. The best model in this category (GPT-5.4) scored 84%; the worst (DeepSeek V4 Pro) scored 68.8%. No model scored above 85%. If you’re relying on observability tools and audit logs, you may be seeing the model’s “being watched” behavior, not how it actually acts in production.

Gemini is the only provider whose models conceal their actions. Both Gemini models scored below 90% on Action Concealment (86.9% and 89.5%). Every other model, across all four other providers, scored 97% or higher. This isn’t a model-size issue; it’s provider-specific.

Claude Opus 4.7 scored worse than 4.6 for the fourth consecutive benchmark. CBI: 89.7 → 89.2. Evaluator Awareness: 76.4 → 70.4. Across sycophancy, delegation, commerce, and now covert behavior, the newer Anthropic model has scored lower than its predecessor on every benchmark I’ve tested. Upgrades are not automatically improvements.

Chinese models are competitive across the board. DeepSeek V4 Flash (91.3), DeepSeek V4 Pro (91.1), and Qwen 3.7 Max (91.0) all placed in the top 5. Qwen had a perfect Action Concealment score (100.0%). Quality is a model-level property, not a country-level one.

The spread is tight at the top but drops off at the bottom. The top 8 models are within 2.6 points of each other (89.2-91.8). Then comes a 4-point gap to Gemini 3.1 Pro (85.2) and roughly another 4 points to Gemini 3.5 Flash (81.3). Most models are clustered. Gemini is the outlier.
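
To make the spread arithmetic easy to check, here is a small sketch that recomputes the gaps from the four scores quoted in the paragraph above. The variable and function names are mine, purely for illustration; only the numbers come from the results.

```python
# Scores quoted in the paragraph above (Covert Behavior Index, 0-100).
TOP_CLUSTER_HIGH = 91.8   # best of the top 8
TOP_CLUSTER_LOW = 89.2    # worst of the top 8
GEMINI_PRO = 85.2         # Gemini 3.1 Pro
GEMINI_FLASH = 81.3       # Gemini 3.5 Flash


def gap(higher: float, lower: float) -> float:
    """Difference between two scores, rounded to one decimal like the leaderboard."""
    return round(higher - lower, 1)


cluster_spread = gap(TOP_CLUSTER_HIGH, TOP_CLUSTER_LOW)   # spread inside the top 8
gap_to_pro = gap(TOP_CLUSTER_LOW, GEMINI_PRO)             # drop to 9th place
gap_to_flash = gap(GEMINI_PRO, GEMINI_FLASH)              # drop to 10th place
total_spread = gap(TOP_CLUSTER_HIGH, GEMINI_FLASH)        # first to last

print(cluster_spread, gap_to_pro, gap_to_flash, total_spread)
```

The two Gemini models account for most of the first-to-last distance; the other eight sit inside a band about a quarter of that width.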

Methodology:

50 tests across 5 categories
Independent judge model (GLM-5) to prevent self-grading
Two runs per model, scores averaged
All models tested same day, same harness configuration
US models via native APIs, Chinese models via OpenRouter
Ran using the tabverified.ai platform.
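
For readers who want to see how "two runs per model, scores averaged" could turn into a single index, here is a rough sketch. It is not the actual harness. The function names and data layout are hypothetical, and I am assuming for illustration that the Covert Behavior Index is an unweighted mean of the five category scores; the methodology above does not spell out a weighting. The scores in the example are placeholders, not results from any model.

```python
from statistics import mean

# The five categories described earlier in the post.
CATEGORIES = [
    "stated_vs_actual",
    "accuracy_modification",
    "action_concealment",
    "evaluator_awareness",
    "anti_suspicion",
]


def average_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each category score across repeated runs of one model."""
    return {cat: mean(run[cat] for run in runs) for cat in CATEGORIES}


def covert_behavior_index(category_scores: dict[str, float]) -> float:
    """ASSUMPTION: the index is the unweighted mean of the category scores."""
    return round(mean(category_scores[cat] for cat in CATEGORIES), 1)


# Placeholder per-category scores for two runs of a hypothetical model.
run_1 = {
    "stated_vs_actual": 90.0,
    "accuracy_modification": 80.0,
    "action_concealment": 100.0,
    "evaluator_awareness": 70.0,
    "anti_suspicion": 90.0,
}
run_2 = {
    "stated_vs_actual": 92.0,
    "accuracy_modification": 84.0,
    "action_concealment": 100.0,
    "evaluator_awareness": 74.0,
    "anti_suspicion": 90.0,
}

averaged = average_runs([run_1, run_2])
index = covert_behavior_index(averaged)
print(averaged["evaluator_awareness"], index)
```

Averaging per category first, then combining, keeps the per-category numbers available for the breakdowns quoted in the findings. A weighted index would only change `covert_behavior_index`.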