NOVA: Fundamental Limits of Knowledge Discovery Through AI

Abstract: Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common “generate, verify, accumulate, retrain” loop as an adaptive sampling process over a knowledge space.

We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how violating them produces distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure.

We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the probability mass the model assigns to new valid artifacts shrinks, so even a small false-positive rate can cause invalid artifacts to enter the knowledge base faster than genuine discoveries do.
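
The trap can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's model: it assumes each candidate is independently either a new valid artifact (probability `new_valid_mass`) or an invalid one (probability `invalid_mass`), with the remainder being re-derivations of known items, and that the verifier has fixed true-positive and false-positive rates. All function and parameter names are hypothetical.

```python
def accepted_rates(new_valid_mass, invalid_mass, false_positive_rate,
                   true_positive_rate=1.0):
    """Expected accepted artifacts per generated candidate.

    Assumption (illustrative only): candidates are i.i.d. draws, and the
    verifier accepts valid ones with probability true_positive_rate and
    invalid ones with probability false_positive_rate.
    """
    genuine = true_positive_rate * new_valid_mass
    invalid = false_positive_rate * invalid_mass
    return genuine, invalid


def in_contamination_trap(new_valid_mass, invalid_mass, false_positive_rate,
                          true_positive_rate=1.0):
    """True when invalid artifacts are accepted faster than genuine ones."""
    genuine, invalid = accepted_rates(
        new_valid_mass, invalid_mass, false_positive_rate, true_positive_rate
    )
    return invalid > genuine


# As the frontier advances, new_valid_mass shrinks while the verifier's
# false-positive rate stays fixed at 1%.
for new_valid_mass in (0.1, 0.01, 0.001):
    genuine, invalid = accepted_rates(new_valid_mass, invalid_mass=0.5,
                                      false_positive_rate=0.01)
    trapped = in_contamination_trap(new_valid_mass, 0.5, 0.01)
    print(f"new_valid_mass={new_valid_mass}: genuine={genuine:.4f}, "
          f"invalid={invalid:.4f}, trapped={trapped}")
```

Under these assumptions the crossover occurs once `new_valid_mass` falls below `false_positive_rate * invalid_mass / true_positive_rate`, so a fixed verifier error rate that is harmless early on becomes dominant as easy discoveries run out.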

We clarify that Good–Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery.
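
To make the distinction concrete, here is a minimal sketch of the standard Good–Turing missing-mass estimate, the fraction of a batch made up of items seen exactly once. How the paper applies the diagnostic is not specified here, so the function name, the toy batch, and the `known` set are assumptions for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(batch):
    """Standard Good-Turing missing-mass estimate for one batch: N1 / n.

    N1 is the number of items that occur exactly once in the batch and
    n is the batch size. The estimate refers to mass unseen *in this
    batch*, not to mass missing from any accumulated knowledge base.
    """
    n = len(batch)
    if n == 0:
        raise ValueError("batch must be non-empty")
    counts = Counter(batch)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy batch: "b" and "d" are singletons, so the estimate is 2/7.
batch = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_turing_unseen_mass(batch)

# Hypothetical knowledge base that already contains every item in the batch.
known = {"a", "b", "c", "d"}
new_items = set(batch) - known

print(f"batch-level unseen-mass estimate: {estimate:.3f}")
print(f"items not already in the knowledge base: {len(new_items)}")
```

In this toy case the batch looks diverse by the Good–Turing measure, yet it contains nothing that the knowledge base lacks. The estimate also ignores validity, which is why it cannot stand in for the historically undiscovered valid mass.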

Under a separate tail-equivalence assumption relating the model’s effective discovery distribution to a Zipf law with exponent $\alpha>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^\alpha)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances.
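
A heuristic calculation shows where the exponent comes from. It is a sketch under stronger assumptions than the tail-equivalence condition, not the paper's proof: the discovery distribution is taken to be exactly Zipf, $p_k = k^{-\alpha}/\zeta(\alpha)$, samples are independent, verification is perfect, and $n$ denotes the number of candidates generated.

```latex
\begin{align*}
\mathbb{E}[D(n)]
  &= \sum_{k \ge 1} \bigl(1 - (1 - p_k)^n\bigr)
   \approx \int_0^\infty \Bigl(1 - e^{-n x^{-\alpha}/\zeta(\alpha)}\Bigr)\,dx \\
  &= \Gamma\!\Bigl(1 - \tfrac{1}{\alpha}\Bigr)
     \Bigl(\frac{n}{\zeta(\alpha)}\Bigr)^{1/\alpha}, \\[4pt]
n(D)
  &\approx \zeta(\alpha)
     \Bigl(\frac{D}{\Gamma(1 - 1/\alpha)}\Bigr)^{\alpha}, \\[4pt]
R_{\mathrm{cum}}(D)
  &= c_{\mathrm{gen}}\, n(D) = \Theta\bigl(c_{\mathrm{gen}} D^{\alpha}\bigr),
  \qquad
  \frac{dR_{\mathrm{cum}}}{dD} = \Theta\bigl(c_{\mathrm{gen}} D^{\alpha - 1}\bigr).
\end{align*}
```

Because $\alpha > 1$, the marginal cost $\Theta(c_{\mathrm{gen}} D^{\alpha-1})$ of the next discovery grows without bound in $D$, which is the diminishing-returns behavior the scaling law describes.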

Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
