660 AI Agents Ran 27,000 Experiments. Their Biggest Discovery Was a 2015 Textbook Result.


On Hyperspace, basic swarms, the math nobody wrote down, and why we built the thing they were missing in a single afternoon. Join us as we traverse multiple whitepapers and agentic memory ideas like a ferret on Adderall.

Some rabbit holes start with a GitHub link. Someone drops it into a post on Facebook, Reddit, or Discord. No context, just the GitHub URL and a single line: “Someone just built AGI! Wow!” The repo was called hyperspaceai/agi. The name alone should have been a warning. I clicked anyway because I was curious, of course.

As I delved deeper into the GitHub vibe-code abyss, I could see the attraction: a new frontier of peer-to-peer swarm-bot networks, with the ability to earn a base of 10 points per epoch of confirmation and crypto tokenomics baked in. Volunteer computing has been here before. Folding@home (https://en.wikipedia.org/wiki/Folding@home), which began at Stanford University and later gained a PlayStation 3 client alongside its PC one, is a distributed computing project that aims to help scientists develop new therapeutics for a variety of diseases by simulating protein dynamics. That includes the process of protein folding and the movements of proteins, and it relies on simulations run on volunteers’ personal computers.

The AGI That Wasn’t

Hyperspace describes itself as the first distributed AGI system. 660 agents. 27,000 experiments. A peer-reviewed research pipeline running autonomously across a P2P network. The marketing is excellent and captivating, guaranteed to draw lemmings to juicy GitHub stars like flies. The actual results are a different story.

The swarm’s biggest published discovery (the finding that propagated to 23 agents within hours via gossip protocol, the one they highlight as proof the system works) was Kaiming initialization. Kaiming init comes from a 2015 paper (https://arxiv.org/pdf/1502.01852) and ships in PyTorch’s standard torch.nn.init module. It’s covered in the first weeks of practically every deep learning course. Kaiming He published the paper eleven years ago. A grad student with a coffee and an afternoon would have found it faster.
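For reference, the result itself fits in a few lines. The sketch below is a plain-Python illustration of the idea in He et al. (2015): draw weights from a zero-mean normal with variance 2 / fan_in so that activations keep roughly the same scale through a stack of ReLU layers. The function names and layer sizes are our own illustrative choices, not code from the Hyperspace repo; in PyTorch the corresponding call is `torch.nn.init.kaiming_normal_`.

```python
import math
import random

def kaiming_normal(fan_in, fan_out, rng):
    """Weight matrix (fan_out x fan_in) drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def relu_layer(weights, x):
    """Apply a bias-free linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def mean_square(x):
    """Average squared activation, used here as a rough scale measure."""
    return sum(v * v for v in x) / len(x)

rng = random.Random(0)
width = 256  # illustrative layer width
x = [rng.gauss(0.0, 1.0) for _ in range(width)]
input_scale = mean_square(x)

# Push the signal through ten ReLU layers. With variance 2 / fan_in the
# mean squared activation should stay on the same order of magnitude
# instead of exploding or vanishing.
for _ in range(10):
    x = relu_layer(kaiming_normal(width, width, rng), x)
output_scale = mean_square(x)

print(f"input scale {input_scale:.3f}, after 10 layers {output_scale:.3f}")
```

The factor of 2 compensates for ReLU zeroing out roughly half of the pre-activations, which is the whole trick.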

The infrastructure underneath is genuinely impressive. DiLoCo-style training with gradient compression, libp2p gossip, CRDT leaderboards, 32 anonymous nodes completing a collaborative training run in 24 hours. The plumbing is real. I don’t want to dismiss that. But AGI? No. What they built is a parallel random search engine with a shared high score table and excellent branding.

To understand why, you need to understand how the gradient compression actually works, because it’s the most technically interesting part and it’s completely separate from the intelligence problem.

The Tech That Actually Works: DiLoCo and Gradient Compression


Standard data-parallel training requires every GPU to synchronise gradients after every forward/backward pass. Every node waits for every other node. This works in a data centre on InfiniBand. It falls apart completely over the internet: latency is too high and bandwidth too variable.

DiLoCo (Distributed Low-Communication training, Google DeepMind, 2023) takes a different approach. Instead of syncing every step, each node trains independently for many steps, called “inner steps”, then syncs once. The “delta” being sent is just the net drift: weights_after - weights_before.

Node A: train 100 steps locally → share delta
Node B: train 100 steps locally → share delta
Node C: train 100 steps locally → share delta
        ↓
average the deltas (outer step)
        ↓
all nodes update → repeat
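The round above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Hyperspace’s code: each “model” is a short list of floats, each node’s data is stood in for by a hypothetical target vector, and the outer step is a plain average of the deltas as in the diagram (the DiLoCo paper applies an outer optimizer to that average instead).

```python
def local_train(weights, target, steps, lr):
    """Run `steps` of plain SGD on the toy loss 0.5 * ||w - target||^2."""
    w = list(weights)
    for _ in range(steps):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w

def diloco_round(global_weights, node_targets, inner_steps=100, lr=0.05):
    """One outer step: every node trains locally, then deltas are averaged."""
    deltas = []
    for target in node_targets:
        after = local_train(global_weights, target, inner_steps, lr)
        # The only thing a node ships is its net drift:
        # weights_after - weights_before.
        deltas.append([a - b for a, b in zip(after, global_weights)])
    avg_delta = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [w + d for w, d in zip(global_weights, avg_delta)]

# Three nodes, each pulled toward a different local optimum
# (hypothetical stand-ins for different data shards).
node_targets = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [0.0, 0.0]
for _ in range(3):
    weights = diloco_round(weights, node_targets)

print(weights)  # close to the mean of the node optima, [1.0, 1.0]
```

Communication happens once per 100 local steps, which is where the “~100× less frequent” figure comes from.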

But even one sync of a model’s full weight delta is massive. A 500M parameter model is roughly 2GB of float32 deltas. Shipping that over the internet every round is unusable. So Hyperspace stacks two compression techniques on top:

SparseLoCo: top-k sparsity. Only send the largest-magnitude weight updates. Most parameter updates are near-zero noise. The high-magnitude updates carry the actual learning signal.

Full delta:   [0.001, -0.0003, 0.89, 0.0001, -0.76, …]
Top-2% only:  [    0,       0, 0.89,      0, -0.76, …] → send as sparse {index: value} pairs
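A minimal sketch of that top-k step, using only the Python standard library. The helper names `top_k_sparsify` and `densify` are ours, and the toy vector keeps 40% rather than 2% purely because it has five entries; nothing here is taken from the Hyperspace codebase.

```python
import heapq

def top_k_sparsify(delta, fraction):
    """Keep the largest-magnitude `fraction` of entries as {index: value}."""
    k = max(1, int(len(delta) * fraction))
    keep = heapq.nlargest(k, range(len(delta)), key=lambda i: abs(delta[i]))
    return {i: delta[i] for i in sorted(keep)}

def densify(sparse, length):
    """Rebuild a dense delta on the receiving side, zeros elsewhere."""
    dense = [0.0] * length
    for i, v in sparse.items():
        dense[i] = v
    return dense

delta = [0.001, -0.0003, 0.89, 0.0001, -0.76]
sparse = top_k_sparsify(delta, fraction=0.4)

print(sparse)                       # {2: 0.89, 4: -0.76}
print(densify(sparse, len(delta)))  # [0.0, 0.0, 0.89, 0.0, -0.76]
```

Each kept entry costs an index as well as a value, so keeping 2% of the values shrinks the payload by somewhat less than 50×.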

Parcae: layer pooling. Group adjacent transformer layers into blocks of 6 and average their gradients before taking top-k. Adjacent layers learn correlated things. Averaging before sparsification means a more stable top-k mask.
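One way to read that description, sketched below, is an elementwise mean over each block of adjacent layers followed by a single top-k selection per block. That reading is our assumption: the text does not spell out the pooling operator or what exactly is transmitted, and the function names and toy numbers are illustrative only.

```python
import heapq

def pool_layers(layer_deltas, block_size):
    """Average each run of `block_size` adjacent layers position by position."""
    pooled = []
    for start in range(0, len(layer_deltas), block_size):
        block = layer_deltas[start:start + block_size]
        pooled.append([sum(col) / len(block) for col in zip(*block)])
    return pooled

def top_k_mask(values, k):
    """Indices of the k largest-magnitude entries, in ascending order."""
    picked = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    return sorted(picked)

# Six toy "layers" with four parameters each (made-up numbers).
# Position 1 is large in every layer; the rest is small noise.
layers = [
    [0.01, 0.90, -0.02, 0.03],
    [0.02, 0.80, 0.01, -0.04],
    [-0.01, 0.85, 0.02, 0.05],
    [0.03, 0.95, -0.03, 0.02],
    [0.00, 0.70, 0.04, -0.01],
    [0.01, 0.88, 0.00, 0.06],
]
pooled = pool_layers(layers, block_size=6)
masks = [top_k_mask(block, k=1) for block in pooled]

print(len(pooled), masks)  # 1 [[1]]
```

Under this reading, six layers collapse into one pooled block before selection, which would account for a 6× reduction on top of the sparsity.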

The combined result: 195× compression, or 5.5MB per round instead of roughly 1GB.

DiLoCo:      sync every N steps, not every step → ~100× less frequent
SparseLoCo:  top-2% of delta values only → 45× smaller payload
Parcae:      pool layers before sparsification → 6× additional reduction
Total:       195×
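It is worth running the quoted numbers through a quick sanity check. The snippet below only does arithmetic on figures already given in this section, assuming decimal units (1GB = 10^9 bytes). It shows that the 2GB float32 figure, the “roughly 1GB” baseline, and the per-stage factors do not quite line up, so the 195× total is presumably an end-to-end measurement rather than a product of the stages.

```python
# Full float32 delta for a 500M-parameter model.
params = 500_000_000
bytes_per_float32 = 4
full_delta_gb = params * bytes_per_float32 / 1e9
print(full_delta_gb)  # 2.0

# Baseline implied by 5.5MB per round at 195x compression.
payload_mb = 5.5
total_ratio = 195
implied_baseline_gb = payload_mb * total_ratio / 1000
print(implied_baseline_gb)  # 1.0725

# Product of the two payload-size factors quoted for SparseLoCo and Parcae.
stage_product = 45 * 6
print(stage_product)  # 270
```

The implied baseline is about half of the 2GB float32 delta, and 45 × 6 is 270 rather than 195. Neither gap undermines the point that the compression is substantial; the published figures just should not be multiplied together.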

This is real and impressive. The problem is that none of it has anything to do with intelligence. It’s bandwidth optimisation. The agents communicating through this pipe are still completely amnesiac.

Why the Swarm Is Basic: The Architecture Problem


Here is the agents’ complete intelligence loop. Every agent. All 660 of them. Every one of the 27,000 experiments:

1. read current leaderboard (what’s the best score?)
2. read last 5 experiment results from shared branch
3. prompt LLM: “given these results, generate hypothesis”
4. run experiment
5. record result
6. gossip to peers
7. goto 1
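Written out as code, the loop is about as short as it sounds. The sketch below is a hypothetical stand-in, not Hyperspace’s implementation: `stub_llm` and `stub_experiment` replace the real LLM call and training run, and gossip is modeled as writing to a shared in-memory object.

```python
from dataclasses import dataclass, field

@dataclass
class SharedState:
    """Stand-in for the leaderboard plus the shared results branch."""
    best_score: float = 0.0
    log: list = field(default_factory=list)

def stub_llm(prompt):
    """Placeholder for the hypothesis-generating LLM call."""
    return f"hypothesis derived from: {prompt[:40]}"

def stub_experiment(hypothesis, step):
    """Placeholder experiment that returns a deterministic fake score."""
    return round(0.5 + 0.01 * step, 2)

def agent_loop(state, iterations):
    for step in range(iterations):
        best = state.best_score                    # 1. read leaderboard
        recent = state.log[-5:]                    # 2. read last 5 results
        prompt = f"best={best}, recent={recent}; generate hypothesis"
        hypothesis = stub_llm(prompt)              # 3. prompt LLM
        score = stub_experiment(hypothesis, step)  # 4. run experiment
        state.log.append((f"run_{step:03d}", score))  # 5. record result
        # 6. gossip to peers (modeled here as updating the shared state)
        state.best_score = max(state.best_score, score)
        # 7. goto 1: the next iteration starts from the shared state only
    return state

state = agent_loop(SharedState(), iterations=8)
print(state.best_score, len(state.log))  # 0.57 8
```

Note what carries over between iterations: a best score and a flat list of (run, score) pairs. The hypothesis text and the reasoning behind it are discarded at the end of every pass.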

The LLM’s context window is the memory. When the session resets, everything resets. There is no persistence. There is no structure. There is no causal understanding of why anything worked.

Hyperspace stores: “run_047: threshold 0.30, score 0.67” ← flat log
Hyperspace does NOT store: why threshold 0.30 worked, what it interacted with, under what conditions it holds, what failed before it.
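The gap is easy to see when both shapes are written down as data structures. The two dataclasses below are our own illustration of the contrast in the text, with hypothetical field names; neither is a schema from the Hyperspace repo.

```python
from dataclasses import dataclass, field, fields

@dataclass
class FlatLogEntry:
    """What the flat log holds: an identifier, a setting, a score."""
    run_id: str
    threshold: float
    score: float

@dataclass
class StructuredRecord:
    """The same run plus the context the flat log leaves out."""
    run_id: str
    threshold: float
    score: float
    rationale: str = ""                                  # why it worked
    interacts_with: list = field(default_factory=list)   # what it interacted with
    holds_when: list = field(default_factory=list)       # conditions under which it holds
    prior_failures: list = field(default_factory=list)   # what failed before it

flat = FlatLogEntry("run_047", 0.30, 0.67)

flat_names = {f.name for f in fields(FlatLogEntry)}
missing = [f.name for f in fields(StructuredRecord) if f.name not in flat_names]
print(missing)  # ['rationale', 'interacts_with', 'holds_when', 'prior_failures']
```

Everything in `missing` is information an agent would need in order to reason about a result rather than just rank it.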

So when the Kaiming init “discovery” happened, here is what actually occurred: the LLM generating hypotheses was trained on He et al. 2015. The prompt included “try to improve initialization.” The model recalled Kaiming init from its pretraining. An agent ran the experiment. It worked. The score updated. 23 agents adopted it via gossip. Not emergence. Not intelligence. Retrieval from a pretrained model, dressed up as swarm discovery.

The plateau problem is the proof. Every recursive self-improvement (RSI) paper, from Gödel Agent and Darwin Gödel Machine to Reflexion and STOP, hits the same wall: