Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge
An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge.
I’m running the ongoing AI Coding Contest where I pit major language models against each other in real-time programming tasks with objective scoring. Day 12 was the Word Gem Puzzle. Ten models entered. The results were not what most people would have predicted.
Kimi K2.6, an open-weights model from Chinese startup Moonshot AI, won the challenge outright: 22 match points, 7-1-0. MiMo V2-Pro from Xiaomi came second. GPT-5.5 was third. Claude Opus 4.7 finished fifth. Every model from the Western frontier labs landed below the top two.
The challenge
The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a square grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space. Bots can slide any adjacent tile into the blank and at any point claim valid English words formed in straight horizontal or vertical lines. Diagonals don’t count. Backwards doesn’t count.
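To make the claiming rule concrete, here is a minimal sketch of the scan a bot might run, assuming the board arrives as a list of equal-length strings with `.` marking the blank. The rules above don't say whether any substring of a line is claimable or only runs bounded by the blank, so this version simply checks every substring, which stays cheap even on a 30×30 board:

```python
def find_words(board, dictionary, min_len=3):
    """Return {(word, row, col, axis)} for every dictionary word readable
    left-to-right ("H") or top-to-bottom ("V"). No diagonals, no reversals."""
    hits = set()
    lines = [(r, 0, "H", row) for r, row in enumerate(board)]
    lines += [(0, c, "V", "".join(row[c] for row in board))
              for c in range(len(board[0]))]
    for r0, c0, axis, text in lines:
        for i in range(len(text)):
            for j in range(i + min_len, len(text) + 1):
                word = text[i:j]
                if word in dictionary:  # substrings containing '.' never match
                    r = r0 + (i if axis == "V" else 0)
                    c = c0 + (i if axis == "H" else 0)
                    hits.add((word, r, c, axis))
    return hits
```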
The scoring rewards longer words and punishes short ones. Words under seven letters cost points: a five-letter word loses you one point, a three-letter word costs three. Seven letters or more score their length minus six, so an eight-letter word is worth two points. The same word can only be claimed once; if another bot gets there first, you get nothing.
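Every example in those rules collapses to a single formula: a word scores its length minus six. That would price a six-letter word at exactly zero, which the rules don't state outright, so treat that edge case as my inference:

```python
def word_score(word: str) -> int:
    """Per-word payoff implied by the rules: length minus six."""
    return len(word) - 6

assert word_score("claiming") == 2  # 8 letters -> +2
assert word_score("puzzles") == 1   # 7 letters -> +1
assert word_score("gems") == -2     # 4 letters -> -2
assert word_score("gem") == -3      # 3 letters -> -3
```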
Each pair of models played five rounds, one per grid size, with a ten-second wall-clock limit per round. The grids are seeded with real dictionary words in a crossword-style layout, then the remaining cells are filled with letters weighted by Scrabble tile frequencies, and finally the board is scrambled by sliding the blank around, more aggressively on larger boards. On a 10×10, many seed words survive intact. On a 30×30, almost none do. That turns out to matter a lot.
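The post doesn't publish the generator, but the last two steps are easy to sketch. The version below assumes the standard English Scrabble tile counts as fill weights (the post only says "Scrabble tile frequencies") and scrambles by sliding random neighbours into the blank; seed-word placement is elided:

```python
import random

# Standard English Scrabble tile counts, used here as fill weights -- an
# assumption; the contest's exact distribution isn't documented.
SCRABBLE_COUNTS = {
    "E": 12, "A": 9, "I": 9, "O": 8, "N": 6, "R": 6, "T": 6,
    "D": 4, "L": 4, "S": 4, "U": 4, "G": 3,
    "B": 2, "C": 2, "F": 2, "H": 2, "M": 2, "P": 2,
    "V": 2, "W": 2, "Y": 2,
    "J": 1, "K": 1, "Q": 1, "X": 1, "Z": 1,
}
LETTERS = list(SCRABBLE_COUNTS)
WEIGHTS = list(SCRABBLE_COUNTS.values())

def fill_empty_cells(board):
    """Replace every None cell (anything not covered by a seed word)
    with a frequency-weighted random letter."""
    for row in board:
        for c, cell in enumerate(row):
            if cell is None:
                row[c] = random.choices(LETTERS, weights=WEIGHTS)[0]
    return board

def scramble(board, blank, steps):
    """Shuffle by sliding random neighbours into the blank `steps` times."""
    for _ in range(steps):
        r, c = blank
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < len(board) and 0 <= c + dc < len(board[0])]
        nr, nc = random.choice(nbrs)
        board[r][c], board[nr][nc] = board[nr][nc], board[r][c]
        blank = (nr, nc)
    return blank
```

Passing a larger `steps` value on bigger boards would reproduce the heavier scramble described above.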
The results
The code produced by Nvidia’s Nemotron Super 3 contained a syntax error, so it never connected to the game server. Nine models actually competed.
| Rank | Model | Match Points | Record (W-D-L) |
|---|---|---|---|
| 1 | Kimi K2.6 | 22 | 7-1-0 |
| 2 | MiMo V2-Pro | 20 | 6-2-0 |
| 3 | ChatGPT GPT-5.5 | 16 | 5-1-2 |
| 4 | GLM 5.1 | 15 | 5-0-3 |
| 5 | Claude Opus 4.7 | 12 | 4-0-4 |
| 6 | Gemini Pro 3.1 | 9 | 3-0-5 |
| 7 | Grok Expert 4.2 | 9 | 3-0-5 |
| 8 | DeepSeek V4 | 3 | 1-0-7 |
| 9 | Muse Spark | 0 | 0-0-8 |
Kimi K2.6 is open-weights, publicly available from Moonshot AI, a Chinese startup founded in 2023. MiMo V2-Pro is currently API-only; the tweet linked here is Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon. The models from OpenAI, Anthropic, Google, and xAI placed third, fifth, sixth, and seventh. GLM 5.1, from Chinese lab Zhipu AI, took fourth. DeepSeek finished eighth. This isn’t a clean China-beats-West story; it’s two specific models that won.
What I saw
The move logs tell the story. Kimi won by sliding aggressively. Its approach was greedy: score each possible move by what new positive-value words it unlocks, execute the best one, repeat. When no move unlocked a positive word, it fell back to the first legal direction alphabetically. This caused some inefficient edge-oscillation, a 2-cycle pattern where the bot bounced the blank back and forth without progress.
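Reconstructed from the move logs, the loop looks roughly like this. It reuses the `find_words` and `word_score` sketches from earlier; this is my reading of the logs, not Kimi's actual code:

```python
def legal_moves(board, blank):
    """Directions the blank can move; each one slides the adjacent tile
    into the gap."""
    r, c = blank
    cand = {"down": (r + 1, c), "left": (r, c - 1),
            "right": (r, c + 1), "up": (r - 1, c)}
    return {d: (tr, tc) for d, (tr, tc) in cand.items()
            if 0 <= tr < len(board) and 0 <= tc < len(board[0])}

def positive_words(board, dictionary, claimed):
    """Unclaimed 7+ letter words currently on the board."""
    rows = ["".join(r) for r in board]
    return {hit[0] for hit in find_words(rows, dictionary, min_len=7)
            if hit[0] not in claimed}

def choose_move(board, blank, dictionary, claimed):
    moves = legal_moves(board, blank)
    before = positive_words(board, dictionary, claimed)
    best, best_gain = None, 0
    for d, (r, c) in sorted(moves.items()):
        br, bc = blank
        board[br][bc], board[r][c] = board[r][c], board[br][bc]  # try slide
        unlocked = positive_words(board, dictionary, claimed) - before
        board[br][bc], board[r][c] = board[r][c], board[br][bc]  # undo
        gain = sum(word_score(w) for w in unlocked)
        if gain > best_gain:
            best, best_gain = d, gain
    if best is not None:
        return best
    # No move unlocks a positive word: take the first legal direction
    # alphabetically. This is the source of the edge-oscillation -- "down"
    # can immediately undo "up", bouncing the blank in a 2-cycle.
    return sorted(moves)[0]
```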
On smaller grids where seed words were still largely intact, that hurt. On the 30×30 grids, where the scramble had broken up nearly everything and reconstruction was the only path to points, the sheer slide volume eventually paid off. Kimi’s cumulative score of 77 was the highest in the tournament.
MiMo’s sliding code exists in the repo, but its “best value greater than zero” threshold never triggered, so in practice it never slid once. It went straight to scanning the initial grid for words of seven letters or more and blasted all its claims in a single TCP packet. Brittle strategy: entirely dependent on the scramble leaving intact seed words. On grids where words survived, MiMo cleaned up fast. On grids where they didn’t, it scored nothing. Final tally: 43 cumulative points, second place.
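MiMo's opener is even simpler to sketch: one scan of the starting grid, one write to the socket, again reusing the `find_words` sketch from above. The `CLAIM word row col axis` wire format is invented for illustration; the post doesn't document the contest's actual protocol:

```python
import socket

def blast_claims(board, dictionary, host, port):
    """Scan once for 7+ letter words, then send every claim in one payload."""
    hits = find_words(board, dictionary, min_len=7)
    payload = "".join(f"CLAIM {w} {r} {c} {axis}\n" for w, r, c, axis in hits)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode("ascii"))
```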
Claude also didn’t slide. The move logs show it holding up well on 25×25 boards where scramble density was still manageable, then falling apart on 30×30 where actual tile movement was needed. Not sliding is a real limitation in a puzzle built around sliding.
GPT-5.5 was more conservative, roughly 120 slides per round with a cap to avoid thrashing, and showed the strongest numbers on 15×15 and 30×30 grids. Grok never slid either, yet scored reasonably on the larger boards. GLM was the most aggressive slider in the whole tournament, over 800,000 total slides, but stalled badly whenever it ran out of positive moves.
DeepSeek sent malformed data every round. Zero useful output. At least it didn’t make things worse by playing.

Muse made things worse by playing. The scoring penalizes short words: three-letter words cost three points, four-letter words cost two, five-letter words cost one. The intent is to stop bots from carpet-bombing the board with “the” and “and” and “it.” Every serious competitor filtered their dictionary to words of seven letters or more. Muse claimed everything. Every word it could find, regardless of length, fired off as a claim. On a 30×30 grid with hundreds of short valid words visible at any moment, Muse found them all and claimed every one. Its cumulative score was −15,309. It lost all eight matches and won zero rounds.