The last six months in LLMs in five minutes
19th May 2026
I put together these annotated slides from my five minute lightning talk at PyCon US 2026, using the latest iteration of my annotated presentation tool.
I presented this lightning talk at PyCon US 2026, attempting to summarize the last six months of developments in LLMs in five minutes.
Six months is a pretty convenient time period to cover, because it captures what I’ve been calling the November 2025 inflection point. November was a critical month in LLMs, especially for coding.
For one thing, the supposedly “best” model (depending mostly on vibes) changed hands five times between the three big providers.
As always, I’m using my “Generate an SVG of a pelican riding a bicycle” test to help illustrate the differences between the models. Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles… and there’s zero chance any AI lab would train a model for such a ridiculous task.
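If you want to try the test yourself, the first practical hurdle is pulling the drawing out of a chatty model response and checking that it parses at all. Here is a minimal sketch of that step, using only the Python standard library. The helper names `extract_svg` and `is_well_formed_svg` are made up for this illustration, and this is not the tooling used for the talk. It only checks that the markup is well-formed XML with an `svg` root; it says nothing about whether the pelican looks any good.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace prefix ElementTree attaches to tags from namespaced SVG documents.
SVG_NS = "{http://www.w3.org/2000/svg}"


def extract_svg(response_text: str) -> Optional[str]:
    """Return the first <svg>...</svg> span in a model response, or None."""
    match = re.search(
        r"<svg\b.*?</svg>", response_text, flags=re.DOTALL | re.IGNORECASE
    )
    return match.group(0) if match else None


def is_well_formed_svg(svg_text: str) -> bool:
    """True if the text parses as XML and the root element is <svg>."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NS + "svg")


# Hypothetical model response, invented for this example.
response = (
    "Here is your pelican:\n"
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
    '<circle cx="5" cy="5" r="4"/></svg>\n'
    "Hope you like it!"
)
svg = extract_svg(response)
print(svg is not None and is_well_formed_svg(svg))
```

Judging the actual drawing still comes down to looking at it, which is rather the point of the test.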
At the start of November the widely acknowledged “best” model was Claude Sonnet 4.5, released on 29th September. It drew me this pelican. In November it was overtaken by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and then Anthropic took the crown back again with Claude Opus 4.5. I think Gemini 3 drew the best pelican out of this lot, but pelicans aren’t everything. Most practitioners will agree that Opus 4.5 held the crown for the next couple of months.
It took a little while for this to become clear, but the real news from November was that the coding agents got good. OpenAI and Anthropic had spent most of 2025 running Reinforcement Learning from Verifiable Rewards to increase the quality of code written by their models, especially when paired up with their Codex and Claude Code agent harnesses. In November the results of this work became apparent. Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done, without needing to spend most of your time fixing their stupid mistakes.
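“Verifiable rewards” means rewards that a program can check automatically, such as whether generated code passes its tests. The sketch below is a toy illustration of that idea only. It is not how OpenAI or Anthropic built their training pipelines, and the function name `verifiable_reward`, the `solve` entry point, and the test cases are all hypothetical.

```python
from typing import Any, List, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected return value)


def verifiable_reward(
    candidate_source: str,
    test_cases: List[TestCase],
    func_name: str = "solve",
) -> float:
    """Return 1.0 if the candidate code passes every test case, else 0.0.

    WARNING: exec() runs arbitrary code. A real system would run candidates
    inside a sandbox; this toy version does not.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0
    return 1.0


# Two invented model outputs for the task "add two numbers".
good_candidate = "def solve(a, b):\n    return a + b\n"
bad_candidate = "def solve(a, b):\n    return a - b\n"
cases: List[TestCase] = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]

print(verifiable_reward(good_candidate, cases))  # 1.0
print(verifiable_reward(bad_candidate, cases))   # 0.0
```

The appeal of a signal like this is that nobody has to grade the output by hand: the code either passes its checks or it doesn't.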
Also in November, this happened: the first commit to an obscure (back then) repo called “Warelay” by some guy called Pete.
Over the holiday period, from December to January, a whole lot of us took advantage of the break to have a poke at these new models and coding agents and see what they could do. They could do a lot! Some of us got a little bit over-excited. I had my own short-lived bout of a form of LLM psychosis as I started spinning up wildly ambitious projects to see how far I could push them.
One of my projects was a vibe-coded implementation of JavaScript in Python, a loose port of MicroQuickJS, which I called micro-javascript. You can try it out in your browser in this playground.
That playground demo shows JavaScript code run using my micro-javascript library, in Python, running inside Pyodide, running in WebAssembly, running in JavaScript, running in a browser! It’s pretty cool! But did anyone out there need a buggy, slow, insecure half-baked implementation of JavaScript in Python? They did not. I have quite a few other projects from that holiday period that I have since quietly retired!
On to February. Remember that Warelay project that had its first commit at the end of November?
In December and January it had gone through quite a few name changes… and by February it was taking the world by storm under its final name, OpenClaw. The amount of attention it got is pretty astonishing for a project that was less than three months old.
OpenClaw is a “personal AI assistant”, and we actually got a generic term for these, based on NanoClaw and ZeroClaw and suchlike… they’re called Claws.
Mac Minis started to sell out around Silicon Valley, because people were buying them to run their Claws. Drew Breunig joked to me that this is because they’re the new digital pets, and a Mac Mini is the perfect aquarium for your Claw.
My favourite metaphor for Claws is Alfred Molina’s Doc Ock in the 2004 movie Spider-Man 2. His claws were powered by AI, and were perfectly safe provided nothing damaged his inhibitor chip… after which they turned evil and took over.
Also in February: Gemini 3.1 Pro came out, and drew me a really good pelican riding a bicycle. Look at this! It’s even got a fish in its basket.
And then Google’s Jeff Dean tweeted this video of an animated pelican riding a bicycle, plus a frog on a penny-farthing, a giraffe driving a tiny car, an ostrich on roller skates, a turtle kickflipping a skateboard and a dachshund driving a stretch limousine. So maybe the AI labs have been paying attention after all!
A lot of stuff happened just in the past month.
Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Also last month, Chinese AI lab GLM came out with GLM-5.1, an open weight 1.5TB monster! This is a very effective model… if you can afford the hardware to run it.
GLM-5.1 drew me this very competent pelican on a bicycle.
… though when it tried to animate it, the bicycle bounced off into the top and got warped.
Charles on Bluesky suggested I try it with a North Virginia Opossum on an E-scooter.
And it did this! I’ve tried this on other models and they don’t even come close. “Cruising the commonwealth since dusk” is perfect. It’s animated too.
The other neat Chinese open weight models in April came from Qwen. Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7. That’s a 20.9GB open weight model that runs on my laptop! (I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.)
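For a sense of how a model that size fits on a laptop, here is a rough back-of-envelope calculation. It assumes the “35B” in the model name means roughly 35 billion total parameters, that 1GB is 10^9 bytes, and that the 20.9GB file is almost entirely weights. None of those assumptions is confirmed above, so treat the result as a ballpark figure.

```python
def bits_per_parameter(file_size_gb: float, params_billions: float) -> float:
    """Average number of bits stored per parameter for a given file size."""
    total_bits = file_size_gb * 1e9 * 8        # GB -> bytes -> bits
    total_params = params_billions * 1e9
    return total_bits / total_params


# Figures from the paragraph above: a 20.9GB file for a "35B" model.
qwen_bits = bits_per_parameter(20.9, 35)
print(f"{qwen_bits:.2f} bits per parameter")  # 4.78 bits per parameter
```

Under those assumptions the file averages a little under 5 bits per parameter, well below the 16 bits per parameter that half-precision weights would need, which suggests the download is a quantized build.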
Here’s that Claude Sonnet 4.5 pelican from September for comparison.
So those were the two main themes of the past