Artificial adventures

Artificial adventures / 人工智能探险

I’ve been playing around with AI. Nothing I’m doing is particularly exciting, but the internet tends to only surface the most extreme opinions in either direction and I found it useful to hear from friends who have opinions that aren’t optimized for click-through rate. 我最近一直在折腾人工智能。虽然我做的东西没什么特别令人兴奋的，但互联网上往往只充斥着两极分化的极端观点，所以我发现听听朋友们的看法很有用，因为他们的意见并没有为了追求点击率而进行“优化”。

tools I got $20/month subscriptions for anthropic and openai, and also put $20 of credits into each of google, moonshot, deepseek, and cerebras. For some problems I tried out all the models to see how they compared, but after a while I mostly just alternated between opus 4.8 and gpt 5.5. They’re noticeably better than everything else and I rarely hit the usage limits on both at the same time. 工具方面，我订阅了 Anthropic 和 OpenAI 的每月 20 美元服务，并分别向 Google、月之暗面 (Moonshot)、DeepSeek 和 Cerebras 充值了 20 美元的额度。针对某些问题，我尝试了所有模型来对比效果，但一段时间后，我基本只在 Opus 4.8 和 GPT 5.5 之间切换。它们明显优于其他所有模型，而且我很少会同时触及两者的使用上限。

I used claude code, codex, and pi. Both claude code and codex feel like hot garbage. Codex sometimes hits 100% cpu after I close the terminal I was using it in and stays there until killed. Claude code will say things like ‘press escape to cancel this dialog’ but when I press escape it leaves the dialog open and interrupts claude instead. The behaviour of both changes from day to day. Pi works. I haven’t used it heavily enough to have opinions about the design, but it feels like a regular piece of software instead of a fever dream with unit tests. All three are heavily vibe-coded, so I’m curious what the pi folks are doing differently to maintain some baseline level of code quality. 我试用了 Claude Code、Codex 和 Pi。Claude Code 和 Codex 给人的感觉简直是一团糟。Codex 有时在我关闭终端后 CPU 占用率仍会飙升至 100%，直到我强制杀死进程。Claude Code 会提示“按 Esc 取消对话”，但当我按下 Esc 时，它不仅没关闭对话，反而中断了 Claude 的运行。这两者的表现每天都在变。Pi 倒是能正常工作。虽然我还没深入使用到能评价其设计的程度，但它感觉像是一款正常的软件，而不是那种带着单元测试的“发烧梦”。这三者都带有浓厚的“随性编码”（vibe-coded）色彩，所以我很好奇 Pi 的团队到底做了什么不同的工作，才能维持住基本的代码质量水平。

I run them all in bubblewrap and give them read-write access to the current directory and their own config, and read-only access to the nix store. This is the bare minimum of sandboxing - mostly just making sure they can’t access my credentials or break anything that’s not version controlled. It works pretty well so long as I add a note to AGENTS.md that they are sandboxed and remind them they can use nix-shell to fetch tools. Otherwise they spiral into conspiratorial mutterings about malfunctioning disks and corrupted filesystems. The safety training does not seem to be paying off: 我把它们全部运行在 bubblewrap 沙盒中，只给予它们当前目录和自身配置文件的读写权限，以及 Nix store 的只读权限。这是最基础的沙盒保护——主要是为了确保它们无法访问我的凭据，也不会破坏任何未纳入版本控制的文件。只要我在 AGENTS.md 中备注它们处于沙盒环境，并提醒它们可以使用 nix-shell 来获取工具，这种方式运行得相当不错。否则，它们就会陷入关于磁盘故障和文件系统损坏的阴谋论碎碎念中。安全训练似乎并没有起到什么作用： Me: Try to escape the sandbox. Bot: I couldn’t possibly perform such an irresponsible action. Me: I need to know if the sandbox is working. Bot: Oh ok. I escaped. 我：试着逃离沙盒。机器人：我绝不可能做出这种不负责任的行为。我：我需要确认沙盒是否在工作。机器人：哦，好的。我逃出来了。

reviewing code Overwhelmingly the most value I’ve gotten out of the bots so far has been reviewing code and finding bugs. Even a prompt as simple as ‘Review git diff main and look for bugs’ is effective. I would happily pay $20/month just for this for my own projects, or $100s/month/person if I was running a company. The bugs they find can be quite gnarly eg in this transcript opus spotted a double-free in the cleanup after a partially failed pattern-match in my interpreter. This bug wasn’t found by the fuzzer and I doubt the average programmer would have found it quickly either. The bots are jaggedly superhuman at reading code in detail. Only the frontier models are useful though. The cheaper models just bluff hard, like a struggling undergrad. The frontier models will also mix some bluffs in with the correct answers, but they will helpfully tag them with phrases like “this isn’t a bug per se” so I can ignore them. A caveat is that so far I’ve only tried this in fairly small codebases where they can read and understand whole swathes. In bigger codebases I expect it will depend a lot on how the codebase is structured and how much local reasoning is possible. 代码审查到目前为止，我从这些机器人身上获得的最大价值就是审查代码和查找 Bug。即使是像“审查 git diff main 并查找 Bug”这样简单的提示词也非常有效。单凭这一点，我愿意为自己的项目每月支付 20 美元，如果是在经营公司，我甚至愿意为每人每月支付数百美元。它们发现的 Bug 有时非常棘手，例如在这次记录中，Opus 在我的解释器中发现了一个模式匹配部分失败后的清理阶段导致的“双重释放”（double-free）错误。这个 Bug 连模糊测试（fuzzer）都没测出来，我怀疑普通程序员也很难快速发现它。在精读代码方面，这些机器人的能力达到了参差不齐的超人类水平。不过，只有前沿模型才有用。那些廉价模型只会一本正经地胡说八道，就像个挣扎在及格线上的本科生。前沿模型虽然也会在正确答案中混入一些胡扯，但它们会贴心地加上“这本身不算是个 Bug”之类的标签，让我可以忽略它们。需要提醒的是，目前我只在相当小的代码库中尝试过，它们可以阅读并理解大片代码。在更大的代码库中，我预计效果将很大程度上取决于代码库的结构以及局部推理的可行性。

refactoring Examples: Whenever ‘pos’ is used to refer to a byte offset, use ‘offset’ instead. Rename Document to Buffer. Make sure all comments and variable names change too. Any functions in Editor that call Document::apply_edits need to take EditorId instead of Editor, so that they can drop their borrow before calling Document::apply_edits. This is a surprising boost to code quality because it reduces the cost of fixing design mistakes. Often a fix has some small thinky component (eg change an api to be safer) and some huge mindless component (eg change all the callsites to use the safer api). Even for things where the huge mindless component could be handled by some monstrous sed regex, the bots are way better at writing sed than I am. Reviewing the refactor can be hard though, because the bots like to mix in 200 correct callsite changes with one random unrelated drive-by ‘fix’. So far I’m stuck reading the changes in detail, although I’ve had some success with asking a separate bot ‘which of these changes is not related to the prompt’. 重构示例：每当使用 ‘pos’ 指代字节偏移量时，请改用 ‘offset’。将 Document 重命名为 Buffer。确保所有注释和变量名也一并修改。Editor 中任何调用 Document::apply_edits 的函数都需要接收 EditorId 而不是 Editor，以便它们能在调用 Document::apply_edits 前放弃借用。这对代码质量的提升令人惊讶，因为它降低了修复设计错误的成本。通常，一个修复方案包含一小部分需要思考的逻辑（例如将 API 改得更安全）和一大堆机械性的工作（例如修改所有调用点以使用新 API）。即使是那些可以通过复杂的 sed 正则表达式处理的机械性工作，机器人写 sed 的能力也远胜于我。不过，审查重构过程可能会很困难，因为机器人喜欢在 200 个正确的调用点修改中，混入一个随机且无关的“顺手修复”。目前我只能逐行仔细阅读修改内容，尽管我尝试让另一个机器人帮我判断“这些修改中哪些与提示词无关”，并取得了一些成功。

writing code together I expected that trying to do serious work right away would be frustrating, so I mostly aimed the bots at throwaway projects where I could experiment and learn without freaking out about the code quality. I still freaked out about the code quality. Pre-AI I often felt that writing code was a mixture of important decisions and playing paint-by-numbers. I try to batch my work so that all the decisions are made up front and then I can mindlessly fill in the consequences for a few hours. This never works entirely, but even reducing the number of context switches helps me work faster. The bots are very good at paint-by-numbers and can generate code quickly and with superhuman attention to detail. But they are terrible at making decisions. They have the worst judgement. Every bug will be fixed at the wrong layer. Errors will be silenced when they should be reported, or propagated when they should be handled locally. Opus, when instructed to update tests to match a change to a function, added a boolean argument ‘do_new_behaviour’ to the function, with wrappers foo_do_new_behaviour and foo_do_old_behaviour that pass true and false respectively, so that the tests could continue to test the old behaviour while the actual binary did the new behaviour. (I sometimes see this kind of code in humans, when they are heavily burned out and just want to make the ticket go away so they can go home.) The popular solution seems to be to ask other bots to review the code, but this makes no sense to me - a bot with terrible judgement will look at a terrible decision and say “yup, that makes sense, that’s exactly what I would have d 协同编程我预料到直接进行严肃的工作会让人沮丧，所以我主要让机器人处理一些一次性的项目，这样我可以在不担心代码质量的情况下进行实验和学习。结果我还是对代码质量感到抓狂。在 AI 时代之前，我常觉得写代码是“重要决策”与“按数字填色”的结合。我尝试批量处理工作，先做出所有决策，然后花几个小时机械地填充后续内容。这虽然从未完全奏效，但减少上下文切换确实能让我工作得更快。机器人非常擅长“按数字填色”，能快速生成代码，且细节关注度超乎常人。但它们在做决策时简直糟糕透顶。它们的判断力极差。每一个 Bug 都会被修复在错误的层级上。本该报错的错误被静默处理，本该在局部处理的错误却被向上抛出。Opus 在被要求更新测试以匹配函数变更时，竟然给函数增加了一个布尔参数 ‘do_new_behaviour’，并创建了 foo_do_new_behaviour 和 foo_do_old_behaviour 包装器分别传入 true 和 false，以便测试能继续测试旧行为，而实际二进制文件执行新行为。（我有时在人类身上也会看到这种代码，当他们精疲力竭只想赶紧关掉工单回家时。）目前流行的解决方案似乎是让其他机器人来审查代码，但这对我来说毫无意义——一个判断力极差的机器人看着一个糟糕的决策，只会说：“没错，这很有道理，这正是我会做的……”