One EXE. No Python. No Docker. 120 Windows automation tools written in Go.
I built a Windows computer-use MCP server in pure Go. One EXE. No Python. No Docker. It’s a single 27 MB executable that gives LLMs and MCP clients (Claude, Gemini, Cursor, Kiro, OpenCode, Ollama, you name it) the ability to actually use a Windows desktop. Think of it as giving an LLM a mouse, keyboard, eyes, and long-term memory. (Screenshots and demo clips are coming soon; I wanted to get the post up first.)
It is over 14,000 lines of Go, with zero dependencies on OpenCV, go-ole, or any COM binding library. Almost every subsystem was implemented from scratch in pure Go instead of wrapping existing libraries. The fun part: I’m not a professional software engineer. I built this by pair-programming with multiple AI models across hundreds of iterations, and I didn’t spend a cent on API tokens. The lineup: Gemini CLI (free tier), Claude Code (trial credits), Ollama (local, free), GitHub Copilot, and OpenCode’s Big Pickle (200K context, free). The “ollama launch claude” trick was a workhorse: point Claude Code at a local Nemotron or MiniMax3 model and get agentic scaffolding on budget hardware.
Why I built it: This started as “I want my AI to click a button.” It somehow turned into a Windows automation framework with vision, memory, and a training pipeline. But the real reason? A friend who’s disabled and uses Narrator as their primary computer interface. They asked for a month to test once I was ready to go public. After countless trials and Python’s “works on my machine” nonsense, I wanted something that actually ships.
What it does: 120 MCP tools covering the full desktop stack: mouse, keyboard, and screenshot control; OCR (native WinRT COM, 2–8x faster than going through PowerShell) with ocr_languages; window management; browser automation; File Explorer control; find_image/find_all_images with a triple cascade (template matching → ONNX YOLO → OCR); middle mouse, horizontal scroll, and fullscreen detection; a SQLite memory store, so the AI remembers UI elements across sessions; a training pipeline, where every click saves a screenshot plus metadata for future model fine-tuning; an adaptive engine that learns timing and success rates and predicts next actions; an input recorder with click-vs-drag disambiguation whose recordings replay as native MCP tools. Battle-tested with: Claude Desktop, Claude Code, Kiro, Cursor, Windsurf, Gemini CLI, OpenCode, Ollama, Antigravity IDE, Cline, Android Studio, Zed, Obsidian, and more.
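To make the triple cascade concrete, here is a minimal Go sketch of the fallback idea: try each locator in order and take the first hit that clears a confidence threshold. Everything in it (the Match and Stage types, FindCascade, the threshold, and the stub locators) is hypothetical and only illustrates the pattern; it is not the project's actual code or API.

```go
package main

import (
	"errors"
	"fmt"
)

// Match is a hypothetical result type: where the target was found and how sure the stage is.
type Match struct {
	X, Y       int
	Confidence float64
	Stage      string // name of the stage that produced the hit
}

// Stage is one locator in the cascade. Find reports ok=false when it sees nothing.
type Stage struct {
	Name string
	Find func(target string) (m Match, ok bool)
}

// ErrNotFound is returned when every stage misses or falls below the threshold.
var ErrNotFound = errors.New("target not found by any stage")

// FindCascade tries the stages in order and returns the first hit whose
// confidence reaches minConf. Later stages run only if earlier ones fail.
func FindCascade(target string, minConf float64, stages []Stage) (Match, error) {
	for _, s := range stages {
		m, ok := s.Find(target)
		if !ok || m.Confidence < minConf {
			continue // miss or weak hit: fall through to the next stage
		}
		m.Stage = s.Name
		return m, nil
	}
	return Match{}, ErrNotFound
}

// demoStages returns stub locators standing in for the real matchers (assumed values).
func demoStages() []Stage {
	return []Stage{
		{"template", func(string) (Match, bool) { return Match{}, false }},
		{"yolo", func(string) (Match, bool) { return Match{X: 40, Y: 12, Confidence: 0.35}, true }},
		{"ocr", func(string) (Match, bool) { return Match{X: 42, Y: 10, Confidence: 0.91}, true }},
	}
}

func main() {
	m, err := FindCascade("Save", 0.6, demoStages())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("found by %s at (%d,%d), confidence %.2f\n", m.Stage, m.X, m.Y, m.Confidence)
}
```

With these stubs the template stage misses, the YOLO stage is too weak, and the OCR stage wins. Ordering the stages from cheapest to most expensive keeps the common case fast.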
Under the hood (briefly): Raw COM/WinRT vtable dispatch: no go-ole, no CGO, no C++/WinRT. All 36 COM call sites go through syscall.SyscallN(vtblMethod(obj, N), …), with vtable indices hand-verified against the Windows SDK headers. Hand-written NCC template matcher: a brute-force O(n⁴) Pearson correlation in pure Go, no OpenCV. When template matching misses, the search cascades to ONNX YOLO and then to WinRT OCR. SQLite Bayesian priors: per-class frequency distributions and spatial z-scores are computed entirely in SQL, and ONNX detection confidence is adjusted based on where elements usually appear in each window. No ML framework needed.
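For readers who have not seen NCC template matching without OpenCV, here is a minimal sketch of the brute-force approach in pure Go: slide the template over every position and compute the Pearson correlation between the template and the window under it. The Gray type, MatchNCC, and the toy image are assumptions made for illustration; the project's real matcher will differ in types, color handling, and optimizations.

```go
package main

import (
	"fmt"
	"math"
)

// Gray is a hypothetical grayscale image: row-major intensities, W*H values.
type Gray struct {
	W, H int
	Pix  []float64
}

func (g Gray) At(x, y int) float64 { return g.Pix[y*g.W+x] }

// MatchNCC slides tpl over img and returns the top-left corner of the window
// with the highest Pearson correlation (zero-mean NCC, range -1..1).
// ok is false if the template is empty or flat, or no window had any contrast.
func MatchNCC(img, tpl Gray) (bestX, bestY int, best float64, ok bool) {
	n := float64(tpl.W * tpl.H)
	if n == 0 {
		return 0, 0, 0, false
	}
	var tMean, tVar float64
	for _, v := range tpl.Pix {
		tMean += v
	}
	tMean /= n
	for _, v := range tpl.Pix {
		tVar += (v - tMean) * (v - tMean)
	}
	if tVar == 0 {
		return 0, 0, 0, false // correlation is undefined for a flat template
	}
	best = -2 // lower than any real correlation
	for y := 0; y+tpl.H <= img.H; y++ {
		for x := 0; x+tpl.W <= img.W; x++ {
			var wMean float64
			for j := 0; j < tpl.H; j++ {
				for i := 0; i < tpl.W; i++ {
					wMean += img.At(x+i, y+j)
				}
			}
			wMean /= n
			var num, wVar float64
			for j := 0; j < tpl.H; j++ {
				for i := 0; i < tpl.W; i++ {
					dw := img.At(x+i, y+j) - wMean
					num += dw * (tpl.At(i, j) - tMean)
					wVar += dw * dw
				}
			}
			if wVar == 0 {
				continue // flat window: nothing to correlate against
			}
			if score := num / math.Sqrt(wVar*tVar); score > best {
				bestX, bestY, best, ok = x, y, score, true
			}
		}
	}
	if !ok {
		best = 0
	}
	return
}

// demoImage builds a blank 6x5 image with a 2x2 checker pattern pasted at (3,2).
func demoImage() (img, tpl Gray) {
	tpl = Gray{W: 2, H: 2, Pix: []float64{10, 200, 200, 10}}
	img = Gray{W: 6, H: 5, Pix: make([]float64, 6*5)}
	for j := 0; j < tpl.H; j++ {
		for i := 0; i < tpl.W; i++ {
			img.Pix[(2+j)*img.W+(3+i)] = tpl.At(i, j)
		}
	}
	return
}

func main() {
	img, tpl := demoImage()
	x, y, score, ok := MatchNCC(img, tpl)
	fmt.Println(x, y, score, ok)
}
```

The four nested loops (two over image positions, two over template pixels) are where the O(n⁴) cost comes from. Subtracting the means makes the score insensitive to uniform brightness changes, which helps when the same control is rendered slightly lighter or darker.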
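The spatial z-score idea can be shown with a little arithmetic. The project computes its statistics in SQL; the Go sketch below only illustrates the math, using the kind of aggregates a query could return (COUNT, SUM(v), SUM(v*v)). The Stats type, the function names, and especially the penalty rule in AdjustConfidence are assumptions for illustration, not the project's actual formula.

```go
package main

import (
	"fmt"
	"math"
)

// Stats holds the aggregates a SQL query could return for one element class
// in one window: COUNT(*), SUM(v), SUM(v*v). Hypothetical type.
type Stats struct {
	N, Sum, SumSq float64
}

// Add records one remembered coordinate.
func (s *Stats) Add(v float64) {
	s.N++
	s.Sum += v
	s.SumSq += v * v
}

// ZScore returns how many standard deviations v lies from the historical mean.
// It returns 0 when there is too little history, or no spread, to judge.
func (s Stats) ZScore(v float64) float64 {
	if s.N < 2 {
		return 0
	}
	mean := s.Sum / s.N
	variance := s.SumSq/s.N - mean*mean // population variance
	if variance <= 0 {
		return 0
	}
	return (v - mean) / math.Sqrt(variance)
}

// AdjustConfidence scales a detector confidence by spatial plausibility.
// Assumed rule (not from the project): no penalty at the usual position,
// rising linearly to a 50% penalty at 3 standard deviations and beyond.
func AdjustConfidence(conf, zx, zy float64) float64 {
	dist := math.Hypot(zx, zy)
	penalty := 0.5 * math.Min(dist/3, 1)
	return conf * (1 - penalty)
}

// demoHistory builds stats from three remembered positions of a hypothetical "Save" button.
func demoHistory() (xs, ys Stats) {
	for _, p := range [][2]float64{{100, 40}, {110, 42}, {90, 38}} {
		xs.Add(p[0])
		ys.Add(p[1])
	}
	return
}

func main() {
	xs, ys := demoHistory()
	near := AdjustConfidence(0.8, xs.ZScore(100), ys.ZScore(40))
	far := AdjustConfidence(0.8, xs.ZScore(500), ys.ZScore(300))
	fmt.Printf("usual spot: %.2f, unusual spot: %.2f\n", near, far)
}
```

A detection at the element's usual position keeps its confidence, while one far from where that class normally appears is discounted. Because only counts and sums are needed, the same quantities map directly onto SQL aggregate functions.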
Why should you care? If you’re building AI agents, this gives them hands. If you’re building desktop automation, this gives you 120 reusable tools. If you’re learning Windows internals, it shows how raw WinRT COM, OCR, ONNX, and UI automation fit together in a real project. If you’re just curious how far one person can get with modern AI tools, this is my answer.
Security: Yes, it can control your computer. So can AutoHotkey, PowerShell, Selenium, and every RPA tool. It is local-first, every action can be logged, and privacy controls can be toggled at runtime. Not spyware. Not a remote admin tool. Just an automation engine for your own machine.
Things I’m weirdly proud of: Pure Go, raw COM/WinRT, zero CGO — no bindings, no wrappers; Hand-written NCC — no OpenCV dependency; SQLite Bayesian priors — no ML framework for the learning layer; One 27 MB EXE — no Python, Docker, or Electron; 120 MCP tools — every OS automation primitive you’d want; 36 COM vtbl call sites, 51 WinRT IIDs — all annotated, tested, cross-referenced; Built entirely on free AI tokens — you don’t need a budget to build something real; My GTX 1070 8GB handles YOLO inference fine — you don’t need a $3,000 GPU either.
I’d love feedback: Especially from people into Go, MCP, local AI, computer vision, automation, or Windows internals. Open issues, suggest features, steal patterns — the repo has templates and a security policy now. Tell me what I broke, what to build next, or that I’m insane for hand-writing COM vtables in Go.
AI didn’t build this project. AI became my pair programmer. The architecture, direction, debugging, testing, and endless “why doesn’t Windows do what the docs say?” moments were still mine. The experience convinced me that one curious computer technician, armed with persistence and today’s AI tools, can build things I wouldn’t have been able to create on my own just a few years ago.
Links:
GitHub: https://github.com/coff33ninja/go-mcp-computer-use
Docs: docs/reference/tools.md, docs/security.md, docs/mcp-client-configs.md