Five overlooked packages running my AI directory stack
The interesting parts of a project are not always the AI model or the hosting platform. This week I spent time reading source code for five dependencies that sit quietly in my package.json files. None of them are trending. All of them are load-bearing. My stack is Astro 5 SSG + Turso libSQL + GitHub Actions cron + Claude Haiku 4.5. Three sites: Top AI Tools, Find Games Like, Open Alternative To. Seven weeks in, still under 400 total pageviews, but the infrastructure is solid enough that I can focus on content rather than firefighting.
tsx — TypeScript without the build ceremony
tsx by Hiroki Osame is how I run every ETL script in the monorepo. The command tsx src/etl/run.ts just works: no tsconfig fiddling, no ts-node --esm flags, no separate compile step. Under the hood it uses esbuild, so startup is fast enough to be negligible next to a five-second cron warm-up. What surprised me when I read the repo: tsx strips types with esbuild rather than running the TypeScript compiler, so it doesn’t type-check. That’s intentional. For ETL scripts where I want pnpm typecheck to catch structural errors at CI time without slowing down the hot path, this is exactly the right tradeoff. The README calls this out clearly. I wish I’d read it three weeks ago instead of assuming tsx did full type checking.
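Because tsx only erases annotations, a cast in an ETL script is a promise to the compiler and nothing more at runtime. Here is a minimal sketch of what that means in practice. The ToolRow shape, parseUnchecked, and isToolRow are hypothetical names made up for illustration, not code from the real scripts:

```typescript
// Hypothetical row shape for illustration; not the real ETL schema.
interface ToolRow {
  slug: string;
  stars: number;
}

// The `as ToolRow` cast is erased along with every other annotation,
// so nothing checks the payload when the script actually runs.
function parseUnchecked(json: string): ToolRow {
  return JSON.parse(json) as ToolRow;
}

// A runtime guard covers what type stripping cannot: data that only exists at run time.
function isToolRow(value: unknown): value is ToolRow {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return typeof row.slug === "string" && typeof row.stars === "number";
}

// In this payload, stars is a string by mistake.
const payload = '{"slug":"pagefind","stars":"4200"}';
const unchecked = parseUnchecked(payload);

// The annotation says number, but the runtime value is whatever the JSON held.
console.log(typeof unchecked.stars);
// The guard rejects the same payload.
console.log(isToolRow(JSON.parse(payload)));
```

Static mistakes are left to pnpm typecheck in CI; a guard like this covers the payloads the compiler never sees.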
Pagefind — static full-text search with no server
Pagefind runs as my postbuild step: pagefind --site dist --output-subdir _pagefind. It crawls the built HTML and writes a compressed, chunked index alongside a small WASM search runtime, and the client-side JS loads only the chunks it needs per query. The result is search that works on a static Vercel or Cloudflare Pages deploy with zero additional infrastructure. I read through the index format docs this week. The index chunks are stored as gzip-compressed binary blobs, and the JS client fetches them lazily based on the query terms. For three sites each under 2,000 pages, the index stays under 500 KB total. The Pagefind UI component is optional: I replaced it with a plain <input> that calls the JS API directly so I could control the result rendering in Astro components.
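Calling the JS API directly can be sketched roughly like this. The interfaces are a hand-written structural subset of what the Pagefind module exposes (search() returning result handles, each with an async data() loader), and searchTop, SearchHit, and the limit of 5 are hypothetical choices for illustration, not the site's actual component code:

```typescript
// Hand-written subset of the Pagefind JS API surface used below.
interface PagefindResultData {
  url: string;
  excerpt: string;
  meta: Record<string, string | undefined>;
}
interface PagefindResult {
  id: string;
  data(): Promise<PagefindResultData>;
}
interface PagefindModule {
  search(query: string): Promise<{ results: PagefindResult[] }>;
}

// Hypothetical shape handed to the rendering component.
interface SearchHit {
  url: string;
  title: string;
  excerpt: string;
}

// search() returns lightweight handles; each data() call loads that page's content,
// so only the first `limit` results are fetched.
async function searchTop(
  pagefind: PagefindModule,
  query: string,
  limit = 5,
): Promise<SearchHit[]> {
  const trimmed = query.trim();
  if (trimmed === "") return [];
  const { results } = await pagefind.search(trimmed);
  const pages = await Promise.all(results.slice(0, limit).map((r) => r.data()));
  return pages.map((page) => ({
    url: page.url,
    title: page.meta.title ?? page.url, // fall back to the URL when no title was indexed
    excerpt: page.excerpt,
  }));
}

// In the browser the module comes from the build output, for example:
//   const pagefind = await import("/_pagefind/pagefind.js");
//   const hits = await searchTop(pagefind, input.value);
```

Taking the module as a parameter keeps the function easy to exercise with a stub, and the returned hits are plain data that any markup can render.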
Crawlee — TypeScript scraping with built-in queue management
I haven’t shipped Crawlee yet, but it’s been on my bookmarks list since I started building the itch.io ETL. My current approach is fetch + manual parsing, which works for known endpoints. Crawlee adds request queue persistence, rate limiting, and a cheerio integration for HTML extraction, all in TypeScript with native ESM support. The reason I haven’t switched: my ETL runs inside GitHub Actions where I want simple, auditable scripts over a full crawl framework. But if I start scraping product pages from sites that don’t have APIs — which is the next natural expansion for the OSS alternatives directory — Crawlee is the tool I’d reach for. The Apify team maintains it actively and the TypeScript types are genuinely good.
eemeli/yaml — small footprint, strict spec compliance
The yaml package by Eemeli Aro parses the frontmatter in my article files before cross-posting to Dev.to and Hashnode. It’s 35 KB minified, has zero dependencies, and handles multi-line strings and nested objects without surprises. I switched from js-yaml six weeks ago because eemeli/yaml has better ESM exports and its parse errors are more actionable when frontmatter has a typo. One thing I didn’t know until this week: the yaml package can also write a file back out as YAML with its comments preserved. That works through parseDocument and the Document API; plain parse and stringify go through ordinary JavaScript values, which carry no comments. I don’t use that feature yet, but it matters for a workflow where I want to programmatically update article frontmatter without clobbering the human-readable structure. That’s on the roadmap for automating canonical_url injection after Dev.to publish.
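Before any YAML parser sees the frontmatter, it has to be separated from the article body. This is a minimal, dependency-free sketch of that step, assuming the usual layout of a --- fence, the YAML, a closing --- fence, then Markdown. The splitFrontmatter name and the sample article are hypothetical, not the actual cross-posting code:

```typescript
interface SplitArticle {
  yaml: string; // raw frontmatter text, ready to hand to a YAML parser
  body: string; // everything after the closing fence
}

// Splits "---\n<yaml>\n---\n<body>" into its two parts.
// Returns null when the file does not start with a frontmatter block.
function splitFrontmatter(source: string): SplitArticle | null {
  const match = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)([\s\S]*)$/.exec(source);
  if (match === null) return null;
  return { yaml: match[1], body: match[2] };
}

// Hypothetical article used only to exercise the function.
const article = [
  "---",
  "title: Five overlooked packages",
  "# set after the Dev.to publish step",
  "canonical_url: null",
  "---",
  "Body text starts here.",
].join("\n");

const parts = splitFrontmatter(article);
console.log(parts?.yaml);
```

The yaml string can go to parse from the yaml package when a read-only object is enough. For the planned canonical_url update, parseDocument on the same string returns a Document, and its set and toString methods should leave the comment line in place.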
@libsql/client — batched writes are the underrated feature
The @libsql/client TypeScript client is what connects my ETL scripts to Turso. I wrote about Turso vs Cloudflare D1 earlier this week, but I didn’t cover the batch API, which is the feature I actually rely on most. A single db.batch([...]) call wraps multiple INSERT OR REPLACE statements in one network round trip, which matters when seeding a 500-row table from a GitHub Actions runner. The client supports both remote Turso connections and an embedded file: mode that runs libSQL in-process with no network. I use the in-process mode for local ETL development so I don’t burn Turso API quota while iterating on the seed logic. Switching between modes is one environment variable. That’s the kind of DX detail that makes a dependency feel considered rather than assembled.
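The batched seed and the mode switch can be sketched as follows. To keep the block free of imports, BatchClient is a hand-written structural subset of the client's batch method; the tools table, the ToolRow shape, the chunk size, and the TURSO_DATABASE_URL and TURSO_AUTH_TOKEN variable names are all assumptions for illustration rather than the real schema or configuration:

```typescript
// Structural subset of the @libsql/client surface used below.
type SqlValue = string | number | null;
interface Statement {
  sql: string;
  args: SqlValue[];
}
interface BatchClient {
  batch(statements: Statement[], mode?: "write" | "read" | "deferred"): Promise<unknown>;
}

// Hypothetical row shape and table name.
interface ToolRow {
  slug: string;
  name: string;
  stars: number;
}

// One environment variable picks the mode: a remote URL for Turso,
// or the file: fallback that runs libSQL in-process with no network.
function clientConfig(
  env: Record<string, string | undefined>,
): { url: string; authToken?: string } {
  const url = env.TURSO_DATABASE_URL ?? "file:local.db";
  return url.startsWith("file:") ? { url } : { url, authToken: env.TURSO_AUTH_TOKEN };
}

// Sends rows in chunks; each batch() call is a single round trip.
// With the default chunk size, a 500-row seed goes out as one call.
async function seedTools(db: BatchClient, rows: ToolRow[], chunkSize = 500): Promise<number> {
  let batches = 0;
  for (let i = 0; i < rows.length; i += chunkSize) {
    const statements = rows.slice(i, i + chunkSize).map((row) => ({
      sql: "INSERT OR REPLACE INTO tools (slug, name, stars) VALUES (?, ?, ?)",
      args: [row.slug, row.name, row.stars],
    }));
    await db.batch(statements, "write");
    batches += 1;
  }
  return batches;
}

// With the real package the wiring would look like:
//   import { createClient } from "@libsql/client";
//   const db = createClient(clientConfig(process.env));
//   await seedTools(db, rows);
```

Chunking is optional at this table size. It is included only so a much larger seed would not go out as one oversized request.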
None of these packages announced anything dramatic this week. They’re just the boring infrastructure that lets the AI parts of the stack do their job. I’ll write up actual traffic and content metrics in 30 days when I have a month of data worth publishing. Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.