It Worked on My Machine (Literally)
I have a TRMNL on my desk. If you haven’t seen one, it’s a little e-ink display from trmnl.com that shows you whatever you tell it to: your calendar, the weather (but in haiku form), a Far Side comic, a random Studio Ghibli picture. The whole device runs on plugins, and the nice thing is you can write your own.
I’d been meaning to build a TRMNL plugin for a while, and I finally landed on an idea that was small enough to actually finish: show what I’m currently reading on StoryGraph. Just three things, really. My profile name, what I’m currently reading, and the next couple of books in my to-read pile. That’s it. A small project. I even said the words “basic, simple plugin” out loud, which in hindsight was me daring the universe.
The plan: TRMNL plugins can fetch their data a few different ways. The one that fit was polling: TRMNL pings a URL on a schedule, gets back some JSON, and renders it with a Liquid template. So I needed a small server that returns my reading data as JSON, plus the templates to lay it out on the screen.
The catch: StoryGraph doesn’t have a public API. No tidy endpoint to call. If I wanted the data, I’d have to scrape it off my public profile page. I found a reference project, storygraph-api, that does exactly this, and it gave me the lay of the land: the URLs to hit (/currently-reading/username, /to-read/username) and the HTML structure of a book on the page.
I wanted to keep this lightweight. Plain Ruby where I could, a real framework only if I needed one. For a service with two or three JSON routes, plain Ruby plus Rack is plenty. No Rails, no Hanami, just a Rack app and Nokogiri to parse the HTML. Easy.
The first wall: Before writing a line of application code, I did the one thing I always tell other people to do and tested the riskiest assumption first. Could I even fetch a StoryGraph page?
$ curl https://app.thestorygraph.com/profile/christine_s
HTTP 403
Hm. I added a browser User-Agent. Still 403. I added the full set of Chrome headers, the sec-ch-ua bits, a cookie jar, all of it. Still 403. Then I looked at the response headers and saw the actual story: cf-mitigated: challenge, server: cloudflare.
StoryGraph sits behind a Cloudflare managed challenge. My polite little curl request was getting waved off at the door before it ever reached their servers. And here’s the part that surprised me: it wasn’t about the headers at all. Cloudflare was fingerprinting the TLS handshake itself. Real browsers negotiate TLS in a particular, recognizable way (the cipher order, the extensions, the whole shape of the “hello”), and curl does it differently. You can spoof every header in the world and you’ll still look like a robot, because the giveaway happens one layer down, before any headers are sent.
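The two response headers that gave the game away can be checked in code. Here is a minimal sketch, assuming the status and headers are already in hand as a number and a hash; the helper name cloudflare_challenge? is made up for illustration and is not part of the original service.

```ruby
# Hypothetical helper: decide whether a response was stopped by a
# Cloudflare challenge rather than answered by the origin server.
def cloudflare_challenge?(status, headers)
  # Header names are case-insensitive, so normalise before looking them up.
  normalized = headers.to_h { |name, value| [name.to_s.downcase, value.to_s.downcase] }

  status.to_i == 403 &&
    normalized["server"] == "cloudflare" &&
    normalized["cf-mitigated"] == "challenge"
end
```

A check like this is mostly useful for error messages: it separates "Cloudflare turned me away" from "StoryGraph itself said no", which call for different fixes.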
The thing that actually worked: The fix turned out to be a tool I’d never had a reason to use before: curl-impersonate. It’s curl rebuilt to mimic a real browser’s TLS fingerprint exactly. Same ciphers, same curves, same handshake shape as Chrome.
$ curl_chrome136 -s -o /dev/null -w '%{http_code}' \
https://app.thestorygraph.com/currently-reading/elliek
200
Two hundred. The door opened. Watching that 403 flip to 200 was easily the most satisfying moment of the whole project. The challenge wasn’t checking who I claimed to be, it was checking how I spoke, and now I had the right accent.
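From Ruby, calling curl-impersonate comes down to shelling out to the binary. The sketch below is one way that could look, not the service's actual code: the class name, the error type, and the injectable runner are all assumptions, and only the curl_chrome136 binary name comes from the command above.

```ruby
require "open3"

# Hypothetical wrapper around a curl-impersonate binary.
class ImpersonatedFetcher
  FetchError = Class.new(StandardError)

  def initialize(binary: "curl_chrome136", runner: Open3.method(:capture3))
    @binary = binary
    @runner = runner # injectable so tests never touch the network
  end

  # Returns the response body, or raises FetchError on anything but a 200.
  def get(url)
    # -s silences the progress meter; -w appends the status code on its own
    # final line so the body and the status can be separated afterwards.
    # Arguments are passed as a list, so no shell is involved.
    stdout, stderr, process = @runner.call(@binary, "-s", "-w", "\n%{http_code}", url)
    raise FetchError, "curl failed: #{stderr.strip}" unless process.success?

    body, _separator, status = stdout.rpartition("\n")
    raise FetchError, "HTTP #{status} for #{url}" unless status == "200"

    body
  end
end
```

Passing the runner in makes the wrapper easy to exercise with a fake, which matters when the real binary only exists inside the Docker image.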
Building the actual thing: With the hard part de-risked, the rest came together quickly, which is how these things usually go once the scary unknown is gone. The service is a small Rack app. One real endpoint, /reads.json, that takes a username and a limit. It fetches two pages through curl-impersonate, hands the HTML to a Nokogiri scraper that pulls out each book’s title, author, and cover, and returns a clean little JSON payload. There’s a /health route and a tiny index page, and that’s the whole surface area.
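To give a sense of how little Rack asks for, here is a rough sketch of the routing. A Rack app is just an object that responds to call(env) and returns a status, headers, and body. The class name, the reader object standing in for the fetch-and-scrape step, the default limit, and the upper bound of 10 are all assumptions; the index page is left out.

```ruby
require "json"
require "uri"

# Hypothetical Rack app: any object with #call(env) that returns
# [status, headers, body] qualifies, so no framework is needed.
class ReadsApp
  JSON_HEADERS = { "content-type" => "application/json" }.freeze
  DEFAULT_LIMIT = 3 # assumed default, not taken from the original service

  # reader is a stand-in for fetch + scrape: it receives (username, limit)
  # and returns a Hash that can be serialised to JSON.
  def initialize(reader:)
    @reader = reader
  end

  def call(env)
    case env["PATH_INFO"]
    when "/health"
      json(200, { status: "ok" })
    when "/reads.json"
      reads(env)
    else
      json(404, { error: "not found" })
    end
  end

  private

  def reads(env)
    params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h
    username = params["username"].to_s
    # Stays 200 so the display can render the message (an assumption that
    # follows the fail-soft idea rather than a confirmed behaviour).
    return json(200, { error: "username is required" }) if username.empty?

    limit = Integer(params.fetch("limit", DEFAULT_LIMIT), exception: false) || DEFAULT_LIMIT
    json(200, @reader.call(username, limit.clamp(1, 10)))
  end

  def json(status, payload)
    [status, JSON_HEADERS, [JSON.generate(payload)]]
  end
end
```

In a config.ru this would be mounted with something like run ReadsApp.new(reader: ...), with the real fetcher and scraper behind the reader.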
A few decisions I’m happy with:
- Caching. Scraping is slow and I didn’t want to hammer StoryGraph every time TRMNL polls. An in-memory cache with a thirty-minute TTL means repeated polls cost nothing and I stay a good citizen.
- Failing soft. If a scrape fails, the endpoint still returns 200 with an error field instead of a 500. A blank e-ink screen tells you nothing. A screen that says “couldn’t load, is the profile public?” at least tells you where to look.
- Retries. StoryGraph occasionally drops a rapid second request, so the fetcher retries with a short backoff.
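Those three decisions fit together in a few dozen lines of plain Ruby. What follows is a sketch under assumptions rather than the service's real code: the names TtlCache, with_retries, and reads_payload are invented, the doubling delay is one possible reading of "short backoff", and the injectable clock and sleeper exist only to keep it testable.

```ruby
# Hypothetical in-memory cache with a time-to-live.
class TtlCache
  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
    @lock = Mutex.new
  end

  # Returns the fresh cached value for key, or runs the block and stores
  # its result. A block that raises stores nothing.
  def fetch(key)
    @lock.synchronize do
      entry = @entries[key]
      return entry[:value] if entry && @clock.call - entry[:stored_at] < @ttl
    end
    value = yield
    @lock.synchronize { @entries[key] = { value: value, stored_at: @clock.call } }
    value
  end
end

# Runs the block, retrying on StandardError with a doubling delay.
# Re-raises the last error once the attempts are used up.
def with_retries(attempts: 3, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Failing soft: never raises. A failed scrape becomes an error field that
# the template can show. The block is the actual scrape for one username.
def reads_payload(username, cache:, retry_options: {}, &scrape)
  cache.fetch(username) { with_retries(**retry_options) { scrape.call(username) } }
rescue StandardError => e
  { username: username, error: "couldn't load, is the profile public? (#{e.class})" }
end
```

One detail worth pointing out: because a failed scrape raises before anything is stored, errors are never cached, so the next poll gets a fresh attempt instead of thirty minutes of the same error screen.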
Then the templates. TRMNL supports four layout sizes (full, two halves, and a quadrant), and I wrote Liquid for each, with the empty and error states baked in so the display always has something sensible to show. I wrapped it all in a Docker image that installs the right curl-impersonate build for the architecture, and I had a passing test suite running against saved HTML fixtures so I wasn’t hitting the network on every run. It worked. Locally, it really worked.
The second wall (this one was my fault): I pointed the scraper at my own profile and got a redirect to a sign-in page. My books were nowhere. It took me an embarrassing minute to realize: my StoryGraph profile was private. Of course it was. Public profiles scrape fine; private ones bounce you to the login wall, exactly as they should. The fix was a single toggle in my StoryGraph settings, and suddenly there I was in JSON form: Eloquent Ruby, Effective Testing with RSpec 3, The Staff Engineer’s Path. Reader, my to-read pile is exactly as on-brand as you’d expect.
To see it on the actual device, I ran the container locally and pointed a cloudflared tunnel at it, which gave me a temporary public URL to paste into TRMNL. A minute later my little e-ink screen lit up with my current reads. I may have done a small chair dance.
The twist: The tunnel was never meant to be permanent (it runs off my laptop, and the URL changes every time it restarts), so the next step was deploying somewhere real. I built the Docker image for Fly.io, set my username, and shipped it. The health check was green.