ClickHouse is winning the Observability Wars

For roughly the last ten years, a meaningful percentage of my working hours has been spent thinking about observability. If you’re not familiar with the term, “observability” is what we call it now that “monitoring” doesn’t sound expensive enough. The actual work is unglamorous: you collect a lot of logs, some metrics, a few traces, and then you give them to people.

I generally like my job. I like that we’re always trying new ideas and approaches. I like the fact that when things go wrong, the answer is almost always sitting there in the data, waiting to be found by whoever is patient enough to look. But I want to be honest with you: in ten years of doing this work, across a half-dozen companies and every observability platform you’ve heard of and a few you probably haven’t, logs have never stopped being the worst part of the job. They were the worst part when I started. They are the worst part today. I fully expect them to be the worst part of this job forever until the robots rise up and rip my head off in one clean sweep.

I’ve written about why logs are terrible before, so I’ll spare you the full lecture and give you the short version. Every developer’s expectations for logs are set by a single formative experience: the syslog box. Or a container running locally. Or tail -f on a production server they probably shouldn’t have SSH’d into. The point is that at some early, tender moment in their career, they had an experience with logs that was flawless. They ran grep and something useful came back. They piped it into jq and got exactly what they needed. This experience is the observability equivalent of a first kiss. It ruins them for everything that comes after.

Because here is the thing about that flawless experience: it works because the system is small, the volume is trivial, and the person querying is the same person who wrote the log line. There is no schema drift, no cardinality explosion, no cross-team consumer with dashboard expectations, no VP asking why the “revenue events” graph has a gap in it. Then there are forty services. Now there are four hundred. Now the logs are being consumed not just by developers but by customer service, who need to look up a specific user’s failed checkout from Tuesday. And by the data team, who are quietly building a business-critical dashboard on top of a log line that a backend engineer is about to refactor without telling anyone. And by the on-call, who at 3 AM does not want to learn a new query language, does not want to think about index patterns, and would like the search bar to just work.

So you have a technical problem — the volume is enormous, the shape is inconsistent, the queries are unpredictable — sitting on top of an expectations problem, which is worse. Developers want logs instantly, they want to run arbitrary operations on them, and they will not commit to a schema. Meanwhile the less-technical consumers of that same data want the dashboards to be stable forever, the UI to be forgiving, and the whole thing to feel like a normal product. These two audiences are, in most practical respects, at war with each other, and you are the diplomat.

ClickHouse

ClickHouse came out of Yandex, where it was built to chew through analytical queries against absurd volumes of clickstream data. It was not designed for observability. It just happens to be shockingly good at it, because clickstream data and observability data have a lot in common: high volume, append-heavy, time-ordered, mostly read in aggregate, and every so often you need to reach in and find one specific needle.

You can run it yourself with Helm charts. You can point Grafana at it via the ClickHouse plugin, or use their own web UI, or bring your own frontend. Their docs are actually good, which I mention because it’s rare enough to be worth flagging. I’ve never used their ClickStack setup though, so YMMV. For observability specifically, the OpenTelemetry Collector has a ClickHouse exporter, which means you can pipe OTLP data straight in and let it manage the initial schema for you. ClickHouse is designed to scan billions of rows and ingest an amount of data that, when you first see the numbers, makes you assume they’re lying. They’re not lying. You query it with SQL, which is a language that already exists and was not created by a startup two weeks ago.

But why ClickHouse specifically for logs?

I’ve been ranting about logs, and now I’m telling you how much I like administering ClickHouse, so let me take a second to explain why ClickHouse is really good at logs at scale. Logs, as a data shape, have some peculiar properties. They’re append-only. You never update a log line, and you almost never delete a single one, though you delete a lot of them at once when retention kicks in. They arrive roughly in time order, though never perfectly in order. They’re read in bursts: nobody looks at logs for days, and then during an incident somebody wants to scan a billion of them in seconds. They’re highly compressible, because most of the bytes in your logs are repeated: the same service names, the same hostnames, the same error strings, the same JSON keys, over and over and over again.
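To put a rough number on the compressibility claim, here is a minimal Python sketch using the standard library's zlib. The service names, hostnames, and messages are invented for illustration, and zlib is only a stand-in: this says nothing about which codecs a given database actually uses. It compresses a batch of synthetic JSON log lines as a whole, then compresses a single repeated field on its own.

```python
import json
import zlib

# Hypothetical log values, invented for illustration only.
SERVICES = ["checkout", "auth", "search"]
HOSTS = ["web-01", "web-02"]
MESSAGES = ["request completed", "upstream timeout", "cache miss"]


def make_records(n):
    """Build n synthetic log records with heavily repeated values."""
    return [
        {
            "timestamp": 1_700_000_000 + i,
            "service": SERVICES[i % len(SERVICES)],
            "host": HOSTS[i % len(HOSTS)],
            "message": MESSAGES[i % len(MESSAGES)],
        }
        for i in range(n)
    ]


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


records = make_records(10_000)

# Whole log lines, one JSON object per line, compressed together.
row_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")
row_ratio = compression_ratio(row_bytes)

# Only the "service" values, compressed on their own.
service_bytes = "\n".join(r["service"] for r in records).encode("utf-8")
service_ratio = compression_ratio(service_bytes)

print(f"whole lines:  {row_ratio:.1f}x")
print(f"service only: {service_ratio:.1f}x")
```

Real logs are messier than this synthetic batch, so treat the ratios as an upper-end illustration rather than a benchmark. The point is only the direction: repeated keys and values shrink a lot, and a field made of a handful of distinct values shrinks even more when it is compressed by itself.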

And critically, when you query them, you almost always want either a narrow time range across all fields or an aggregation across a wide time range with a few filters. You very rarely want “give me one specific row by ID” the way you would from a transactional database. (There are exceptions when it’s something like GDPR or compliance logging, which is its own subgenre of nightmares.) In a row-oriented database such as Postgres or MySQL (and, loosely, a document store such as Elasticsearch when it fetches the original document), the data for a single log line is stored together on disk. If your log has 40 fields and your query only cares about 3 of them, tough luck, you’re reading all 40 from disk anyway. The database will filter it in memory, but the disk I/O has already happened.
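The 40-fields-versus-3 arithmetic can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the field names are hypothetical, values are uncompressed strings, and the model ignores pages, indexes, and caching, all of which matter in a real engine. It only compares how many bytes a full scan touches when each row is stored as one unit versus when each field is stored separately.

```python
# Hypothetical table shape: 40 fields per log line, 3 needed by the query.
NUM_ROWS = 1_000
FIELDS = [f"field_{i:02d}" for i in range(40)]
QUERY_FIELDS = ["field_00", "field_01", "field_02"]

# Synthetic rows; every value in a given row has the same length.
rows = [{name: f"{name}-value-{r}" for name in FIELDS} for r in range(NUM_ROWS)]


def row_layout_bytes_read(rows):
    """Row layout: a row is one contiguous record, so a scan reads every field."""
    return sum(len(value.encode("utf-8")) for row in rows for value in row.values())


def column_layout_bytes_read(rows, wanted):
    """Column layout: fields are stored separately, so a scan reads only the wanted ones."""
    return sum(len(row[name].encode("utf-8")) for row in rows for name in wanted)


row_read = row_layout_bytes_read(rows)
col_read = column_layout_bytes_read(rows, QUERY_FIELDS)

print(f"row layout:    {row_read} bytes")
print(f"column layout: {col_read} bytes")
print(f"fraction read: {col_read / row_read:.3f}")
```

Because every synthetic value in a row has the same length, the column layout reads exactly 3/40 of the bytes in this model. Real tables have uneven field sizes, so the saving depends on which fields a query touches; a query that needs the one huge message field saves far less than one that needs three small ones.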

ClickHouse stores each column separately. If your query says SELECT service, status_code, count() FROM logs WHERE timestamp…
