Great Stack to Doesn't Work #2 — Kafka: "Where Did My Messages Go?"

Great Stack to Doesn’t Work #2 — Kafka: “Where Did My Messages Go?”

优秀技术栈为何失效 #2 — Kafka：“我的消息去哪了？”

A survival guide for when everything goes wrong in production. There’s a moment every engineer who works with Kafka experiences. You check the producer. Messages are sending. You check the consumer. Nothing. The consumer group shows zero lag because there’s nothing to lag behind — as far as the consumer knows, the topic is empty. But it’s not empty. The messages are there. Somewhere. In some partition, at some offset, behind some configuration you set six months ago and forgot about. Kafka doesn’t lose messages. But it’s very good at hiding them from you.

这是一份生产环境出现故障时的生存指南。每一位使用 Kafka 的工程师都会经历这样一个时刻：你检查生产者，消息正在发送；你检查消费者，却什么也没有。消费者组显示滞后（lag）为零，因为没有东西需要追赶——在消费者看来，这个主题是空的。但它并非真的为空。消息就在那里，在某个分区的某个偏移量（offset）上，被你六个月前设置并早已遗忘的某个配置挡住了。Kafka 不会丢失消息，但它非常擅长把消息藏起来。

Consumer Lag: The Number Everyone Watches Wrong

消费者滞后（Consumer Lag）：每个人都看错的指标

Consumer lag is the difference between the latest offset in a partition and the offset your consumer group has committed. Simple concept. Dangerous in practice. The mistake: treating lag as a single number. Lag is per-partition. If you have 30 partitions and one consumer is stuck on partition 17 while the others are healthy, the total lag looks manageable. But partition 17’s data is hours behind, and whatever downstream system depends on that data is serving stale results.

消费者滞后是指分区中最新的偏移量与消费者组已提交的偏移量之间的差值。概念很简单，但在实践中却很危险。常见的错误是：将滞后视为一个单一的数字。滞后是按分区计算的。如果你有 30 个分区，其中一个消费者卡在分区 17，而其他分区运行正常，那么总的滞后量看起来可能还在可控范围内。但分区 17 的数据实际上已经滞后了数小时，任何依赖该数据的下游系统都在提供过时的结果。

Monitor lag per partition. Tools like Burrow, Kafka Exporter for Prometheus, or even kafka-consumer-groups.sh --describe break it down. If one partition’s lag is growing while others are stable, you have a stuck consumer, a hot partition, or a poison message. A poison message is a record your consumer can’t process — malformed data, unexpected schema, null where it shouldn’t be null. The consumer throws an exception, the offset doesn’t commit, and it retries the same message forever. Lag grows. The consumer looks “alive” because it’s processing — just not making progress.

请务必监控每个分区的滞后情况。Burrow、Prometheus 的 Kafka Exporter，甚至 kafka-consumer-groups.sh --describe 等工具都能提供详细拆解。如果某个分区的滞后在增加而其他分区稳定，说明你遇到了消费者卡死、热点分区或“毒丸消息”（poison message）。毒丸消息是指消费者无法处理的记录——例如格式错误的数据、意外的模式（schema），或本不该为空的字段出现了 null。消费者抛出异常，偏移量无法提交，导致它不断重试同一条消息。滞后随之增加。消费者看起来是“存活”的，因为它确实在处理，只是没有任何进展。

The fix: dead letter queues. After N retries, move the message to a separate topic, commit the offset, and move on. Alert on the dead letter topic. Investigate later. Don’t let one bad record block millions of good ones.

解决方法是：使用死信队列（Dead Letter Queues）。在重试 N 次后，将消息移至单独的主题，提交偏移量，然后继续处理。对死信主题设置告警，稍后再进行调查。不要让一条坏记录阻塞数百万条好记录。

Rebalance Storms: The Silent Killer

再平衡风暴：无声的杀手

Consumer rebalancing is Kafka’s mechanism for redistributing partitions across consumers in a group. When a consumer joins or leaves, Kafka reassigns partitions. During rebalance, all consumers in the group stop processing. For a few seconds, nobody’s doing anything. This is fine. Unless it happens every 30 seconds. Rebalance storms happen when Kafka thinks a consumer is dead, removes it from the group, triggers a rebalance, then the consumer comes back, joins the group, triggers another rebalance, and the cycle repeats.

消费者再平衡（Rebalance）是 Kafka 在消费者组内重新分配分区的机制。当消费者加入或离开时，Kafka 会重新分配分区。在再平衡期间，组内所有消费者都会停止处理。几秒钟内什么都不会发生，这没问题。但如果每 30 秒发生一次，就有问题了。再平衡风暴发生在 Kafka 误认为某个消费者已死并将其从组中移除，触发再平衡；随后该消费者又恢复并重新加入组，再次触发再平衡，如此循环往复。

Three timeout settings control this:

session.timeout.ms: how long Kafka waits for a heartbeat before declaring the consumer dead. Default: 45 seconds.
heartbeat.interval.ms: how often the consumer sends heartbeats. Default: 3 seconds.
max.poll.interval.ms: how long between two poll() calls before Kafka kicks the consumer out. Default: 5 minutes.

有三个超时设置控制此过程：

session.timeout.ms：Kafka 在判定消费者死亡前等待心跳的时间。默认值：45 秒。
heartbeat.interval.ms：消费者发送心跳的频率。默认值：3 秒。
max.poll.interval.ms：两次 poll() 调用之间允许的最大间隔，超过此时间 Kafka 会将消费者踢出。默认值：5 分钟。

The most common cause of rebalance storms: max.poll.interval.ms is too short for your processing time. Your consumer polls 500 records, spends 6 minutes processing them, and by the time it polls again, Kafka has already declared it dead and rebalanced.

再平衡风暴最常见的原因是：max.poll.interval.ms 设置得比你的处理时间短。你的消费者拉取了 500 条记录，花了 6 分钟处理，等它再次拉取时，Kafka 已经判定它死亡并触发了再平衡。

Fixes: Increase max.poll.interval.ms to match your worst-case processing time. Decrease max.poll.records so each batch processes faster. Use static.group.instance.id — this enables static membership, which means Kafka won’t immediately rebalance when a consumer temporarily disconnects. It waits for session.timeout.ms to expire first. Use cooperative rebalancing (partition.assignment.strategy = CooperativeStickyAssignor) — instead of stopping all consumers during rebalance, it only reassigns the affected partitions.

解决方法：增加 max.poll.interval.ms 以匹配最坏情况下的处理时间。减少 max.poll.records 以加快每批次的处理速度。使用 static.group.instance.id——这能启用静态成员身份，意味着当消费者暂时断开连接时，Kafka 不会立即触发再平衡，而是会先等待 session.timeout.ms 过期。使用协作式再平衡（partition.assignment.strategy = CooperativeStickyAssignor）——它不会在再平衡期间停止所有消费者，而只会重新分配受影响的分区。

One team I worked with had a 12-consumer group processing payment events. Every few minutes, all processing stopped for 10-15 seconds during rebalance. Twelve times an hour. That’s 2 minutes of downtime every hour in a payment pipeline. The fix was adding static group instance IDs and switching to cooperative rebalancing. Total rebalance disruption dropped from 2 minutes per hour to near zero.

我曾与一个团队合作，他们有一个 12 个消费者的组在处理支付事件。每隔几分钟，所有处理就会因再平衡而停止 10-15 秒。每小时发生 12 次，这意味着支付流水线每小时有 2 分钟的停机时间。解决方法是添加静态组实例 ID 并切换到协作式再平衡。再平衡造成的总中断时间从每小时 2 分钟降至几乎为零。

Exactly-Once: The Myth and the Reality

精确一次（Exactly-Once）：神话与现实

Kafka advertises exactly-once semantics. Here’s what that actually means. Idempotent producer (enable.idempotence = true): Kafka deduplicates messages from the same producer session. If a network retry causes the producer to send the same message twice, the broker detects the duplicate and discards it. This prevents duplicates within a single producer session. If the producer restarts, it gets a new session, and deduplication doesn’t cross sessions.

Kafka 宣传其具备“精确一次”语义。其实际含义如下：幂等生产者（enable.idempotence = true）：Kafka 会对来自同一生产者会话的消息进行去重。如果网络重试导致生产者发送了两次相同的消息，Broker 会检测到重复并将其丢弃。这防止了单个生产者会话内的重复。如果生产者重启，它会获得一个新的会话，而去重机制不会跨越会话。

Transactional producer + consumer: For true exactly-once across produce-and-consume workflows, you need transactions.

producer.beginTransaction();
producer.send(outputTopic, processedRecord);
producer.sendOffsetsToTransaction(consumerOffsets, consumerGroupId);
producer.commitTransaction();

This atomically writes the output record AND commits the consumer offset. Either both happen or neither does. If the transaction fails, the consumer re-reads the input, reprocesses it, and tries again.

事务性生产者 + 消费者：要在生产和消费的工作流中实现真正的“精确一次”，你需要事务。

producer.beginTransaction();
producer.send(outputTopic, processedRecord);
producer.sendOffsetsToTransaction(consumerOffsets, consumerGroupId);
producer.commitTransaction();

这会原子性地写入输出记录并提交消费者偏移量。要么两者都成功，要么都不成功。如果事务失败，消费者会重新读取输入、重新处理并再次尝试。

The reality check: Exactly-once works within Kafka. The moment your consumer writes to an external database, you’re back to at-least-once unless you implement idempotency on the database side. Transactions add latency. Each transaction involves coordination between the producer, the transaction coordinator, and the brokers hosting the output partitions. Most systems don’t need exactly-once. If your consumer can handle duplicates (idempotent writes, upserts, deduplication at the application layer), at-least-once is simpler and faster. Don’t reach for exactly-once because it sounds correct. Reach for it when duplicate processing would cause real damage — financial transactions, inventory counts, billing events. For analytics, logging, and notifications, at-least-once with deduplication is the pragmatic choice.

现实情况是：精确一次仅在 Kafka 内部有效。一旦你的消费者写入外部数据库，除非你在数据库端实现幂等性，否则你又回到了“至少一次”（at-least-once）模式。事务会增加延迟。每个事务都涉及生产者、事务协调器以及托管输出分区的 Broker 之间的协调。大多数系统并不需要精确一次。如果你的消费者可以处理重复数据（幂等写入、Upsert、应用层去重），那么“至少一次”更简单、更快速。不要因为听起来很完美就盲目追求精确一次。只有在重复处理会造成实际损害时（如金融交易、库存统计、计费事件）才使用它。对于分析、日志和通知，采用“至少一次”加应用层去重是更务实的选择。

Partition Strategies: The Decision That Haunts You

分区策略：困扰你的决策

Once you choose a partition key, changing it later means reprocessing everything. Choose carefully. Key-based partitioning (default when you set a key): all messages with the…

一旦选择了分区键，后续更改意味着需要重新处理所有数据。请谨慎选择。基于键的分区（设置键时的默认行为）：所有具有相同…的消息。