Kubernetes In Anger

2026-05-21

NOTE: Any discussion can be had on Lobsters.

0. Quick start (emergency edition)

Is this the right guide?

YES, if:
- You’re debugging a live EKS production issue
- You need to upgrade or change EKS safely
- You want to prevent common EKS outages
- You’re on call for EKS workloads

NO, if:
- You’re learning Kubernetes basics (try the official tutorials first)
- You need EKS setup instructions (use the AWS documentation)
- You want a comprehensive Kubernetes reference (use kubernetes.io)

Emergency shortcuts

- Cluster is on fire right now? → Jump to Section 2.10, Tier-0 Incident Playbook
- Need to upgrade safely? → Jump to Section 8, Upgrades and maintenance
- Investigating an incident? → Start with Section 1.2, Quick Cluster Health Snapshot

Prerequisites

This guide assumes you know:
- Basic kubectl commands (get, describe, logs)
- AWS CLI basics
- What pods, services, and deployments are
- How to read YAML manifests

What makes EKS different

EKS is not “just Kubernetes”. Key differences that affect reliability:
- Pods get real VPC IPs (AWS VPC CNI)
- AWS services become dependencies (NAT, NLB, EBS)
- Node limits are AWS EC2 limits
- Networking failures look like application failures
- Upgrades affect multiple AWS components
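The point that node limits are EC2 limits can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the VPC CNI runs in its default secondary-IP mode (no prefix delegation, no custom networking). The function name `max_pods` and the `INSTANCE_LIMITS` table are illustrative and not from the guide; the per-instance figures are examples of AWS's published ENI limits and should be checked against current EC2 documentation before relying on them.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Approximate pod ceiling for a node in VPC CNI secondary-IP mode.

    The primary IP of each ENI is not handed out to pods, hence the -1.
    The +2 allows for host-network pods (such as aws-node and kube-proxy)
    that do not consume a VPC IP of their own.
    """
    return enis * (ips_per_eni - 1) + 2


# Illustrative (ENI count, IPv4 addresses per ENI) pairs; verify before use.
INSTANCE_LIMITS = {
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
}

for name, (enis, ips) in INSTANCE_LIMITS.items():
    print(f"{name}: up to {max_pods(enis, ips)} pods")
```

The takeaway is that the ceiling comes from the instance type's networking limits rather than from CPU or memory, so a node can refuse new pods while looking mostly idle.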

Introduction

On running infrastructure

There’s a common way of thinking about Kubernetes that goes something like this: you declare what you want, the system converges toward it, and your job is mostly done. Write the YAML, apply it, the scheduler places your pods, the controllers reconcile state, and everything just works. This is roughly true until it isn’t.

The thing about Kubernetes, and EKS specifically, is that it doesn’t fail like a monolith fails. A monolith crashes and you know it. EKS degrades. DNS gets slow. A node hits a network limit you didn’t know existed. Pods keep running but their connections reset every 6 minutes. The dashboard is green. Customers are complaining. You’re staring at healthy pods wondering what’s wrong with your application, when the real problem is three layers down in a conntrack table or a subnet that ran out of IPs.
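As one example of looking "three layers down", here is a minimal sketch of a conntrack fullness check, meant to be run on a Linux node. It assumes the standard netfilter counters under `/proc/sys/net/netfilter`. The `WARN_RATIO` threshold and the function names are assumptions made for illustration, not values from this guide.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/netfilter")
WARN_RATIO = 0.8  # assumed alert threshold; tune for your environment


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack(proc: Path = PROC) -> tuple[int, int]:
    """Read (current entries, table size) from the kernel counters."""
    count = int((proc / "nf_conntrack_count").read_text())
    maximum = int((proc / "nf_conntrack_max").read_text())
    return count, maximum


if __name__ == "__main__":
    try:
        count, maximum = read_conntrack()
    except (OSError, ValueError):
        # Not a Linux host, or the conntrack module is not loaded.
        print("conntrack counters not available on this host")
    else:
        ratio = conntrack_usage(count, maximum)
        status = "WARN" if ratio >= WARN_RATIO else "ok"
        print(f"{status}: conntrack {count}/{maximum} ({ratio:.0%})")
```

A table that is close to full is worth ruling out early, since new connections can be dropped at the node while every pod on it still reports healthy.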

Most other platforms fail at the boundary between your code and the infrastructure. EKS fails inside the infrastructure, in ways that look like your code is broken. This is the fundamental debugging challenge: the symptom is always “the app is slow” or “requests are failing”, and the cause is somewhere in a stack of networking, scheduling, storage, and AWS service interactions that your application has no visibility into.

This matters because the instinct (“my app is returning 5xx, let me look at my app”) is wrong most of the time in EKS. The 5xx is real. But the fix is often in a probe configuration, a security group limit, a DNS resolver being overwhelmed, or a node that silently filled its conntrack table.

The two jobs

If you run EKS in production, you have two jobs.

The first is building workloads that survive the platform misbehaving. Probes that don’t cascade. Graceful shutdowns that actually drain. Pod distributions that tolerate losing a node or an AZ without paging anyone. This is the preventive work, the engineering equivalent of washing your hands.

The second is diagnosing live systems when things go wrong anyway. Connecting to a cluster that’s on fire, figuring out what’s actually broken versus what’s just symptomatic, collecting evidence before it disappears, and fixing the right thing without making the incident worse. This is the equivalent of surgery: you’re operating on a patient that’s still awake and serving traffic.

Both matter. Most guides only cover the first one. This guide is about both.

It’s a collection of patterns, failure modes, and diagnostic workflows that came from running EKS in production: the things that caused real incidents, the things that made debugging take hours instead of minutes, and the guardrails that prevented repeat occurrences.

Who is this for?

This guide is not for beginners. There’s a gap between knowing Kubernetes concepts (pods, deployments, services, kubectl) and actually being able to keep an EKS cluster healthy in production. There’s a fumbling phase where you’ve read the docs, maybe passed the certification, and deployed some workloads. Then something breaks at 2am and you realize you don’t know where to look or what’s safe to touch.

This assumes you know the basics. It does not assume you know how to debug a cluster that’s misbehaving, how EKS-specific failure modes differ from generic Kubernetes ones, or what the safe sequence of actions is when you’re staring at a production incident.

What you won’t find here: how to set up EKS, what a pod is, or how to write a Deployment manifest. What you will find: what to do when pods are Pending and you don’t know why, how to tell if DNS is the problem or just a symptom, why your NLB keeps resetting connections, and how to collect evidence before the cluster auto-heals and destroys your ability to do an RCA.
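To give a flavor of the "pods are Pending and you don’t know why" case, here is a minimal sketch that groups Pending pods by the scheduler's stated reason. It assumes input shaped like the output of `kubectl get pods -A -o json`, and it relies on the standard `PodScheduled` pod condition. The function name, the fallback label, and the `SAMPLE` data are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical label for Pending pods with no scheduling failure recorded.
NOT_SCHEDULER = "no scheduling failure recorded (check events: image pull, volumes, CNI)"


def pending_reasons(pod_list: dict) -> Counter:
    """Count Pending pods by the message on a failed PodScheduled condition."""
    reasons = Counter()
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                reasons[cond.get("message") or cond.get("reason", "unknown")] += 1
                break
        else:
            # Pending, but the scheduler has not reported a failure.
            reasons[NOT_SCHEDULER] += 1
    return reasons


# Made-up sample in the shape of `kubectl get pods -A -o json`.
SAMPLE = {
    "items": [
        {"status": {"phase": "Running"}},
        {"status": {"phase": "Pending", "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable",
             "message": "0/3 nodes are available: 3 Insufficient cpu."}]}},
        {"status": {"phase": "Pending", "conditions": [
            {"type": "PodScheduled", "status": "True"}]}},
    ]
}

for reason, n in pending_reasons(SAMPLE).most_common():
    print(f"{n:4d}  {reason}")
```

Grouping by message separates pods the scheduler cannot place from pods that were placed but are stuck on the node. Those two groups usually point to different failure domains.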

How to read this guide

The guide is organized by domain: networking, storage, security, observability, scaling, upgrades, and so on. Each section mixes both jobs: how to build it right, and how to debug it when it breaks. You’ll find design patterns and diagnostic runbooks side by side, because in practice you need both at the same time.

You can read it front-to-back if you’re setting up a new cluster or onboarding to an existing one. Or you can jump to the relevant section when something breaks; each one is self-contained enough to be useful on its own. If the cluster is on fire right now, start at Section 1. It gives you a triage sequence to identify the failure domain in under 2 minutes.

How to dive into an EKS cluster

When production is broken, the first job is to mitigate: stop the bleeding, restore service, reduce blast radius. But you can’t mitigate effectively if you don’t know what’s broken. Roll back the wrong thing and you’ve wasted 10 minutes. Upsize the wrong component and nothing changes. So the actual first job is: figure