What job interviews taught me about Kubernetes
So I’ve been job hunting lately: reading job postings, doing interviews, talking to engineering teams at about a dozen companies. And I noticed something compared to five years ago, when I was last doing this: everyone is on Kubernetes now. Every single company I talked to.
Last time I was job hunting, that wasn’t the case at all. There were basically three camps: the rare Kubernetes adopters, the systemd-on-VM/VPS/EC2 crowd, and the serverless people (Lambda, Cloud Run, etc.). So seeing K8s everywhere surprised me. Where I work we have actual Big Tech-scale problems, so K8s makes obvious sense for us. But a 10-person startup with two services? None of these places were doing microservices or anything close to high scale. So I asked why. Spoiler: they don’t care much about the technical side of K8s.
Why? A technical interview is actually a great place to ask, especially when you’re talking directly to the CTO. So I did. The answers were basically the same everywhere.
Uniformity
The first one was uniformity. Every service deploys the same way. Nobody secretly knows that the payments service runs on some bare VM with a cursed bash script from 2019 while the API is on Docker Compose because nobody ever touched it. One way to deploy, for everything.
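To make "one way to deploy" a bit more concrete, here’s a rough Python sketch of what that uniformity looks like: two unrelated services described by exactly the same Deployment shape. The helper name `deployment_manifest`, the service names, and the image URLs are all made up for illustration; only the field layout follows the standard `apps/v1` Deployment schema. In practice this would usually live in a Helm chart rather than in Python.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 1) -> dict:
    """Build a minimal apps/v1 Deployment as a plain dict.

    Hypothetical helper for illustration only; real setups typically
    generate this from a Helm chart or a similar template.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Two unrelated services (made-up names and images), one shape.
payments = deployment_manifest(
    "payments", "registry.example.com/payments:1.4.2", replicas=2
)
api = deployment_manifest("api", "registry.example.com/api:0.9.0")

if __name__ == "__main__":
    print(json.dumps(payments, indent=2))
```

Whoever reads the payments manifest already knows how to read the API one, which is the whole point.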
Standardized knowledge
The second was shared, hireable knowledge. K8s is basically a lingua franca now. On my first day at my current job, I pulled up the repo with the Helm charts and Kube configs and had a solid picture of the whole architecture within an hour. The knowledge is in the YAML, not stuck in someone’s head. Lose someone, and their replacement isn’t spending three weeks digging through docs trying to figure out how anything runs.
At my current company, on-call SREs can keep any service up even if they’ve never touched it before. They know Kubernetes, and Kubernetes patterns are the same for every team. Try doing that with a bunch of VMs where every service is set up differently. (Caveat: this only holds if nobody went exotic with the setup, of course.)
Tracing who does what
The third was traceability (with or without compliance). At my current company, nobody can just run kubectl apply -f something straight against the cluster. You push a Helm chart to git, there’s a trace, there’s an MR approval process, then FluxCD or ArgoCD handles the actual deployment. Nothing happens in the shadows. That composes really well with compliance: it’s basically how we ace ISO certifications. And since GitOps pairs naturally with Kubernetes, you get all of that almost for free.
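The core idea behind that GitOps flow can be sketched in a few lines of Python: git holds the desired state, the cluster holds the live state, and a controller keeps computing the difference. This is a toy model under heavy assumptions, not how FluxCD or ArgoCD are actually implemented; the `reconcile` function, the service names, and the tags are invented, and real controllers diff full Kubernetes objects rather than a name-to-tag mapping.

```python
from __future__ import annotations


def reconcile(desired: dict[str, str], live: dict[str, str]) -> list[tuple[str, str]]:
    """Return the actions needed to make `live` match `desired`.

    Both dicts map a service name to an image tag: the one merged in git
    (desired) and the one currently running (live). Toy model only.
    """
    actions = []
    for name, tag in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != tag:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Running in the cluster but unknown to git, e.g. something
            # applied by hand outside the merge request process.
            actions.append(("delete", name))
    return actions


# Desired state as merged in git (hypothetical services and tags).
desired = {"api": "0.9.0", "payments": "1.4.2"}
# Live state, including a hand-deployed service nobody reviewed.
live = {"api": "0.8.0", "debug-shell": "latest"}

print(reconcile(desired, live))
# → [('update', 'api'), ('create', 'payments'), ('delete', 'debug-shell')]
```

The last action is the traceability part: anything that didn’t go through git shows up as drift. Whether a real tool actually deletes it depends on how pruning is configured.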
What I took from it
The CTOs I talked to aren’t making a dumb choice. They’re solving real problems. I was focused on the technical side only, and for me Kube has always been a technical solution to technical problems. But it looks like a lot of CTOs are primarily interested in the non-tech benefits, more than I thought. Their technical problems just don’t require it. I bet you won’t find any topologySpreadConstraints in their manifests, because they don’t care. No HPA, no Pod Disruption Budgets, no node affinity rules. Just the same number of nodes as they’d otherwise have VMs. But they’ve accepted the cost of running a complex piece of software in exchange for the organizational benefits.
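If you wanted to check that bet against a real repo, a small script along these lines would do it. It’s a sketch that assumes the manifests have already been parsed into Python dicts (for example with a YAML library, not shown here); the function name `advanced_features` and the sample manifests are made up. It only looks at the pod spec under spec.template.spec, which is where workload kinds such as Deployment keep it.

```python
from __future__ import annotations

# Kinds and pod-spec fields that signal scaling or scheduling tuning.
ADVANCED_KINDS = {"HorizontalPodAutoscaler", "PodDisruptionBudget"}
# "affinity" covers node affinity as well as pod (anti-)affinity.
ADVANCED_POD_FIELDS = {"topologySpreadConstraints", "affinity"}


def advanced_features(manifests: list[dict]) -> set[str]:
    """Return the advanced features found in a list of parsed manifests."""
    found: set[str] = set()
    for manifest in manifests:
        kind = manifest.get("kind", "")
        if kind in ADVANCED_KINDS:
            found.add(kind)
        # Kinds without a pod template simply yield an empty dict here.
        pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
        found.update(ADVANCED_POD_FIELDS.intersection(pod_spec))
    return found


# A plain Deployment, the kind of manifest the small companies likely have.
simple = [
    {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "api"},
        "spec": {
            "replicas": 1,
            "template": {
                "spec": {
                    "containers": [
                        {"name": "api", "image": "registry.example.com/api:0.9.0"}
                    ]
                }
            },
        },
    }
]

print(advanced_features(simple))
# → set()
```

An empty set is what I’d expect from most of the companies I talked to, and for their use case that’s fine.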
Honestly, I think it’s mostly fine. But I still think most companies should start without it. Clusters are genuinely hard to debug when stuff goes wrong, and at an early stage you want your energy on the product, not the infra. When you’re still pitching to your next big customer, spinning up a VPS and doing a dirty git pull is a totally valid emergency fix. Suboptimal, sure. But fast, and you know exactly what’s happening. You really don’t want to spend two hours figuring out why your pod is stuck in CrashLoopBackOff right before a customer call.
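For a sense of what even the first step of that debugging looks like, here’s a minimal Python sketch that finds crash-looping containers in the JSON that `kubectl get pods -o json` returns. The function name `crashlooping` and the sample data are invented; the sample is hand-written and heavily trimmed, keeping only the fields the function reads.

```python
from __future__ import annotations


def crashlooping(pod_list: dict) -> list[str]:
    """Return 'pod/container' for each container waiting in CrashLoopBackOff."""
    stuck = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # A container state has exactly one of: running, waiting, terminated.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                stuck.append(f"{pod_name}/{status['name']}")
    return stuck


# Hand-written sample in the shape of `kubectl get pods -o json` (trimmed).
sample = {
    "items": [
        {
            "metadata": {"name": "api-5c9d7-x2k4q"},
            "status": {
                "containerStatuses": [
                    {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ]
            },
        },
        {
            "metadata": {"name": "payments-7f6b8-p9z1m"},
            "status": {
                "containerStatuses": [
                    {"name": "payments", "state": {"running": {}}}
                ]
            },
        },
    ]
}

print(crashlooping(sample))
# → ['api-5c9d7-x2k4q/api']
```

And that only tells you which container is stuck, not why. The why usually means digging through `kubectl describe pod` and `kubectl logs --previous`, which is where the two hours go.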
Why the shift happened recently
I still don’t totally get why the shift happened when it did. Five years ago all three camps were doing fine. Now the VM+systemd crowd has basically disappeared from job postings, serverless stayed niche, and K8s just won. My best guesses: managed K8s (EKS, GKE, AKS) matured, and the talent pool flipped, with enough people learning it that hiring for anything else became the harder choice. And Helm made “just use someone else’s chart” a real option. But I’m not certain. If you were there for the shift and have a better theory, I’d genuinely like to know.
When to use Kubernetes
My personal threshold would be the moment the CTO isn’t the only engineer anymore. As soon as a second person shows up, the problems K8s solves become real. Now you’ve got someone who didn’t set up the servers but needs to deploy. Someone who needs proper access controls, not SSH keys to everything. Someone who’ll leave eventually and take everything they know with them. That’s when you want the system to hold the knowledge, not people.