Responding to a Compromised AWS Access Key
You wake up to this email from AWS: “Irregular Activity Detected for Your AWS Access Key”. “As part of our standard monitoring of AWS systems, we observed anomalous activity in your AWS account that indicated your AWS access key(s), along with the corresponding secret key, may have been inappropriately accessed by a third party.”
Your stomach drops. The email lists the compromised access key: AKIA1234567890ABCDEF. User: app-integration-user. Event: GetCallerIdentity. Time: yesterday at 12:11:58 UTC. IP: 198.51.100.50.
AWS gives you four steps: Rotate the key. Check CloudTrail for unwanted activity. Review the account for unexpected usage. Respond to the support case. Four steps. Clean. Linear. Assumes everything goes right. It won’t.
What AWS Documentation Assumes
AWS’s steps assume:
- CloudTrail is already enabled and the logs are queryable.
- Someone on your team knows how to read CloudTrail.
- You have time to investigate without pressure.
- The only damage is the exposed key.
- Rotating the key is enough to fix it.
In reality:
- CloudTrail might not be enabled, or it is enabled but the logs sit in an S3 bucket nobody checks.
- The person who set up the account left months ago.
- You have 4 hours before customers start calling about errors.
- The attacker might have created backdoor credentials, roles, or policies while they were in.
Rotating the key stops them from using that key. But if they left IAM users, access keys, or assumed-role sessions behind, you’re still exposed.
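A quick way to find out which situation you are in is to ask CloudTrail which trails exist and whether each one is logging. The sketch below is a minimal illustration, assuming a boto3 CloudTrail client is passed in as `cloudtrail`; the function name `summarize_trails` and the output field names are made up for this example. Even with no trail configured, the CloudTrail Event history keeps the last 90 days of management events per region, so an empty result does not necessarily mean you have nothing to look at.

```python
def summarize_trails(cloudtrail):
    """Return one dict per trail: name, log bucket, multi-region flag, logging state.

    `cloudtrail` is assumed to be a boto3 CloudTrail client (or any object
    exposing describe_trails and get_trail_status with the same shapes).
    """
    summary = []
    for trail in cloudtrail.describe_trails().get("trailList", []):
        # Pass the ARN rather than the name so that trails replicated from
        # another region can be queried as well.
        status = cloudtrail.get_trail_status(Name=trail["TrailARN"])
        summary.append(
            {
                "name": trail["Name"],
                "bucket": trail.get("S3BucketName"),
                "multi_region": trail.get("IsMultiRegionTrail", False),
                "logging": status.get("IsLogging", False),
            }
        )
    return summary
```

With boto3 installed and credentials configured (both assumptions here), `summarize_trails(boto3.client("cloudtrail"))` returns an empty list when no trail exists, and a trail with `logging` set to `False` is one that was created but later stopped.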
What Actually Happened
You look at the details. The compromised key belongs to app-integration-user, a user who was only supposed to send emails via SES. Instead, someone called GetCallerIdentity from IP 198.51.100.50 at 12:11 UTC. (If the compromised key is your root account’s access key: this is a P1 incident. Root cannot be restricted by IAM policies. Rotate immediately, audit all root activity over at least the last 30 days, and contact AWS Security right now.)
That one call tells you:
- The key was exfiltrated, not guessed by brute force.
- The attacker tested it immediately to confirm it works.
- They got basic information about your account and the identity behind the key (GetCallerIdentity returns the account ID, the user ID, and the ARN).
- Any further calls came after that test.
Now you need to answer: What did they do next? This is where the 4-step plan breaks down. AWS doesn’t tell you how to find that out if your logs aren’t ready.
The Three Things That Actually Save You
1. Access to CloudTrail, Even If It’s Basic
If CloudTrail is off or inaccessible, you’re blind. You can’t answer the question: What happened after that GetCallerIdentity call? If CloudTrail is on, you can use the CLI to look up events. It’s not glamorous, but it works. It shows you the sequence: GetCallerIdentity → what came next.
In a typical reconnaissance scenario, that query might show: GetCallerIdentity → ListUsers → ListAccessKeys → ListRoles → ListPolicies → GetUser. The attacker is doing reconnaissance, mapping your account structure. That tells you what they might do next: assume the admin role, create a backdoor key, or escalate. Without CloudTrail, you’re guessing. With CloudTrail, even basic, you have facts.
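The same lookup can be scripted. The following is a minimal sketch rather than a finished tool: it assumes a boto3 CloudTrail client is passed in as `cloudtrail`, and the helper names `events_for_key` and `call_sequence` are invented for this example. It filters on the `AccessKeyId` lookup attribute and sorts the results oldest first so the sequence reads in the order the attacker made the calls.

```python
from datetime import datetime, timedelta, timezone


def events_for_key(cloudtrail, access_key_id, days=30, now=None):
    """Return (time, source, name) tuples for one access key, oldest first.

    `cloudtrail` is assumed to be a boto3 CloudTrail client. `now` exists
    only so the time window can be pinned in tests.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    paginator = cloudtrail.get_paginator("lookup_events")
    pages = paginator.paginate(
        # lookup_events accepts a single lookup attribute per request.
        LookupAttributes=[
            {"AttributeKey": "AccessKeyId", "AttributeValue": access_key_id}
        ],
        StartTime=start,
        EndTime=end,
    )
    found = []
    for page in pages:
        for event in page.get("Events", []):
            found.append(
                (event["EventTime"], event.get("EventSource", ""), event["EventName"])
            )
    # Sort by timestamp so the activity reads in chronological order.
    found.sort(key=lambda item: item[0])
    return found


def call_sequence(events):
    """Render the event names as a single readable chain."""
    return " -> ".join(name for _, _, name in events)
```

Two limits are worth keeping in mind. `lookup_events` only covers management events from the last 90 days, and it is regional, so the query has to be repeated in every region the attacker could have touched. With boto3 installed and credentials configured (assumptions here), `call_sequence(events_for_key(boto3.client("cloudtrail"), "AKIA1234567890ABCDEF"))` prints the chain for the example key.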
2. A Playbook
The four AWS steps are necessary but insufficient. A playbook is what you execute while following those steps. A minimal playbook includes:
- Immediate (first 30 minutes): Do NOT delete the exposed key yet. When the time comes to block it, mark it as inactive rather than deleting it. Query CloudTrail for all events from that key in the last 30 days. Check if that key was used to assume any roles.
- Investigation (first 2 hours): Look for new IAM users, roles, or policies. Check for API calls to sensitive services (RDS, Secrets Manager, KMS). Verify the alternate contacts (Billing, Operations, Security) were not modified.
- Containment (2-4 hours): Once the new key is confirmed working, mark the exposed key as inactive. Delete backdoor credentials. Take snapshots for forensics.
- Post-incident (next business day): Review all other IAM users. Check S3 bucket policies, security group rules, and VPC peering. Enable MFA on all human IAM users and the root account.
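Much of the investigation phase comes down to sorting CloudTrail records into the questions the playbook asks. The sketch below shows one way to do that, assuming each record is a dict in CloudTrail's record format (`eventName`, `eventSource`, `eventTime`). The event-name sets are illustrative and deliberately incomplete, and the names `triage` and `record_from_lookup` are hypothetical.

```python
import json

# Illustrative, not exhaustive: event names grouped by playbook concern.
PERSISTENCE_EVENTS = {
    "CreateUser", "CreateAccessKey", "CreateLoginProfile",
    "CreateRole", "CreatePolicy", "AttachUserPolicy",
    "AttachRolePolicy", "PutUserPolicy", "PutRolePolicy",
    "UpdateAssumeRolePolicy",
}
ROLE_EVENTS = {"AssumeRole"}
CONTACT_EVENTS = {"PutAlternateContact", "DeleteAlternateContact"}
SENSITIVE_SOURCES = {
    "rds.amazonaws.com",
    "secretsmanager.amazonaws.com",
    "kms.amazonaws.com",
}


def record_from_lookup(event):
    """Decode the full record that lookup_events embeds as a JSON string."""
    return json.loads(event["CloudTrailEvent"])


def triage(records):
    """Group CloudTrail records by the playbook question they answer."""
    findings = {
        "persistence": [],
        "role_assumption": [],
        "contact_change": [],
        "sensitive_service": [],
    }
    for record in records:
        name = record.get("eventName", "")
        entry = (record.get("eventTime", ""), name)
        # Independent checks: one record may belong to several groups.
        if name in PERSISTENCE_EVENTS:
            findings["persistence"].append(entry)
        if name in ROLE_EVENTS:
            findings["role_assumption"].append(entry)
        if name in CONTACT_EVENTS:
            findings["contact_change"].append(entry)
        if record.get("eventSource") in SENSITIVE_SOURCES:
            findings["sensitive_service"].append(entry)
    return findings
```

An empty `role_assumption` list is good news, but a non-empty one needs follow-up: credentials issued by AssumeRole carry their own temporary access key ID, so whatever the attacker did under the assumed role will not appear under the exposed key's ID and has to be searched for separately.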
3. Rotate the Key Without Breaking Your Application
Here’s the trap: the application is running in production right now. If you delete the key immediately, the application fails. If you rotate the key slowly, the attacker still has access. The solution is simple: rotate before you block. Create a new access key for the user right now. Update your application to use the new key (redeploy or restart). Test that the application works.
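The ordering matters more than the individual calls, so here is a minimal sketch of "rotate before you block". It assumes a boto3 IAM client is passed in as `iam`. The `switch_app` callback is hypothetical: it stands in for whatever your deployment does to install the new key and verify the application, and it must return `True` only once the application is confirmed working.

```python
def rotate_then_block(iam, user_name, exposed_key_id, switch_app):
    """Create a new key, hand it to the app, then deactivate the exposed key.

    `iam` is assumed to be a boto3 IAM client. `switch_app` is a
    hypothetical callback taking (key_id, secret) and returning True once
    the application has been redeployed and verified on the new key.
    """
    new_key = iam.create_access_key(UserName=user_name)["AccessKey"]
    new_key_id = new_key["AccessKeyId"]

    if not switch_app(new_key_id, new_key["SecretAccessKey"]):
        # The app is not healthy on the new key. Leave the exposed key
        # active so production keeps running, and let a human decide.
        return {"new_key_id": new_key_id, "exposed_key_blocked": False}

    # Inactive rather than deleted: the change can be reversed if
    # something still depends on the old key.
    iam.update_access_key(
        UserName=user_name,
        AccessKeyId=exposed_key_id,
        Status="Inactive",
    )
    return {"new_key_id": new_key_id, "exposed_key_blocked": True}
```

One practical catch: an IAM user can hold at most two access keys, so `create_access_key` fails if app-integration-user already has two. In that case one of them has to go before the replacement can be created, and if the attacker created the second key, it is evidence worth recording first.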