I cut my AWS bill by 93% by ditching Fargate for a single Lightsail VM
TL;DR: I built ToolMango, an AI tools directory, on AWS Fargate. The bill came to $345/mo before any traffic. I migrated to a single $12 Lightsail VM in an afternoon and cut costs by 93%, keeping the same Next.js + Postgres + Redis + BullMQ stack alive. Here's exactly what I changed, what broke, and what I'd do differently.
What ToolMango is (so the cost numbers make sense)
ToolMango is an editorial directory of AI tools. It scores each tool on an ROI Score (cost, time-to-value, output quality, free-tier generosity, category fit, reader engagement) and ranks them before knowing whether the tool has an affiliate program. Tools we don't earn from frequently outrank tools we do.
Tech stack:
- Next.js 14 App Router
- Postgres 16
- Redis (BullMQ for the agent job queue)
- Anthropic Claude Sonnet for editorial agents (research, SEO sweep, social drafts)
- A worker process running 5 cron schedules
- Pre-revenue. Brand new domain. ~106 tools indexed at the time of writing.
The original Fargate setup
I started on AWS because I had CDK boilerplate from another project. The architecture was over-engineered for a directory site getting zero traffic:
- CloudFront → ALB → Fargate (web ×2 tasks, worker ×1)
- Aurora Serverless v2 (writer)
- ElastiCache (Redis, t4g.small ×2)
- NAT ×2 (multi-AZ)
- VPC + interface endpoints
- WAF (managed rule sets)
The CDK code is clean. It deploys with one command, it autoscales, and it survives an AZ failure. It's exactly what a series-A SaaS would run. It's also $345/mo for zero users.
What was actually costing money
I broke it down with `aws ce get-cost-and-usage` and a few `aws ecs describe-task-definition` calls:
| Resource | $/mo |
|---|---|
| Aurora Serverless v2 (no auto-pause, 0.5 ACU min) | $86 |
| Fargate ARM64 (3 tasks: 2× web at 1vCPU/2GB + 1× worker at 0.5/1GB) | $71 |
| 2× NAT Gateways (multi-AZ) | $65 |
| VPC interface endpoints (Secrets Manager × 3 AZ + others) | $40 |
| ALB + WAF | $34 |
| CloudWatch + Container Insights | $15 |
| Public IPv4 charges | $15 |
| ElastiCache (cache.t4g.small ×2 nodes) | $11 |
| Misc (CloudFront, Secrets, Route53, S3) | $8 |
| Total | $345 |
The killer insight: roughly $150/mo of that bill is "infrastructure plumbing" (NAT $65, ALB + WAF $34, VPC interface endpoints $40, ElastiCache $11). None of it does real work for the application; it all exists to support the architecture itself. That's the floor on a Fargate setup, and for a pre-revenue project it's nuts.
Phase 1: Skeleton mode on AWS
Before migrating, I first tried to make Fargate cheap. The CDK changes I shipped:
// Aurora: enable auto-pause when idle
const cfnCluster = cluster.node.defaultChild as rds.CfnDBCluster;
cfnCluster.serverlessV2ScalingConfiguration = {
minCapacity: 0, // was 0.5 — auto-pause after 5 min idle
maxCapacity: 2, // was 4
secondsUntilAutoPause: 300,
};
// Network: 1 NAT instead of 2
natGateways: 1, // was 2 (multi-AZ)
// Web: smaller, fewer tasks, autoscale up if needed
desiredCount: 1, // was 2
cpu: 512, // was 1024
memoryLimitMiB: 1024, // was 2048
// Worker on Fargate Spot
capacityProviderStrategies: [
{ capacityProvider: "FARGATE_SPOT", weight: 4 },
{ capacityProvider: "FARGATE", weight: 1 },
],
// Container Insights off
containerInsightsV2: ecs.ContainerInsights.DISABLED,
// Backup retention
backup: { retention: cdk.Duration.days(1) }, // was 14
// WAF: removed entirely (CloudFront has free Shield Standard)
Result: $345/mo → ~$140/mo. Better, but still ridiculous for a pre-revenue project. Why it stopped at $140: NAT, ALB, ElastiCache, VPC endpoints, and Aurora storage all have hard price floors. You can't make Fargate genuinely cheap, because the architecture itself isn't designed for cheap.
Phase 2: The honest migration
Lightsail is AWS's "give me a Linux VM and stop overthinking it" tier: $12/mo buys 2 vCPU, 2GB RAM, 60GB SSD, and 3TB of transfer, with a static IP and a firewall included. The plan: run everything on one VM with Docker Compose.
(Docker Compose configuration omitted for brevity)
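The actual file isn't shown here, but a minimal sketch of a single-VM Compose stack like this could look as follows. Service names, image tags, and the worker entrypoint are my assumptions; only the `tmadmin` user and `toolmango` database come from the restore command used later:

```yaml
# Hypothetical sketch, not the author's actual compose file.
services:
  web:                                 # Next.js app
    build: .
    ports: ["127.0.0.1:3000:3000"]     # Caddy on the host proxies to this
    depends_on: [postgres, redis]
  worker:                              # BullMQ worker running the cron schedules
    build: .
    command: ["node", "dist/worker.js"]
    depends_on: [postgres, redis]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tmadmin
      POSTGRES_DB: toolmango
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes: ["pgdata:/var/lib/postgresql/data"]
  redis:
    image: redis:7
volumes:
  pgdata:
```

Binding the web port to 127.0.0.1 keeps the app unreachable except through the host's reverse proxy, which matches a Caddy instance running on the VM itself.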
For HTTPS termination: Caddy, which auto-issues Let's Encrypt certificates on first request. The configuration is one stanza:
toolmango.com, www.toolmango.com {
reverse_proxy 127.0.0.1:3000
encode gzip zstd
header {
Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
X-Content-Type-Options "nosniff"
}
}
Reload Caddy, and it fetches the cert. Total setup time: 30 seconds.
Migrating Aurora data to local Postgres
Aurora sits in a private subnet (PRIVATE_ISOLATED), so I couldn't run pg_dump from outside. The workaround: spin up a one-off ECS Fargate task in the existing VPC that runs pg_dump and uploads the result to S3.
(Command details omitted)
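For illustration, the one-off dump task might be launched something like this; the cluster name, network IDs, task definition, container name, and bucket are all placeholders, not the real values:

```shell
# Run a one-off Fargate task inside the existing VPC that dumps Aurora to S3.
# (All IDs below are placeholders.)
aws ecs run-task \
  --cluster toolmango \
  --launch-type FARGATE \
  --task-definition db-dump \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-xxxxxxxx],securityGroups=[sg-xxxxxxxx]}' \
  --overrides '{"containerOverrides":[{"name":"dump","command":["sh","-c","pg_dump \"$DATABASE_URL\" | gzip | aws s3 cp - s3://example-bucket/dump.sql.gz"]}]}'

# Then mint a presigned URL for the VM to download the dump with:
aws s3 presign s3://example-bucket/dump.sql.gz --expires-in 3600
```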
On the Lightsail VM, pull the dump from S3 (via a presigned URL, since Lightsail VMs don't have IAM roles by default), gunzip it, and pipe it into the local Postgres container:
gunzip -c /tmp/dump.sql.gz | docker compose exec -T postgres psql -U tmadmin -d toolmango
All 64 published tools transferred cleanly, ~485KB of data in total (it's a directory site).