Great Stack to Doesn't Work #3 — Redis: "99% Cache Hit Ratio, System Down"
A survival guide for when everything goes wrong in production.
Your Redis dashboard looks perfect. Hit ratio: 99.2%. Latency: sub-millisecond. Memory usage: 60% of available. Every metric says healthy. Then at 2:47 PM, your API starts returning 500s. Response times spike to 30 seconds. Users can’t log in. The dashboard still shows a 99% hit ratio because the cache is working: it’s serving cached errors to everyone equally fast. Redis is doing exactly what you told it to do. The problem is what you told it to do.
Why Single-Threaded Is Fast (Until It Isn’t)
Redis processes commands on a single thread. No locks. No context switching. No synchronization overhead. One CPU core, fully utilized, can handle 100K+ operations per second because it never waits for another thread to release a lock. The event loop model (similar to Node.js) multiplexes thousands of client connections on a single thread using non-blocking I/O. Read a request, process it, write the response, move to the next.
When your commands are simple (GET, SET, INCR), each one takes microseconds. The trap: slow commands block everything. KEYS * on a million-key database? That’s a full keyspace scan on the main thread, and while it runs, every other client waits. SORT on a large set? Same. LRANGE 0 -1 on a list with 10 million elements? Same. Redis 6.0 introduced I/O threading (the io-threads config) to read and write network data on multiple threads, but command execution is still single-threaded. Redis 7.0 improved this further, but the fundamental model hasn’t changed: a long-running command on the main thread stalls everything.
Rules: Never use KEYS in production. Use SCAN instead — it’s cursor-based and returns results incrementally. Watch out for O(N) commands on large data structures: LRANGE, SMEMBERS, HGETALL on million-element structures. Use SLOWLOG to find commands that are blocking the event loop.
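To make the SCAN rule concrete, here is a minimal sketch of a cursor loop. It assumes `client` is a redis-py `redis.Redis` instance (or anything with a compatible `scan` method); the helper name `iter_keys` and the `batch_hint` parameter are made up for illustration.

```python
def iter_keys(client, pattern, batch_hint=500):
    """Yield keys matching `pattern` using SCAN rather than KEYS."""
    cursor = 0
    while True:
        # Each SCAN call does a bounded amount of work, so other clients
        # are served between calls. COUNT is only a hint, not a page size.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_hint)
        for key in keys:
            yield key
        if cursor == 0:  # the server returns cursor 0 when the scan is complete
            break
```

SCAN can return the same key more than once and individual pages may be empty, so callers should tolerate duplicates. redis-py also ships `scan_iter`, which wraps the same loop.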
Pipelining: The Easiest 10x You’ll Ever Get
Every Redis command involves a network round trip: send request, wait for response. If you’re executing 100 commands sequentially, that’s 100 round trips. At 0.5ms per round trip, you’re waiting 50ms for what should take 1ms of actual processing. Pipelining batches commands into a single network write and reads all responses at once.
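import redis

r = redis.Redis()  # assumption: Redis on localhost:6379
pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC wrapper
for user_id in user_ids:
    pipe.get(f"user:{user_id}:profile")  # queued client-side, not sent yet
results = pipe.execute()  # one round trip; replies come back in request order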
Instead of 100 round trips, you make 1. The server processes all commands in sequence (it’s single-threaded, remember) and buffers the responses. Your client sends the batch, waits once, and gets everything back. Pipelining doesn’t reduce server-side processing time: each command still runs individually. It collapses the network latency from one round trip per command to one round trip per batch, and that latency is almost always the dominant cost for simple commands. The catch: if one command in the pipeline fails, the others still execute. Pipelining is not transactional. If you need atomicity, use MULTI/EXEC or Lua scripts.
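The arithmetic behind that claim can be sketched as a toy latency model. The numbers are the ones used above (100 commands, 0.5 ms per round trip, roughly 1 ms of total server work); the function names are illustrative, and the model ignores bandwidth and client-side overhead.

```python
def sequential_ms(n_commands, rtt_ms, server_ms_per_command):
    # One full round trip per command, plus the server's work for it.
    return n_commands * (rtt_ms + server_ms_per_command)


def pipelined_ms(n_commands, rtt_ms, server_ms_per_command):
    # One round trip for the whole batch; the server work is unchanged.
    return rtt_ms + n_commands * server_ms_per_command


n, rtt, work = 100, 0.5, 0.01  # 0.01 ms per command = 1 ms of processing in total
print(f"sequential: {sequential_ms(n, rtt, work):.1f} ms")  # sequential: 51.0 ms
print(f"pipelined:  {pipelined_ms(n, rtt, work):.1f} ms")   # pipelined:  1.5 ms
```

The gap widens with network distance: the round-trip term is multiplied by the command count in the sequential case and paid once in the pipelined case.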
Lua Scripting: Atomic Operations Without the Complexity
Redis evaluates Lua scripts atomically. While a script runs, nothing else executes. This makes Lua scripts the right tool for read-modify-write operations that would otherwise need distributed locking. Classic example — rate limiting:
-- KEYS[1] = rate limit key
-- ARGV[1] = max requests
-- ARGV[2] = window in seconds
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0 -- rate limited
end
return 1 -- allowed
This increments a counter and sets expiry atomically. No race condition between INCR and EXPIRE. No chance of two requests both reading “0” and both thinking they’re first. Use EVALSHA instead of EVAL in production. EVALSHA references the script by its SHA1 hash, avoiding sending the full script text with every call. Load the script once with SCRIPT LOAD, then call it by hash. Caveat: Lua scripts block the main thread for their entire duration. Keep them short. A script that queries 10 keys is fine. A script that iterates over 100,000 keys is a production incident waiting to happen.
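One way the SCRIPT LOAD plus EVALSHA flow might look from Python is sketched below. It assumes `client` is a redis-py `redis.Redis` instance exposing `script_load` and `evalsha`; the `RateLimiter` class and the key name in the usage line are hypothetical.

```python
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
end
return 1
"""


class RateLimiter:
    """Hypothetical wrapper: load the script once, then call it by hash."""

    def __init__(self, client, max_requests, window_seconds):
        self.client = client
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # SCRIPT LOAD: the server caches the script and returns its SHA1.
        self.sha = client.script_load(RATE_LIMIT_LUA)

    def allow(self, key):
        # EVALSHA <sha> <numkeys> <key> <max requests> <window seconds>
        result = self.client.evalsha(
            self.sha, 1, key, self.max_requests, self.window_seconds
        )
        return result == 1
```

Usage would be along the lines of `RateLimiter(client, 100, 60).allow("rl:user:42")`. The server's script cache is not durable: after a restart or SCRIPT FLUSH, EVALSHA fails with NOSCRIPT and the script has to be loaded again. redis-py's `register_script` helper handles that reload for you, which is a reasonable alternative to managing the hash by hand.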
Pub/Sub vs Streams: Two Very Different Tools
Pub/Sub is fire-and-forget. Publisher sends a message, all connected subscribers receive it instantly. If a subscriber disconnects and reconnects, it misses everything published while it was gone. No message persistence. No consumer groups. No acknowledgment. Use Pub/Sub for real-time notifications where missing a message is acceptable: chat typing indicators, cache invalidation signals, dashboard live updates.
Streams (introduced in Redis 5.0) are persistent, append-only logs with consumer groups. Think of them as “Kafka Lite inside Redis.”
XADD orders * user_id 42 amount 99.99
XREADGROUP GROUP payment_processors consumer_1 COUNT 10 BLOCK 5000 STREAMS orders >
XACK orders payment_processors 1234567890-0
Streams persist messages. Consumer groups track which consumer has read what. If a consumer dies, its unacknowledged messages stay pending and another consumer can take them over with XCLAIM. You get at-least-once delivery semantics. Use Streams for job queues, event sourcing, and lightweight message processing where you don’t want to deploy Kafka but need more than Pub/Sub. Don’t use Streams to replace Kafka at scale. A Redis stream lives in memory on a single node, while Kafka is designed for multi-broker distributed throughput. Different tools, different scale.
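The read, process, acknowledge cycle from the commands above could be sketched in Python roughly as follows. It assumes `client` is a redis-py `redis.Redis` instance and that the consumer group already exists (for example via XGROUP CREATE); `process_batch` and `handler` are hypothetical names.

```python
def process_batch(client, handler, stream="orders", group="payment_processors",
                  consumer="consumer_1", count=10, block_ms=5000):
    """Read up to `count` new entries for this consumer and ack the handled ones."""
    # ">" asks for entries never delivered to any consumer in this group.
    response = client.xreadgroup(
        group, consumer, {stream: ">"}, count=count, block=block_ms
    )
    handled = 0
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            try:
                handler(entry_id, fields)
            except Exception:
                # No XACK here: the entry stays in the group's pending list,
                # where it can be retried or claimed by another consumer.
                continue
            client.xack(stream, group, entry_id)
            handled += 1
    return handled
```

Because an entry is only acknowledged after the handler succeeds, a crash between processing and XACK leads to redelivery. That is the at-least-once guarantee in practice, so the handler should be idempotent.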
Memory Eviction: The Policy That Saves or Kills You
When Redis hits maxmemory, it needs to decide what to delete. The eviction policy determines what goes.
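The policy is controlled by the `maxmemory-policy` setting. As a rough sketch, assuming `client` is a redis-py `redis.Redis` instance and that the CONFIG command is permitted (managed Redis services often disable it), a startup check might look like this; `ensure_eviction_policy` is a hypothetical helper name.

```python
def ensure_eviction_policy(client, wanted="allkeys-lru"):
    """Set maxmemory-policy if it differs; return the policy that was in effect."""
    current = client.config_get("maxmemory-policy").get("maxmemory-policy")
    if current != wanted:
        # CONFIG SET changes the running server only; it does not survive a
        # restart unless the setting is also written to redis.conf.
        client.config_set("maxmemory-policy", wanted)
    return current
```

Checking the policy at deploy time is one way to catch a server that is still running with a setting nobody chose on purpose.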
- noeviction: Redis returns errors for write commands that would need more memory. Reads still work. Use this when you absolutely cannot lose data and you’d rather fail loudly than silently lose keys. Common for session stores.
- allkeys-lru: Evicts the least recently used key across all keys. The safest general-purpose policy. If you’re using Redis purely as a cache, this should be your default choice.
- volatile-lru: Only evicts keys with a TTL set. Keys without TTL are never evicted. Use this wh…