LMCache / LMCache

A KV Cache Management Layer for Scalable LLM Inference

Blog | Documentation | Join Slack | Community Meeting | Roadmap

Updates

  • [2026/05] 🔥 Agentic workload benchmark on AMD MI300X (blog).
  • [2026/04] 🔥 LMCache’s new multiprocess (MP) architecture release (blog).
  • [2026/03] LMCache at GTC 2026 (post).
  • [2026/01] LMCache multi-node P2P CPU memory sharing, from experimental feature to production (blog).
  • [2025/11] LMCache x CoreWeave accelerate efficient LLM inference for Cohere (blog).
  • [2025/10] LMCache joins the PyTorch Foundation and Tensormesh unveiled (blog, PyTorch).
  • [2025/09] NVIDIA Dynamo integrates LMCache, accelerating LLM inference (blog).
  • [2025/08] 🎉 LMCache hits 5,000+ GitHub stars (blog).
  • [2025/08] LMCache supports gpt-oss (20B/120B) on day 1 (blog).
  • [2025/07] Get faster LLM inference and cheaper responses with LMCache and Redis (Redis blog).
  • [2025/07] LMCache extends its turbo-boost to multimodal models in vLLM V1 (blog).
  • [2025/06] LLM Production Stack goes cross-hardware: AMD, Arm and Ascend (blog).

About

LMCache is a KV cache management layer for LLM inference. It turns the KV cache from temporary state into reusable, AI-native knowledge that can be stored persistently, reused across multiple serving engines, monitored with an observability stack, and transformed for better generation quality. As a result, LMCache reduces TTFT (time to first token) and improves throughput, especially for long-context agentic, multi-turn conversation, and knowledge-augmented (e.g., RAG) workloads.

LMCache is vendor-neutral. It can serve as the KV cache layer across a range of mainstream open-source serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers. This neutrality lets users switch freely between serving engines and storage vendors while reusing the KV caches they have already stored.

Key features

  • Engine-independent deployment: LMCache runs as a standalone daemon process and manages the KV cache independently of the inference engine process, so the KV cache is not lost even if the inference engine crashes (i.e., no fate-sharing with engines).
  • Persistent, tiered KV cache offloading and reuse: Move KV caches out of GPU memory into a tiered storage hierarchy spanning CPU memory, local storage, and remote backends. This enables reuse across requests, sessions, and engine instances, which reduces repeated prefill computation and improves TTFT.
  • Production-level KV cache observability: LMCache provides a rich set of KV cache observability metrics, including typical Kubernetes metrics (health monitoring, performance diagnostics), KV-cache-specific metrics (request-level and token-level prefix cache hits, lifecycle, request-level KV cache performance), management metrics (user-specific usage), and more.
  • Pluggable storage and transport backends: Integrate remote storage and KV transfer backends through a unified interface, enabling KV cache offloading and sharing across storage providers. Through this interface, LMCache supports storage backends including CPU RAM, local disk (SSD), Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS.
  • Non-prefix KV reuse: Extend KV reuse beyond prefix caching by reusing cached KV blocks at any position in the prompt. This relies on CacheBlend, which selectively recomputes tokens to recover generation quality.
  • PD (prefill-decode) disaggregation and KV transfer: Support KV cache transfer from prefill workers to decode workers over NVLink, RDMA, or TCP through transport layers such as NIXL.
  • Pluggable KV transformation: A flexible SERDE (serialization/deserialization) interface lets researchers implement compression, token dropping, and custom serialization.
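To make the tiered offloading and prefix-reuse ideas above more concrete, here is a minimal Python sketch. It is not LMCache's implementation or API: `TieredKVStore`, `chunk_keys`, the 4-token chunk size, and the two-tier layout (a small "cpu" dict in front of a "remote" dict) are all assumptions chosen for illustration. Each chunk is keyed by a hash of the entire prefix up to and including that chunk, so a chunk can only hit when everything before it matches as well.

```python
import hashlib
from typing import Dict, List, Optional

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key hashes the whole prefix so far."""
    keys: List[str] = []
    hasher = hashlib.sha256()
    full_len = len(tokens) - len(tokens) % chunk_size  # ignore a partial tail chunk
    for start in range(0, full_len, chunk_size):
        chunk = tokens[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.hexdigest())  # digest of everything fed in so far
    return keys


class TieredKVStore:
    """Toy two-tier store: a bounded fast tier backed by an unbounded slow tier."""

    def __init__(self, cpu_capacity: int = 2) -> None:
        self.cpu: Dict[str, bytes] = {}     # fast tier (dicts keep insertion order)
        self.remote: Dict[str, bytes] = {}  # slow tier, stands in for disk or remote storage
        self.cpu_capacity = cpu_capacity

    def put(self, key: str, kv_bytes: bytes) -> None:
        # Insert into the fast tier, then demote the oldest entries if it overflows.
        self.cpu.pop(key, None)
        self.cpu[key] = kv_bytes
        while len(self.cpu) > self.cpu_capacity:
            oldest = next(iter(self.cpu))
            self.remote[oldest] = self.cpu.pop(oldest)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.cpu:
            return self.cpu[key]
        if key in self.remote:
            value = self.remote.pop(key)
            self.put(key, value)  # promote back to the fast tier on a hit
            return value
        return None

    def longest_prefix_hit(self, tokens: List[int]) -> int:
        """Return how many leading tokens have cached KV in any tier."""
        hit_tokens = 0
        for index, key in enumerate(chunk_keys(tokens)):
            if self.get(key) is None:
                break
            hit_tokens = (index + 1) * CHUNK_SIZE
        return hit_tokens


store = TieredKVStore(cpu_capacity=2)
first_prompt = list(range(12))  # three full chunks
for key in chunk_keys(first_prompt):
    store.put(key, b"kv-bytes-placeholder")  # real KV tensors would go here

second_prompt = list(range(8)) + [99, 98, 97, 96]  # shares the first 8 tokens
reusable = store.longest_prefix_hit(second_prompt)
print(reusable)  # 8
```

In this toy setup the first chunk has already been demoted to the slow tier by the time the second prompt arrives, yet it still counts as a hit. That is the point of tiering: a serving engine would only need to run prefill for the tokens after the reusable prefix.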

LMCache is becoming an integral layer of the LLM inference ecosystem, with community-driven integrations across serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers.

Getting Started

To use LMCache, install the lmcache package with your package manager, e.g. pip:

    pip install lmcache
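After installing, one way to confirm that the package is visible to the current Python interpreter is to query its installed version through the standard library. This is only a sketch: the helper name `installed_version` is hypothetical, and it checks that a distribution named `lmcache` is installed, not that it is configured correctly for a particular serving engine.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "lmcache") -> Optional[str]:
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"lmcache version: {found}" if found else "lmcache is not installed")
```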

For more setup options and examples, see: Installation | Quickstart | LMCache Recipes | CLI Reference | Benchmarking Guide | Production Deployment

Contributing

We welcome and value contributions and collaborations. Join us in improving LMCache. Check out the Contributing Guide or join our Slack community to get started.

Adoption and Partnerships

LMCache has a growing community of developers, researchers, industry adopters, and partners building the next generation of efficient LLM inference systems. As an independent open-source project, LMCache is becoming the de facto standard for KV cache management in LLM inference. Its continued development and community work are supported in part by Tensormesh.