A couple million lines of Haskell: Production engineering at Mercury

Ian Duncan | March 30, 2026 | #Production #Mercury

The editors of the Haskell Blog are happy to announce a new series of articles called “Haskellers from the trenches”, where we invite experienced engineers to talk about their subjects of expertise, best practices, and production tales. Engineering rigour and artistic creativity are a fantastic combination, and this series aims to be the synthesis of these two aspects within the Haskell world.

I first heard about Haskell when I was sixteen, sitting in a high school computer science class where we were writing Java and learning, among other things, that NullPointerException was apparently a lifestyle choice if you decided to go into software development. While looking at the /r/programming subreddit after school, I stumbled across a reference to a language where null pointer exceptions simply could not happen, where the type system could prevent an entire category of bugs that I had been fighting with every week. Haskell. I was immediately, hopelessly enamored with the idea.

I have been writing Haskell for nearly two decades now, and I still think the value proposition I fell in love with at sixteen was basically right. What took me longer to learn is what that promise looks like after a codebase gets large, the company grows faster than its documentation, and the system is allowed to touch money. Haskell earns its keep there in several sturdy ways. It lets you pack operational knowledge into APIs, put dangerous machinery behind tight boundaries, and make the safe path the easy one. At a growing company, those aren’t just matters of taste; they are how you keep a system understandable after the people who first understood it have moved on.
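To make "pack operational knowledge into APIs" a little more concrete, here is a minimal sketch of the general technique (a smart constructor behind a narrow interface). It is not taken from Mercury's codebase: `Cents`, `AccountId`, `Transfer`, `mkCents`, and `mkTransfer` are all hypothetical names invented for illustration. It assumes a rule such as "a transfer moves a positive amount between two different accounts", which the types then carry so callers do not have to remember it.

```haskell
import Data.Int (Int64)

-- All names below are hypothetical; this is an illustration, not real production code.

-- | A strictly positive amount in cents. In a real module the export list
-- would expose the type but not the 'Cents' constructor, so 'mkCents'
-- would be the only way to obtain a value.
newtype Cents = Cents Int64
  deriving (Eq, Ord, Show)

newtype AccountId = AccountId Int64
  deriving (Eq, Show)

data TransferError
  = NonPositiveAmount Int64
  | SameAccount
  deriving (Eq, Show)

-- | Validate a raw integer once, at the boundary.
mkCents :: Int64 -> Either TransferError Cents
mkCents n
  | n > 0     = Right (Cents n)
  | otherwise = Left (NonPositiveAmount n)

unCents :: Cents -> Int64
unCents (Cents n) = n

-- | A transfer that has already passed validation.
data Transfer = Transfer
  { transferFrom   :: AccountId
  , transferTo     :: AccountId
  , transferAmount :: Cents
  } deriving (Eq, Show)

-- | The "safe path": the rules live here rather than in every caller.
mkTransfer :: AccountId -> AccountId -> Int64 -> Either TransferError Transfer
mkTransfer from to rawCents
  | from == to = Left SameAccount
  | otherwise  = Transfer from to <$> mkCents rawCents
```

Any function that accepts a `Transfer` can then assume the amount is positive and the accounts differ, and a newcomer who has never heard of the rule still cannot construct a value that violates it, provided the constructors stay unexported.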

Fast forward to today: I work at Mercury, a fintech company that provides banking services. We serve over 300,000 businesses. We processed $248 billion in transaction volume in 2025 on $650 million in annualized revenue, and are, at the time of writing, in the process of obtaining a national bank charter in the USA from the OCC. We have around 1,500 employees. Our engineering organization largely hires generalists, and most of them have never written a line of Haskell before joining.

My time working at Mercury has changed how I think about the language more than any sermon about purity ever did. Elegance is pleasant, but keeping your business alive is compulsory. Our codebase is roughly 2 million lines of Haskell, once you strip out comments and such. This is the part where you are supposed to recoil in horror. A couple million lines of Haskell, maintained by people who learned the language on the job, at a company that moves huge amounts of money? The conventional wisdom says this should be a disaster, but surprisingly, it isn’t.

The system we’ve built has worked well for years, through hypergrowth, through the SVB crisis that sent $2 billion in new deposits our way in five days, through regulatory examinations, through all the ordinary and extraordinary things that happen to a financial system at scale. This article is about why it works. Not in the “Haskell is beautiful” sense, though it is. Not in the “the compiler will save us from ourselves” sense, though I frequently feel gratitude in that direction. I mean in the much less romantic and much more useful sense that we run this language in production, at scale, with a rapidly changing team, and have learned some hard lessons about what it takes to keep the whole enterprise afloat.

The beauty of Haskell is charming enough, but there is a whole swath of operational and organizational reality beyond it, and if you ignore that reality for too long, your company will likely fire the whole Haskell team and start writing PHP or something instead.

How We Think About Reliability

Before diving into practical advice, a note on philosophy. There is a traditional way of thinking about system reliability that focuses on preventing failures. You enumerate the things that can go wrong. You add checks. You write tests for each bad case. You hunt for bugs. This is, of course, necessary work, and we do it. But it is not sufficient, and if you orient entirely around it you develop a specific blind spot: you get very good at cataloguing the ways things break and very bad at understanding why they ordinarily work.

We try to think about it differently. A system operates reliably because it can absorb variation: it degrades gracefully, its operators can understand and adjust it, and the architecture makes the right thing easy and the wrong thing difficult. Reliability is not just the absence of failure. It is the presence of adaptive capacity. It is a system’s ability to keep functioning while reality continues its longstanding and regrettable habit of refusing to hold still.
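As one small illustration of "degrades gracefully", here is a sketch that gives a slow dependency a time budget and serves a fallback when the budget runs out. It uses `timeout` from `System.Timeout` in `base`; everything else (`Served`, `withFallback`, `slowBalanceLookup`, and the cached value) is a hypothetical name made up for this example, not a description of how any real service is built.

```haskell
import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

-- | Records whether the caller got a live answer or a fallback,
-- so the degradation is visible in the type rather than hidden.
data Served a
  = Fresh a     -- the dependency answered within its budget
  | Degraded a  -- the budget ran out; this is the fallback value
  deriving (Eq, Show)

-- | Run an action with a time budget in microseconds. If it does not
-- finish in time, return the fallback instead of blocking the caller.
-- Note: this handles slowness only; exceptions thrown by the action
-- still propagate to the caller.
withFallback :: Int -> a -> IO a -> IO (Served a)
withFallback budgetMicros fallback action = do
  result <- timeout budgetMicros action
  pure (maybe (Degraded fallback) Fresh result)

-- | Hypothetical stand-in for a database call that has become slow.
slowBalanceLookup :: IO Int
slowBalanceLookup = do
  threadDelay 2000000  -- simulate two seconds of latency
  pure 1234

-- | Give the lookup 50 ms, falling back to a (hypothetical) cached value.
balanceOrCached :: IO (Served Int)
balanceOrCached = withFallback 50000 1000 slowBalanceLookup
```

Returning `Served a` instead of a bare `a` is a deliberate choice in this sketch: callers have to pattern match, which forces them to decide what a stale answer means for their own feature instead of silently treating it as fresh.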

When you have hundreds of engineers working in a multi-million-line codebase, many of whom are six months into their Haskell careers, “adaptive capacity” stops being a nifty phrase from a resilience engineering paper and starts being a daily concern. Patrick McKenzie has observed that in a company growing at 2x per year, half of your coworkers will always have less than a year of experience. A year later, half of your coworkers will still have less than a year of experience. For very successful companies, this never stops being true. You become organizationally ancient very quickly, whether you like it or not, and the things you know become institutional dark matter: load-bearing, but invisible to most of the people around you.

So the questions we ask are operational. Can the new hire on your team read this module and understand what it does? If the database is slow, does this service degrade or does it fall over and take its neighbors with it? If someone misuses an interface, does the compiler tell them, or do we find out when the on-call gets paged? If you don’t have answers to those questions, you…