How to Mathematically Choose the Optimal Bins for Your Histogram


Optimal resolution in histograms: a rigorous Bayesian approach to density fitting

Fetze Pijlman | May 23, 2026 | 10 min read

Left: a standard density with fixed resolution. Top right: a density with adaptive resolution. Bottom right: a density with adaptive, unequal-width bins that includes uncertainty quantification.

Abstract


Have you ever wondered how to choose the bins of a histogram? Have you ever asked yourself whether there are deeper reasons for a choice than that it simply looks nice? Histograms are the most fundamental tool for data visualization, and setting their resolution matters, especially when the histogram itself feeds into further analysis.

Histograms are often computed to visualize the density of the data. In this post, we explore the mathematics of density fitting, looking specifically at how bins should shrink as the dataset grows. Inspired by adjacent fields such as perturbation theory in physics and Taylor expansions in mathematics, we will arrive at a rigorous method for constructing densities.

All images are by the author

Background: Approximations


The intuition is simple: the more data you have, the more detail you should be able to see. If you are looking at a sample of ten observations, two or three wide bins are likely all you can afford before your visualization becomes a sparse collection of empty gaps. But if you have ten million observations, those wide bins start to feel like a low-resolution, pixelated photograph. You want to “zoom in” by increasing the number of bins. The question, however, is: how exactly should we scale this resolution?

In physics, when we face a system that is too complex to solve exactly, we often turn to Perturbation Theory. In Quantum Electrodynamics (QED), for example, we approximate complex interactions by expanding them in powers of a small coupling constant, such as the electron charge $e$. This “interaction strength” provides a natural hierarchy for our approximations. But for a histogram, what is the analogous “charge”? Is there a fundamental parameter that governs the interaction between our discrete data points and the underlying distribution we are trying to estimate?

Mathematics offers another path: the Taylor Expansion. If we assume the underlying density function is sufficiently smooth (analytic), we can describe it locally using its derivatives. This feels like a promising lead, because the higher-order terms can be shown to vanish. Yet even if we accept the restriction to analytic distributions, it is not clear how this leads to a particular bin size.

Alternatively, we might treat the problem as an Expansion in Basis Functions. Just as we can represent a piecewise continuous function using a Fourier series or Legendre polynomials, we could view histogram bins as a set of basis functions. With such an approach we could approximate the function in the $L^2$ sense. But this approach introduces its own set of hurdles. How do we compute the coefficients for these functions efficiently? And more importantly, how do we satisfy the constraints of a probability density function? Unlike a general Fourier series, a density function must be non-negative everywhere and normalized to one. We will see in the following that the method obtained from information theory has aspects in common with expanding in basis functions.
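To make the basis-function view concrete, here is a short worked sketch for the piecewise-constant case. The bin edges $a_k$, widths $h_k$, basis functions $\phi_k$ and coefficients $c_k$ are notation introduced here for illustration and are not taken from the original text. Take bins $[a_k, a_{k+1})$ of width $h_k = a_{k+1} - a_k$ and let $\phi_k(x)$ equal 1 inside bin $k$ and 0 elsewhere. These indicators are orthogonal:

$\int \phi_j(x)\,\phi_k(x)\,dx = h_k\,\delta_{jk}$

so the best approximation in the $L^2$ sense, $f(x) \approx \sum_k c_k\,\phi_k(x)$, has coefficients equal to the bin averages of $f$:

$c_k = \frac{1}{h_k}\int_{a_k}^{a_{k+1}} f(x)\,dx$

The two density constraints then read:

$c_k \ge 0, \qquad \sum_k c_k\,h_k = 1$

For this particular basis both hold automatically whenever $f$ is itself a density supported on the bins, which is not guaranteed for a truncated Fourier or polynomial expansion.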

Information Theory: Priors & Posteriors


For an introduction to Bayesian statistics or information theory, see (Murphy, 2022). In a Bayesian approach, a model $P(X|\theta)$, where $X$ denotes the observables we want to model and $\theta$ the parameters, comes with a prior distribution $P(\theta|\mathcal{M})$ that reflects our belief about the parameters before any data is observed. Once the data has been observed, we can compute the posterior distribution $P(\theta|X)$:

$P(\theta|X) = P(X|\theta)P(\theta|\mathcal{M})/P(X)$
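As a minimal sketch of this update, the snippet below evaluates Bayes' rule on a grid for a toy problem. Everything specific in it is an assumption made for illustration: the coin-flip model, the flat prior, and the made-up counts of 7 heads and 3 tails are not part of the original article.

```python
import numpy as np

# Hypothetical toy model: theta is the probability of heads for a coin.
theta = np.linspace(0.0, 1.0, 101)               # grid of parameter values
prior = np.full(theta.size, 1.0 / theta.size)    # flat prior P(theta | M), fixed before seeing data

heads, tails = 7, 3                              # made-up observations X
likelihood = theta**heads * (1.0 - theta)**tails # P(X | theta), up to a constant factor

evidence = np.sum(likelihood * prior)            # P(X) on the grid (same constant factor)
posterior = likelihood * prior / evidence        # Bayes' rule; the constant factor cancels

theta_map = theta[np.argmax(posterior)]          # most probable grid value of theta
posterior_mean = np.sum(theta * posterior)       # posterior expectation of theta

print(theta_map, posterior_mean)
```

The prior is written down before the data enters the calculation, which mirrors the discipline described in the next paragraph.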

This procedure is mathematically elegant because it guards against overfitting by construction: the posterior follows from the model and the prior through the rules of probability alone. However, it demands strict discipline: we are not allowed to choose our model or prior after having seen the data. If we use the data to decide which model structure to use, we break the underlying logic of the inference.

The most-likely model given the data versus model weighting


The quality of a model can be quantified by its surprisal (see, e.g., Vries, 2026):

$\log P(X|\mathcal{M}) = -\text{surprisal} = \text{accuracy} - \text{complexity}$
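One standard way to make the two terms explicit is the following identity, which holds exactly when the expectation is taken under the true posterior. It is offered as a sketch; the cited reference may define the terms in a slightly different form. The symbol $D_{\mathrm{KL}}$ (the Kullback-Leibler divergence) is notation introduced here.

$\log P(X|\mathcal{M}) = \mathbb{E}_{P(\theta|X,\mathcal{M})}\left[\log P(X|\theta)\right] - D_{\mathrm{KL}}\left(P(\theta|X,\mathcal{M}) \,\|\, P(\theta|\mathcal{M})\right)$

The first term is the accuracy: the expected log-likelihood of the data under the posterior. The second is the complexity: how far the posterior had to move away from the prior to explain the data. The identity follows from Bayes' rule, since $P(X|\theta)\,P(\theta|\mathcal{M}) / P(\theta|X,\mathcal{M}) = P(X|\mathcal{M})$ for every $\theta$.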

Models with an excessive number of parameters (one may be tempted to include all kinds of hypothetical interactions) may achieve incredible accuracy, but they are “killed” by the penalty for their own complexity. The ideal model isn’t the most detailed one; it is the one that captures the most information with the least unnecessary baggage.

When considering a set of models, one can compute the posterior probability of each model relative to the other models under consideration:

$P(\mathcal{M}_i | X) \propto P(X | \mathcal{M}_i) P(\mathcal{M}_i)$

It is tempting to simply pick the model with the highest probability and move on. But this “winner-takes-all” approach carries risks:

Statistical Fluctuations: The data $X$ might contain a random fluke that makes a sub-optimal model look temporarily superior.
The Weight of the Crowd: Sometimes, the sum of many “less likely” models actually outweighs the probability of the single “best” model.

Because of this, a more robust path is to carry all models forward, weighting each by its probability. It is important to note that this is not a “mixture” of different truths; we still assume only one model is actually true, but we use the full distribution of possibilities to account for our own uncertainty.
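The weighting described above can be sketched in a few lines. The helper `model_weights`, the log-evidence values, the equal model priors, and the per-model predictions are all hypothetical and chosen only to illustrate the idea; they are not the author's implementation.

```python
import numpy as np

def model_weights(log_evidence, log_prior=None):
    """Posterior model probabilities P(M_i | X) from log evidences log P(X | M_i)."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        # Assumption: every candidate model gets the same prior probability P(M_i).
        log_prior = np.zeros_like(log_evidence)
    log_post = log_evidence + np.asarray(log_prior, dtype=float)
    log_post = log_post - log_post.max()   # subtract the maximum for numerical stability
    weights = np.exp(log_post)
    return weights / weights.sum()         # normalise so the weights sum to one

# Made-up log evidences for four candidate models.
log_evidence = [-100.0, -100.5, -100.7, -101.0]
weights = model_weights(log_evidence)

# Hypothetical predictions: each row holds one model's probabilities for three outcomes.
predictions = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
    [0.3, 0.3, 0.4],
])
averaged = weights @ predictions           # model-averaged prediction

print(weights.round(3), averaged.round(3))
```

With these numbers the first model is the single most probable one, yet it holds less than half of the total weight, so the remaining three models together outweigh it. This is the “weight of the crowd” situation in miniature.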

Densities


A density using a Bayesian approach: To treat a density as a formal model, we view each of its $K$ bins as a parameter. Specifically, we assign a weight $w_k$ to each bin, representing the probability of a data point falling into that interval. Because the total proba…