Noise infusion banned from statistical products published by Census Bureau

Last week, the United States Department of Commerce issued an order declaring that “noise infusion” will be banned from all statistical products published by the Census Bureau and the Bureau of Economic Analysis. What does it mean, and why should you care?

Context: Statistical products are a bunch of numbers published from a secret dataset. Often, that dataset contains confidential information, and it is important that the numbers don’t reveal that information. The U.S. Census is a well-known example: the statistics are made public, but the contents of each form filled out by individual U.S. residents must stay secret.

Scientists have developed a number of techniques that can be used to publish useful statistics while protecting the privacy of the original data. This field is called disclosure avoidance in statistical communities. Here are a few of these techniques:

  • Suppression: removing data that doesn’t pass certain thresholds (e.g. if a count of people is below 5, we don’t publish it).
  • Coarsening (or generalization): making data attributes less precise (e.g. transform a county into its state, a date of birth into an age range, etc.).
  • Sampling: randomly removing some records from the dataset.
  • Swapping: taking attributes from different records and exchanging them randomly.
  • Contribution bounding: making sure that a single individual cannot contribute “too much” to a statistic by limiting their maximum impact.
  • Noise addition: adding a random number to statistics to hide their true value.
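
To make a few of these concrete, here is a minimal Python sketch of coarsening, suppression, and noise addition applied to a toy table of counts. Everything in it is made up for illustration: the records, the helper names (`coarsen_age`, `suppress`, `add_noise`), the threshold of 5, and the choice of Laplace noise. It is not how any statistical agency actually implements these steps.

```python
import random
from collections import Counter

# Hypothetical toy records: (county, age). Not real data.
records = [
    ("Adams", 34), ("Adams", 36), ("Adams", 71),
    ("Baker", 29), ("Baker", 30), ("Baker", 31),
    ("Baker", 33), ("Baker", 38), ("Baker", 64),
]

def coarsen_age(age, width=20):
    """Coarsening: replace an exact age with a range such as '20-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(counts, threshold=5):
    """Suppression: drop every cell whose count is below the threshold."""
    return {cell: n for cell, n in counts.items() if n >= threshold}

def add_noise(counts, scale=1.0, rng=None):
    """Noise addition: perturb each count with Laplace-distributed noise."""
    rng = rng or random.Random()
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return {
        cell: n + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for cell, n in counts.items()
    }

# Count people per (county, age range) after coarsening the ages.
coarse_counts = Counter((county, coarsen_age(age)) for county, age in records)

# Option A: publish only the cells with at least 5 people.
published = suppress(coarse_counts, threshold=5)

# Option B: publish every cell, but with random noise added.
noisy = add_noise(coarse_counts, scale=1.0, rng=random.Random(0))

print(published)
print(noisy)
```

With this toy data, suppression leaves a single publishable cell, while noise addition keeps all four cells at the cost of exactness.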

Some of these techniques, when combined, satisfy a definition called differential privacy. This definition has a lot of nice fundamental properties and is widely considered the gold standard of privacy protection among scientists. To achieve it, scientists typically rely on a combination of contribution bounding and carefully calibrated noise addition.
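
As a rough illustration of that combination, here is a sketch of a differentially private sum using the textbook Laplace mechanism. The function names (`bounded_totals`, `dp_sum`), the `cap` and `epsilon` parameters, and the example data are all assumptions made for this example. It also assumes that neighboring datasets differ by adding or removing one person, so that capping each person's total at `cap` bounds the sensitivity of the sum.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def bounded_totals(values_by_person, cap):
    """Contribution bounding: clamp each person's total to [0, cap]."""
    return [min(max(sum(vals), 0.0), cap) for vals in values_by_person.values()]

def dp_sum(values_by_person, cap, epsilon, rng=None):
    """Noisy sum: bound contributions, then add noise calibrated to the bound."""
    rng = rng or random.Random()
    bounded = bounded_totals(values_by_person, cap)
    # Adding or removing one person changes the bounded sum by at most `cap`,
    # so Laplace noise with scale cap / epsilon is the standard calibration.
    return sum(bounded) + laplace_noise(cap / epsilon, rng)

# Hypothetical data: person -> list of contributions.
data = {"alice": [10.0, 20.0], "bob": [500.0], "carol": [5.0]}
print(dp_sum(data, cap=50.0, epsilon=1.0, rng=random.Random(42)))
```

Note how the bound does real work here: without it, bob's single large value would force the noise to be scaled to 500 instead of 50.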

From 1990 to 2010, the U.S. Census Bureau primarily relied on swapping for the decennial census. Then, they realized that this technique was actually very unsafe, and that it was pretty easy to reconstruct individual records using the published statistics. This is bad, because the Bureau is required by federal law to keep these records confidential.
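
To give a feel for what “reconstruction” means, here is a deliberately tiny toy example. The published statistics below are invented, and the brute-force search is only a stand-in: a real attack would involve many more tables and a proper solver. The idea is the same, though: find every set of records consistent with what was published, and see how few are left.

```python
from itertools import combinations_with_replacement

# Hypothetical exact statistics published for one very small block.
COUNT = 3        # number of residents
MEAN_AGE = 40    # mean age
MEDIAN_AGE = 35  # median age
MAX_AGE = 70     # age of the oldest resident

# Enumerate every sorted triple of ages and keep the ones that match
# all published statistics.
candidates = [
    ages
    for ages in combinations_with_replacement(range(0, 116), COUNT)
    if sum(ages) == MEAN_AGE * COUNT
    and ages[1] == MEDIAN_AGE
    and ages[-1] == MAX_AGE
]

print(candidates)  # → [(15, 35, 70)]
```

Only one combination of ages fits, so these four innocuous-looking numbers reveal the exact age of every resident in the block.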

So they tried a few alternative approaches, and decided to adopt differential privacy for the 2020 Census: this was the one that kept the statistics most useful, while preventing these attacks. It bears repeating: differential privacy wasn’t chosen because the math was nice and compelling. It was selected because among the different options that mitigated the attack, it was the one that preserved the most utility.

Its exact privacy parameters were chosen not because they provided rock-solid provable guarantees, but because they squeezed the most usefulness out of the data while reaching an acceptable level of privacy protection. Sadly, “preserved the most utility under newly-discovered privacy constraints” did not mean “preserved as much utility as the 2010 Census”: the numbers got less accurate, and the inaccuracies got a lot more transparent, and therefore impossible to ignore.

This made a number of people very angry. Demographers and social scientists could no longer ignore that the data they were working with was noisy data. This required a major shift in how they conceptualized and worked with this data. People who were using Census data to actually reconstruct records could no longer do so. Demographers admitted that this was common practice. It’s also an open secret that this was done by political operatives as part of gerrymandering efforts.

Phew, that was a lot of context. What does the order say? The administration has now decided that noise infusion is no longer an acceptable disclosure avoidance technique. The order clearly targets differential privacy, but also seems to impact other techniques that involve randomness: the text explicitly mentions that coarsening should always be preferred, falling back to suppression as a “last resort”.

I have no idea why the order is so specific. Maybe they wanted to make sure the scientists working at the U.S. Census couldn’t still use similar techniques without calling them differential privacy? The order also carefully says it “shall not be interpreted to conflict with any constitutional, statutory, regulatory, or other legal provision”. So the confidentiality obligations surrounding these statistical products still apply.

What will it mean in practice? The consequences will be dire for utility or for privacy, and possibly both. It’s hard to overstate this point: future statistical releases will either be useless compared to past ones, or they will be incredibly unsafe.

For starters, taking away useful tools from the disclosure avoidance toolbox will always lead to more painful privacy/utility trade-offs. The whole point of this research field is to better understand and quantify privacy risk, and develop better tools to mitigate this risk while preserving utility. For statistical releases, differential privacy is simply the best tool we have right now. It provides a finer way of quantifying trade-offs, and allows us to get more utility out of the data than competing techniques at similar privacy levels.
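
To illustrate what “quantifying trade-offs” can look like, here is a small sketch assuming the textbook Laplace mechanism applied to a single count, where one person changes the count by at most 1. Under that assumption, the privacy parameter ε directly sets the expected error: the noise has scale 1/ε, and its expected absolute value is also 1/ε. The function names are made up for this example, and the actual 2020 Census system is far more elaborate than this.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def expected_abs_error(epsilon, sensitivity=1.0):
    """Closed form: the mean absolute value of Laplace(0, b) is b."""
    return sensitivity / epsilon

def simulated_abs_error(epsilon, sensitivity=1.0, trials=20000, seed=0):
    """Empirical check of the closed form by repeated sampling."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

# Smaller epsilon means stronger privacy and larger error, and vice versa.
for epsilon in (0.1, 0.5, 1.0, 2.0):
    print(
        f"epsilon={epsilon}: expected error {expected_abs_error(epsilon):.2f}, "
        f"simulated {simulated_abs_error(epsilon):.2f}"
    )
```

The point is not the specific numbers, but that the dial exists at all: you can pick a point on the privacy/utility curve and know what you are paying for it.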

If you take it away, you’re left with techniques that either have worse utility at similar levels of privacy, or worse privacy for the same utility. But all competing techniques also rely on noise addition. The Cell Key method, used at other statistical agencies, adds noise to statistics. Swapping, used from 1990 to 2010 for the U.S. Census, also injects randomness into the process. Sampling is everywhere in statistical work. Hell, even imputation technically adds noise to the data! By contrast, coarsening and suppression are very blunt instruments. They only work in situations where the statistics are already very coarse, and not too many of them are published.