Privacy-Aware Infrastructure in the AI-Native Era: An Asset Classification Case Study
By Rituraj Kirti, Vasileios Lakafosis
Privacy controls (systems that enforce retention, access, allowed-purpose, downstream-sharing, or anonymization policies) require a reliable understanding of data to function. Before such a control can operate effectively, it must know exactly what it is looking at. This can be complex, as a field simply named “age” demonstrates: In one context, it might describe a person and require strict protections, while in another, it could be a numerical cache time-to-live (TTL) value in an infrastructure pipeline.
Figure 1: One column name, two governance outcomes. The identical field age is personal data when it describes a person, but ordinary system metadata when it is a cache TTL, which is why a name alone cannot determine the privacy requirement.
This is the everyday problem behind privacy-aware infrastructure (PAI): The inputs are noisy and probabilistic, but the outputs need to be precise enough to drive enforcement. AI-native products make that problem harder. They introduce new data modalities, faster iteration cycles, derived features, embeddings, multimodal inputs, and changing policy interpretations. Manual review remains important for judgment and accountability, but it cannot keep up with the volume and pace of change.
At Meta, we apply a hybrid pattern for asset classification at scale: Build a rich context before asking a model to reason. Use LLMs to handle ambiguity, cold start, and novelty. Keep human-reviewed labels separate from model-generated recommendations. Distill stable behavior into deterministic, versioned rules for routine enforcement.
The end goal is not “LLMs everywhere.” Instead, it is a system that can learn from ambiguous signals while moving production enforcement toward logic that is low latency, replayable, and easier to audit. In the common case, the LLM does not make the production decision; deterministic rules do. We use LLMs deliberately and narrowly: to interpret novel or ambiguous assets, and then to distill what they learn into versioned, human-reviewed deterministic rules. This steadily shrinks the LLM’s role in production over time.
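The division of labor described above can be sketched as a rule-first router. This is a minimal illustration under stated assumptions, not Meta's implementation: the `Rule` and `Decision` types, the label strings, the `pipeline_kind` field, and the stubbed `stub_llm` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    """A deterministic, versioned rule (hypothetical shape)."""
    rule_id: str
    version: int
    matches: Callable[[dict], bool]
    label: str


@dataclass(frozen=True)
class Decision:
    label: str
    source: str                     # "rule" or "llm_recommendation"
    enforceable: bool               # in this sketch, only rule decisions drive enforcement
    rule_ref: Optional[str] = None  # which rule version produced the decision


def classify(asset: dict, rules: list[Rule],
             llm_fallback: Callable[[dict], str]) -> Decision:
    # Common case: a reviewed rule matches, so the decision is replayable.
    for rule in rules:
        if rule.matches(asset):
            return Decision(rule.label, "rule", True,
                            f"{rule.rule_id}@v{rule.version}")
    # Novel or ambiguous asset: the model only recommends; it does not enforce.
    return Decision(llm_fallback(asset), "llm_recommendation", False)


# Hypothetical rule: "age" inside a cache pipeline is system metadata.
RULES = [
    Rule("cache-ttl-age", 3,
         lambda a: a["name"] == "age" and a.get("pipeline_kind") == "cache",
         "system_metadata"),
]


def stub_llm(asset: dict) -> str:
    # Stand-in for a real model call; always defers to review in this sketch.
    return "needs_review"


ttl = classify({"name": "age", "pipeline_kind": "cache"}, RULES, stub_llm)
novel = classify({"name": "age", "pipeline_kind": "profile"}, RULES, stub_llm)
print(ttl.source, ttl.rule_ref)        # decision came from a versioned rule
print(novel.source, novel.enforceable)  # recommendation only, not enforced
```

Recording the rule identifier and version with each decision is one way to make a decision replayable later; keeping `enforceable` false for model output mirrors the separation between recommendations and reviewed logic.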
Humans stay in the loop where it matters most. People adjudicate the reviewed reference labels, and they review and approve rule promotions that could change how protection is enforced.
PAI addresses four operational concerns: understanding what data exists and how it is governed; discovering which data flows are relevant to a policy question; enforcing retention, access, purpose, and sharing constraints; and demonstrating compliance through verifiable evidence.
Asset classification sits at the understand layer. It provides the foundation that every downstream concern depends on.
Figure 2: The privacy-aware infrastructure stack is a dependency pyramid: Each capability rests on the one below it. Understand (classifying what the data actually is) is the load-bearing base. If it is wrong, everything above (discover, enforce, demonstrate) inherits the error.
Why Asset Classification Matters
Asset classification is the foundation for many privacy controls. Before a system can enforce retention, access, allowed-purpose, downstream-sharing, or anonymization policies, it needs a reliable view of what the asset is and how it should be governed. An asset can be more than a table or column. It can be a nested field inside a payload, a log key, an event parameter, an API field, a machine learning (ML) feature, an embedding, or a derived dataset produced by an intermediate pipeline.
That breadth matters because AI-native systems often transform data across many representations. A single source signal can move through pipelines, become a feature, appear in a model-training workflow, or be joined with other derived signals. Classification has to follow the meaning of the data, not just its shape.
There are four recurring challenges. First, signals are noisy and weak. Dozens of context fields are fetched per asset, which forces the model to rediscover what matters each time. High token usage dilutes attention, and decision boundaries get buried in irrelevant or misleading fields. A field called age in a caching pipeline is a concrete example: Without code resolution and lineage analysis, a classifier can trigger false restrictions on the entire pipeline.
Second, the relevant context is distributed. Code, lineage, ownership, semantic annotations, documentation, and usage patterns often live in different systems. A good classifier needs to assemble that context before making a decision.
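One way to picture the assembly step is a function that queries each source system and merges the results into a single record while tracking which sources returned nothing. The sketch below is illustrative only: the source names, the fetcher functions, and the returned fields are assumptions made for the example, not actual systems or schemas.

```python
from typing import Callable

# A fetcher takes an asset identifier and returns that source's signals.
Fetcher = Callable[[str], dict]


def assemble_context(asset_id: str, fetchers: dict[str, Fetcher]) -> dict:
    """Merge per-source signals into one record, keyed by source for provenance."""
    context = {"asset_id": asset_id, "signals": {}, "missing": []}
    for source, fetch in fetchers.items():
        try:
            signals = fetch(asset_id)
        except LookupError:
            # The source has no entry for this asset.
            signals = {}
        if signals:
            context["signals"][source] = signals
        else:
            # Record the gap explicitly so absence is visible downstream.
            context["missing"].append(source)
    return context


# Hypothetical in-memory stand-ins for separate metadata systems.
FETCHERS = {
    "lineage": lambda asset_id: {"upstream": ["cache_config.ttl_seconds"]},
    "ownership": lambda asset_id: {"team": "infra-caching"},
    "documentation": lambda asset_id: {},
}

ctx = assemble_context("cache_pipeline.age", FETCHERS)
print(ctx["signals"])
print(ctx["missing"])
```

Keeping signals grouped by source preserves where each piece of evidence came from, and listing missing sources lets a downstream step distinguish "no evidence found" from "not checked".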
Third, requirements evolve. Product teams move quickly, and policy interpretation can change as new product capabilities appear. A static rule set or periodic manual review process can leave gaps between reviews.
Fourth, classification is only useful if it feeds enforcement. A false positive can trigger unnecessary restrictions downstream. A false negative can leave a protection gap. The classifier sits near the front of the enforcement pipeline, so its error profile affects every system that depends on it. This creates the central tension: Classification needs to reason under ambiguity, but enforcement needs decisions that can be explained and reproduced later.
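The two error types can be tracked separately by comparing classifier output against human-reviewed labels. The following is a small sketch under that assumption; the boolean "sensitive" framing and the sample data are invented for illustration and are not measurements from the system described here.

```python
def error_profile(pairs):
    """Compute error rates from (predicted_sensitive, reviewed_sensitive) booleans."""
    fp = fn = pos = neg = 0
    for predicted, reviewed in pairs:
        if reviewed:
            pos += 1
            fn += not predicted   # missed a sensitive asset: protection gap
        else:
            neg += 1
            fp += predicted       # flagged a benign asset: unnecessary restriction
    return {
        "false_positive_rate": fp / neg if neg else 0.0,
        "false_negative_rate": fn / pos if pos else 0.0,
    }


# Invented sample: three reviewed-sensitive assets, two reviewed-benign assets.
sample = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (True, True),
]
profile = error_profile(sample)
print(profile)
```

Reporting the two rates separately matters because they have different downstream costs: one blocks legitimate pipelines, the other leaves data under-protected.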
Figure 3: Four distinct difficulties (context dependence, sparse signal, a heavy long tail, and constant schema drift) all collapse into a single tension: Classification wants to reason under ambiguity, while enforcement demands results it can explain and reproduce. The whole design exists to hold these two in balance.
The Pattern
Our approach is built around three principles that emerged from building and operating the system. First, context beats prompts. Most classification failures were not caused by weak instructions; they were caused by weak or missing evidence. Hours of prompt optimization produced only marginal improvement when the model was reasoning over raw, noisy fields. Structuring context into evidence briefs, with supporting signals, contradicting signals, provenance, and…