How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

Abstract: Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood.

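The DP-SGD training referred to below enforces this per-example limit by clipping each example's gradient and adding calibrated Gaussian noise. The following is a minimal illustrative sketch of one such step, assuming a small PyTorch model and loss function; the function and parameter names are ours, not the paper's training code.

```python
# Minimal sketch of one DP-SGD step (illustrative; not the paper's implementation).
# Each example's gradient is clipped to norm bound `clip_norm`, the clipped
# gradients are summed, Gaussian noise scaled by `noise_multiplier * clip_norm`
# is added, and the averaged result is applied as the update.
import torch

def dp_sgd_step(model, loss_fn, batch_inputs, batch_targets,
                lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # Compute and clip the gradient of each example individually.
    for x, y in zip(batch_inputs, batch_targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for acc, g in zip(summed_grads, grads):
            acc += g * scale

    # Add noise calibrated to the clipping bound, then apply the averaged update.
    batch_size = len(batch_inputs)
    with torch.no_grad():
        for p, acc in zip(params, summed_grads):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p -= lr * (acc + noise) / batch_size
```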

To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing the DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering.


We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias.

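Sentence-scoring bias of this kind is typically probed by comparing the likelihood a model assigns to minimally contrasting sentence pairs. Below is a minimal sketch using Hugging Face Transformers; the checkpoint name and example pair are placeholders, not the models or benchmark items evaluated in the paper.

```python
# Illustrative likelihood-comparison bias probe (CrowS-Pairs-style sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_log_likelihood(model, tokenizer, sentence):
    """Approximate total log-likelihood of a sentence under a causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean next-token NLL; scale by the number of predicted tokens.
    return -out.loss.item() * (ids.size(1) - 1)

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder checkpoint
lm = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint
lm.eval()

stereo = "The nurse said she would be late."
anti = "The nurse said he would be late."

# The model counts as biased on this pair if it prefers the stereotypical
# variant; aggregating over a benchmark of such pairs gives a bias rate,
# which can then be compared between DP and non-DP models.
prefers_stereotype = (
    sentence_log_likelihood(lm, tok, stereo)
    > sentence_log_likelihood(lm, tok, anti)
)
print(prefers_stereotype)
```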

Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
