Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

Abstract: Data is fundamental to large language models (LLMs). However, what makes certain data useful at different stages of an LLM workflow (training, tuning, alignment, in-context learning, and so on), and why, remains an open question.

Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute-intensive and offer no principled way to understand how specific data characteristics drive LLM behavior.

In this position paper, we advocate developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, so that these sequences reveal useful characteristics when used in one or more stages of the LLM workflow. We refer to such sequences as data probes.
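As a rough illustration of what generating a probe from a defined random process could look like, the sketch below samples a token sequence from a small first-order Markov chain. The choice of a Markov chain, the function name `sample_markov_probe`, and the transition matrix are all assumptions made for this example; the abstract does not specify which random processes the authors have in mind.

```python
import random

def sample_markov_probe(transition, length, start=0, seed=0):
    """Sample a token sequence from a first-order Markov chain.

    transition[i][j] is the probability of moving from state i to state j.
    The chain is an illustrative stand-in for a "defined random process";
    it is not taken from the paper.
    """
    rng = random.Random(seed)  # seeded so the probe is reproducible
    states = list(range(len(transition)))
    seq = [start]
    for _ in range(length - 1):
        current = seq[-1]
        # Draw the next state according to the current state's row.
        next_state = rng.choices(states, weights=transition[current], k=1)[0]
        seq.append(next_state)
    return seq

# Hypothetical 3-symbol process: each symbol tends to repeat itself.
TRANSITION = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

probe = sample_markov_probe(TRANSITION, length=32)
```

Because the generating process is fully known, properties such as the repetition rate can be varied deliberately (here, by changing the diagonal of `TRANSITION`) and the effect on a model observed in isolation.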

By observing LLM behavior on data probes, researchers can systematically study how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be analyzed through theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs.
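For readers unfamiliar with typical sets, the sketch below checks the classical (weak) typicality condition for an i.i.d. Bernoulli source: a sequence is ε-typical when its per-symbol surprisal, -(1/n) log2 p(x_1, ..., x_n), lies within ε of the source entropy H. This is only the textbook definition for a simple source, chosen here as an assumption for illustration; it is not the generalization to LLM behavior that the abstract refers to, and the function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Entropy in bits per symbol of a Bernoulli(p) source, 0 < p < 1."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def per_symbol_surprisal(seq, p):
    """Return -(1/n) * log2 of the probability of a 0/1 sequence under Bernoulli(p)."""
    ones = sum(seq)
    zeros = len(seq) - ones
    log_prob = ones * math.log2(p) + zeros * math.log2(1 - p)
    return -log_prob / len(seq)

def is_typical(seq, p, eps):
    """Weak typicality: per-symbol surprisal is within eps of the entropy."""
    return abs(per_symbol_surprisal(seq, p) - entropy_bits(p)) <= eps

# A sequence whose fraction of ones matches p is typical ...
balanced = [1] * 20 + [0] * 80
# ... while an all-ones sequence is far from typical for p = 0.2.
all_ones = [1] * 100

typical_balanced = is_typical(balanced, p=0.2, eps=0.05)
typical_all_ones = is_typical(all_ones, p=0.2, eps=0.05)
```

In a probing setup, the same comparison could in principle use a model's own per-token log-probabilities in place of the known source probabilities, though how the authors do this is not stated in the abstract.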

This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

Paper Details:

Authors: Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen, Mingyue Ji
arXiv ID: 2605.18801
Submission Date: 11 May 2026
