BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior.

We introduce \textsc{BehaviorBench}, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. \textsc{BehaviorBench} reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: \emph{Belief prediction}, which predicts a user’s final revealed stance and confidence in a market, and \emph{Trade prediction}, which predicts the direction and amount of individual transactions.

Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence.

We find that personalization improves Belief prediction more consistently than Trade prediction, that model rankings change across task layers and metrics, and that different history interfaces expose different failure modes. \textsc{BehaviorBench} thus provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.
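
To make the two task layers and the four history interfaces described above concrete, the following is a minimal Python sketch of how an evaluation harness might represent them. Every name here (`HistoryInterface`, `BeliefInstance`, `TradeInstance`, `build_context`, the field names, the label values, and the cutoff `k`) is a hypothetical illustration and not the benchmark's actual schema or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence


class HistoryInterface(Enum):
    """The four history interfaces named in the abstract (identifiers are assumed)."""
    NONE = "no_personalization"
    RECENT_HISTORY = "direct_recent_history"
    PROFILE = "generated_user_profile"
    RETRIEVED = "retrieved_support_wallet_evidence"


@dataclass(frozen=True)
class BeliefInstance:
    """Belief layer: a wallet's final revealed stance and confidence in a market."""
    wallet: str
    market: str
    stance: str        # hypothetical label set, e.g. "YES" / "NO"
    confidence: float  # hypothetical scale, e.g. a value in [0, 1]


@dataclass(frozen=True)
class TradeInstance:
    """Trade layer: direction and amount of one individual transaction."""
    wallet: str
    market: str
    direction: str  # hypothetical label set, e.g. "BUY" / "SELL"
    amount: float


def build_context(
    interface: HistoryInterface,
    recent_history: Sequence[str],
    profile: Optional[str],
    retrieved_evidence: Sequence[str],
    k: int = 5,
) -> List[str]:
    """Select which evidence a model sees under a given interface.

    Assumes k is a positive cutoff on the number of evidence items.
    """
    if interface is HistoryInterface.NONE:
        return []  # no personalization: the model sees no user evidence
    if interface is HistoryInterface.RECENT_HISTORY:
        return list(recent_history[-k:])  # the k most recent own actions
    if interface is HistoryInterface.PROFILE:
        return [profile] if profile else []  # one generated profile text
    if interface is HistoryInterface.RETRIEVED:
        return list(retrieved_evidence[:k])  # top-k support-wallet items
    raise ValueError(f"unknown interface: {interface}")
```

Keeping the interface as an explicit enum makes it straightforward to run the same model over the same instances under each of the four conditions and compare the results per task layer.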
