BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored.

We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation.

We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification.
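
For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes such as Escalation count as much as frequent ones. The following is a minimal sketch in Python, not the authors' evaluation code. The function name `macro_f1`, the `LABELS` list, and the choice to score a class with no gold or predicted examples as 0 are assumptions made for illustration. The sketch returns a value in [0, 1]; the figures reported above are presumably on a 0-100 scale.

```python
from collections import Counter

# Five-level taxonomy named in the abstract; the ordering here is illustrative.
LABELS = ["Invalidation", "Neutral", "Support", "Validation", "Escalation"]


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 over `labels` (range 0 to 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1  # correct prediction for class g
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true class but was missed

    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumption: a class with no gold and no predicted examples scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


if __name__ == "__main__":
    gold = ["Support", "Support", "Validation", "Neutral"]
    pred = ["Support", "Validation", "Validation", "Neutral"]
    print(f"Macro-F1: {macro_f1(gold, pred):.4f}")
```

The same function covers the binary detection setting by passing a two-element `labels` list. Because every class has equal weight, a model that rarely predicts one class correctly is penalized heavily, which is one reason Macro-F1 suits imbalanced label distributions.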

In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
