Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior.
We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, with explicit abstention expectations and contamination-risk metadata, and uses dual parsing: an official strict parser plus a normalized robustness parser.
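To make the dual-parsing idea concrete, the following is a minimal sketch of how a strict parser and a normalized robustness parser might differ. It is illustrative only and is not the benchmark's official parser: the output formats `ANSWER: <text>` and `ABSTAIN`, the function names `parse_strict` and `parse_normalized`, and the list of abstention phrases are all assumptions introduced for this example.

```python
import re
import string

# Hypothetical output conventions (assumed for illustration, not from the benchmark).
ABSTAIN_TOKEN = "ABSTAIN"
ABSTAIN_PHRASES = {"abstain", "i don't know", "i do not know", "unknown"}


def parse_strict(output: str):
    """Accept only the exact single-line formats 'ABSTAIN' or 'ANSWER: <text>'."""
    text = output.strip()
    if text == ABSTAIN_TOKEN:
        return ("abstain", None)
    # '.' does not match newlines, so multi-line outputs are rejected.
    match = re.fullmatch(r"ANSWER: (.+)", text)
    if match:
        return ("answer", match.group(1))
    return ("invalid", None)


def parse_normalized(output: str):
    """Lowercase, trim punctuation, and collapse whitespace before classifying."""
    text = output.strip().lower()
    text = text.strip(string.punctuation + string.whitespace)
    text = re.sub(r"\s+", " ", text)
    if not text:
        return ("invalid", None)
    if text in ABSTAIN_PHRASES:
        return ("abstain", None)
    # Tolerate a missing or alternative separator after the 'answer' prefix.
    match = re.match(r"answer\b\s*[:\-]?\s*(.+)", text)
    if match:
        return ("answer", match.group(1).strip())
    # Fall back to treating the whole normalized string as the answer.
    return ("answer", text)
```

Under these assumptions, an output such as `answer - Paris.` is invalid for the strict parser but is recovered as an answer by the normalized one. Comparing scores under both parsers is one way to check that conclusions do not hinge on formatting compliance alone.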
We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining.
Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions.
The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM behavior. The dataset is publicly available at this URL.