Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

Reliable evaluation of large language models should separate supported answering from unsupported guessing, without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior.

We present Know2Guess, a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, with explicit abstention expectations, contamination-risk metadata, and dual parsing that pairs an official strict parser with a normalized robustness parser.
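
As a rough illustration of the dual-parsing idea, the sketch below contrasts a strict parser with a normalized one. It is not the benchmark's actual implementation: the abstention string `ABSTAIN_TOKEN`, the pattern list, and both function names are hypothetical, chosen only to show how a strict exact-match rule and a more forgiving normalized rule can disagree on the same output.

```python
import re
import string

# Hypothetical abstention string; the benchmark's real token is not stated here.
ABSTAIN_TOKEN = "I don't know"

# Hypothetical normalized prefixes treated as abstentions by the lenient parser.
ABSTAIN_PREFIXES = ("i don't know", "i do not know", "unknown", "cannot answer")

# Delete all ASCII punctuation except the apostrophe, which "don't" needs.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def strict_parse(output: str) -> str:
    """Label an output 'abstain' only if it exactly matches the abstention string."""
    return "abstain" if output.strip() == ABSTAIN_TOKEN else "answer"


def normalized_parse(output: str) -> str:
    """Label an output after lowercasing, stripping punctuation, and collapsing spaces."""
    text = output.lower().replace("\u2019", "'")  # fold curly apostrophes
    text = text.translate(_PUNCT_TABLE)
    text = re.sub(r"\s+", " ", text).strip()
    return "abstain" if text.startswith(ABSTAIN_PREFIXES) else "answer"


if __name__ == "__main__":
    for sample in ["I don't know", "i do not know.", "Paris"]:
        print(repr(sample), strict_parse(sample), normalized_parse(sample))
```

Comparing the two labels per item gives a simple way to check whether conclusions depend on surface formatting of the model output rather than on its content.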

We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: the FLAN-T5 baselines remain weak on productive abstention, while stronger instruction-tuned models show a selective but incomplete transition from answering to abstaining.

Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and refusal of benign items persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions.
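
The abstract does not say which calibration measure is used. As one common choice, the sketch below computes expected calibration error (ECE) with equal-width confidence bins; the function name, the bin count, and the assumption that each item has a scalar confidence in [0, 1] are illustrative and not taken from the benchmark.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A value near zero means stated confidence tracks observed accuracy; larger values indicate over- or under-confidence.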

The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM behavior. The dataset is publicly available at this URL.