PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

Abstract: Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work focuses primarily on automatically discovering agent failures induced by adversarial users, while overlooking queries that reflect real user intents and also trigger agent failures.

We introduce PQR, a framework that generates queries that not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also reflect real users' intents. PQR operates through an iterative interaction between two complementary modules.

The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries.
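The interaction between the two modules can be pictured as a simple loop. The following is a minimal sketch only, not the authors' implementation: the names `Feedback`, `PromptState`, `refine_prompt`, and `pqr_loop`, the `rewrite` and `evaluate` callables, and the `rounds` and `pool_size` parameters are all hypothetical, and the abstract does not specify how either module is realized (presumably both are LLM-driven).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Outcome of sending one query to the agent under test (hypothetical shape)."""
    query: str
    violates_objective: bool  # e.g. the agent's response was judged unhelpful
    is_realistic: bool        # the query was judged to resemble a real user's


@dataclass
class PromptState:
    """What the prompt refinement module accumulates across rounds (assumed)."""
    strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def refine_prompt(state: PromptState, feedback: List[Feedback]) -> PromptState:
    """Stand-in for prompt refinement: turn prior feedback into new
    objective-violating strategies and realism policies."""
    for fb in feedback:
        if fb.violates_objective and fb.is_realistic:
            strategy = f"imitate: {fb.query}"
            if strategy not in state.strategies:
                state.strategies.append(strategy)
        elif not fb.is_realistic:
            policy = f"avoid phrasing like: {fb.query}"
            if policy not in state.realism_policies:
                state.realism_policies.append(policy)
    return state


def pqr_loop(
    seed_queries: List[str],
    rewrite: Callable[[str, PromptState], List[str]],
    evaluate: Callable[[str], Feedback],
    rounds: int = 3,
    pool_size: int = 8,
) -> Tuple[List[str], PromptState]:
    """Alternate query refinement (rewrite) and prompt refinement for a few rounds."""
    state = PromptState()
    queries = list(seed_queries)
    found: List[str] = []
    for _ in range(rounds):
        # Query refinement: explore variations of the current queries.
        candidates = [v for q in queries for v in rewrite(q, state)]
        feedback = [evaluate(c) for c in candidates]
        # Keep queries that trigger a failure and still look realistic.
        for fb in feedback:
            if fb.violates_objective and fb.is_realistic and fb.query not in found:
                found.append(fb.query)
        # Prompt refinement: update strategies and policies from the feedback.
        state = refine_prompt(state, feedback)
        # Cap the pool so the number of candidates does not grow unboundedly.
        queries = candidates[:pool_size] or queries
    return found, state
```

In this sketch, `rewrite` would wrap the query generator conditioned on the current strategies and policies, and `evaluate` would wrap the agent under test together with the objective and realism judges.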

We evaluate PQR on detecting unhelpful responses from an e-commerce QA agent. Compared to previous methods, our method uncovers 23% to 78% more unhelpful responses and generates more diverse and realistic queries.