PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference


Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation to implement Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references.

We study three architectures that span the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback followed by GPT-labeled in-domain data, the best model reaches a Pearson correlation of 0.747 with the ground-truth proxy on a held-out test set, outperforming the reference-based evaluators from prior work.
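For readers unfamiliar with the metric, the following is a minimal sketch of how a Pearson correlation between judge scores and proxy labels can be computed. The function name `pearson` and the toy score lists are illustrative assumptions, not taken from the PoQ-Judge code or data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical judge scores and proxy labels, for illustration only.
judge_scores = [0.2, 0.5, 0.7, 0.9]
proxy_labels = [0.1, 0.6, 0.6, 1.0]
r = pearson(judge_scores, proxy_labels)
```

A value near 1 means the judge ranks and spaces outputs much as the proxy does; a value near 0 means the two are essentially unrelated.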

As a reference-free component in composite scoring, the judge achieves a Pearson correlation of 0.645, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7% with only modest quality loss.
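The general idea of cascade evaluation can be sketched as follows: a cheap judge scores every pair, and only pairs whose cheap score falls in an uncertain band are escalated to a more expensive judge. This is an illustrative sketch under stated assumptions, not the paper's procedure. The escalation rule, the thresholds `low` and `high`, the cost units, and all function names are hypothetical.

```python
from typing import Callable, Tuple

# A judge maps a (query, output) pair to a quality score in [0, 1].
Judge = Callable[[str, str], float]


def cascade_score(
    query: str,
    output: str,
    cheap: Judge,
    strong: Judge,
    low: float = 0.3,          # hypothetical: at or below this, trust the cheap judge
    high: float = 0.7,         # hypothetical: at or above this, trust the cheap judge
    cheap_cost: float = 1.0,   # hypothetical cost units per call
    strong_cost: float = 10.0,
) -> Tuple[float, float]:
    """Return (score, cost) for one query-output pair."""
    score = cheap(query, output)
    # Confident cheap scores are accepted without escalation.
    if score <= low or score >= high:
        return score, cheap_cost
    # Uncertain scores are re-evaluated by the stronger judge.
    return strong(query, output), cheap_cost + strong_cost


# Toy stand-in judges, for illustration only.
def confident_cheap(query: str, output: str) -> float:
    return 0.9


def unsure_cheap(query: str, output: str) -> float:
    return 0.5


def toy_strong(query: str, output: str) -> float:
    return 0.6


accepted = cascade_score("q", "a", confident_cheap, toy_strong)
escalated = cascade_score("q", "a", unsure_cheap, toy_strong)
```

In this sketch, the cost saving depends on how often the cheap judge is confident: the more pairs it resolves alone, the fewer calls reach the expensive judge.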

Results are much stronger on QA than on summarization, which points to the quality of the ground-truth proxy as the main remaining limitation.