PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) and agentic workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a single shared context, leaving decisions vulnerable to conflicting evidence and context contamination.

We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias.
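The adjudication idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the real system presumably relies on an MLLM for each step, whereas this sketch swaps in a plain reliability-weighted vote. The `Evidence` type, the tool names, the reliability table, and the default reliability of 0.5 are all hypothetical. What it shows is the separation of concerns: each item is scored in isolation, conflicts are detected explicitly, and the final step sees only the structured summary rather than the raw tool transcripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str        # hypothetical tool name, e.g. "vqa" or "classifier"
    label: str         # the diagnosis this evidence item supports
    confidence: float  # the tool's self-reported confidence in [0, 1]


def assess(ev, reliability):
    """Score one evidence item in isolation, without seeing the others."""
    # Unknown tools get a neutral reliability of 0.5 (an assumption).
    return ev.confidence * reliability.get(ev.source, 0.5)


def adjudicate(evidence, reliability):
    """Return (verdict, conflict_flag, support_per_label)."""
    if not evidence:
        raise ValueError("no evidence to adjudicate")
    # Step 1: independent evaluation of each item.
    scored = [(ev, assess(ev, reliability)) for ev in evidence]
    # Step 2: conflict analysis over the structured scores only.
    support = defaultdict(float)
    for ev, score in scored:
        support[ev.label] += score
    conflict = len(support) > 1
    # Step 3: final judgment from the summary alone, standing in for
    # the "fresh context" that never sees the raw tool outputs.
    verdict = max(support, key=support.get)
    return verdict, conflict, dict(support)
```

For example, a confident but historically unreliable VQA answer of "tumor" can be outvoted by a classifier and a retrieval hit that both support "normal", and the disagreement is reported instead of being silently absorbed.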

We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates visual question answering (VQA) hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.
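To make the experience system more concrete, here is a minimal sketch of one way a Beta-Bernoulli reliability model with continuous credit could work. The abstract does not give the update rule, so everything here is an assumption: the `ToolExperience` class, the choice to add a credit in [0, 1] as a fractional success and its complement as a fractional failure, the uniform Beta(1, 1) starting point, and the `similarity_weighted_prior` helper that scales each past record by its similarity to the current case.

```python
from dataclasses import dataclass


@dataclass
class ToolExperience:
    """Beta(alpha, beta) belief about how often a tool is reliable."""
    alpha: float = 1.0  # pseudo-count of "reliable" outcomes (uniform prior)
    beta: float = 1.0   # pseudo-count of "unreliable" outcomes

    def update(self, credit: float) -> None:
        """Apply a continuous credit in [0, 1] instead of a hard 0/1 outcome."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean reliability."""
        return self.alpha / (self.alpha + self.beta)


def similarity_weighted_prior(records, base=(1.0, 1.0)):
    """Build a prior for a new case from past (similarity, credit) pairs.

    Each past record contributes pseudo-counts scaled by its similarity
    to the current case, so similar cases shape the prior the most.
    """
    alpha, beta = base
    for similarity, credit in records:
        alpha += similarity * credit
        beta += similarity * (1.0 - credit)
    return ToolExperience(alpha, beta)
```

No gradient updates are involved, which matches the "training-free" description: the model is just a pair of running counts per tool. As a usage example under these assumptions, `similarity_weighted_prior([(1.0, 1.0), (0.5, 0.0)])` yields Beta(2.0, 1.5), because the fully similar success counts in full while the half-similar failure counts only half.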
