Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

Abstract: Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates therapeutic response generation as a decision-refinement problem driven by multi-dimensional, human-aligned evaluation.

In Stage I, we present TheraJudge, an open-source therapeutic evaluator trained with preference-based optimization on human-annotated data to produce reliable judgments across seven psychological dimensions. In Stage II, we present TheraAgent, which acts on TheraJudge’s evaluations through a coordinated refinement process in which specialized Critic, Coach, and Therapist roles translate evaluative signals into targeted response revisions.
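
The Stage II process can be read as a judge-gated revision cycle. The following is a minimal Python sketch under stated assumptions, not the released implementation: the names `refine`, `toy_judge`, `toy_critic`, `toy_coach`, and `toy_therapist`, the acceptance threshold, the round limit, and the 1-5 per-dimension scale are all hypothetical, and the toy functions stand in for model calls.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]  # dimension name -> rating (1-5 scale is an assumption)


def refine(
    user_message: str,
    draft: str,
    judge: Callable[[str, str], Scores],
    critic: Callable[[str, Scores, List[str]], str],
    coach: Callable[[str], str],
    therapist: Callable[[str, str, str], str],
    threshold: float = 4.0,  # hypothetical acceptance threshold
    max_rounds: int = 3,     # hypothetical revision budget
) -> Tuple[str, Scores]:
    """Revise a draft until every judged dimension meets the threshold."""
    response = draft
    scores = judge(user_message, response)
    for _ in range(max_rounds):
        weak = [dim for dim, s in scores.items() if s < threshold]
        if not weak:
            break  # the judge accepts the response on all dimensions
        critique = critic(response, scores, weak)   # what is wrong
        guidance = coach(critique)                  # how to fix it
        response = therapist(user_message, response, guidance)  # rewrite
        scores = judge(user_message, response)      # re-evaluate
    return response, scores


# Toy stand-ins so the loop can run without any model (all hypothetical).
def toy_judge(user_message: str, response: str) -> Scores:
    empathic = "sounds really hard" in response.lower()
    return {"Safety": 5.0, "Relevance": 5.0, "Empathy": 5.0 if empathic else 2.0}


def toy_critic(response: str, scores: Scores, weak: List[str]) -> str:
    return "Low scores on: " + ", ".join(weak)


def toy_coach(critique: str) -> str:
    return "Acknowledge the user's feelings first. (" + critique + ")"


def toy_therapist(user_message: str, response: str, guidance: str) -> str:
    return "That sounds really hard. " + response


final, final_scores = refine(
    "I failed my exam.", "Study more next time.",
    toy_judge, toy_critic, toy_coach, toy_therapist,
)
print(final)
print(final_scores)
```

In this sketch the judge acts as both the trigger and the stopping criterion: a response is revised only while some dimension falls below the threshold, which mirrors the idea of evaluation as a control signal.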

Empirically, TheraJudge achieves strong agreement with clinician ratings, with intraclass correlation coefficients (ICC) of 0.87 to 0.95, surpassing supervised baselines and strong closed-source judges, particularly on critical dimensions such as Safety, Relevance, and Empathy. Acting on these evaluations, TheraAgent yields a +0.43 improvement in human-rated therapeutic quality (on a 5-point scale) under blind evaluation, with 96% inter-rater reliability among clinicians.
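
The abstract does not state which ICC form is reported. As an illustration only, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a subjects-by-raters table using the standard ANOVA mean squares. The function name `icc_2_1` and the toy ratings are assumptions, not data from the paper.

```python
from typing import List


def icc_2_1(ratings: List[List[float]]) -> float:
    """ICC(2,1) for a complete table: rows are subjects, columns are raters."""
    n = len(ratings)      # number of rated responses (subjects)
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy table: 6 responses scored by 4 raters (illustrative values only).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
value = icc_2_1(toy_ratings)
print(round(value, 2))  # about 0.29: raters rank alike but differ in level
```

Because ICC(2,1) measures absolute agreement, systematic offsets between raters lower the value even when rankings match, which is why the toy table scores far below the 0.87 to 0.95 range reported above.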

Low-quality responses (rated $\leq 3$) improve by +2.45 points with a 94% recovery rate, demonstrating targeted correction of unsafe outputs. Overall, our results indicate that effective alignment of mental-health LLMs stems from acting on human-aligned evaluation rather than relying solely on stronger generation. We release our code.
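
The abstract does not define "recovery rate". One plausible reading, assumed here, is the share of initially low-rated responses (score $\leq 3$) whose revised score rises above 3. The sketch below computes the mean gain and that share for paired before/after scores; the function name `low_quality_stats` and the toy scores are hypothetical.

```python
from typing import List, Tuple


def low_quality_stats(
    before: List[float], after: List[float], cutoff: float = 3.0
) -> Tuple[float, float]:
    """Mean gain and recovery rate over responses initially at or below cutoff."""
    low = [(b, a) for b, a in zip(before, after) if b <= cutoff]
    if not low:
        raise ValueError("no responses at or below the cutoff")
    mean_gain = sum(a - b for b, a in low) / len(low)
    # Assumed definition: "recovered" means the revised score exceeds the cutoff.
    recovery_rate = sum(a > cutoff for _, a in low) / len(low)
    return mean_gain, recovery_rate


# Toy scores on a 5-point scale (illustrative, not results from the paper).
before = [2, 3, 4, 5, 1]
after = [4, 5, 4, 5, 3]
gain, rate = low_quality_stats(before, after)
print(gain, rate)  # 2.0 and 2/3: three low responses, two end above the cutoff
```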
