Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support
Abstract: Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates therapeutic response generation as a decision-refinement problem driven by multi-dimensional, human-aligned evaluation.
In Stage I, we introduce TheraJudge, an open-source therapeutic evaluator trained via preference-based optimization on human-annotated data to produce reliable judgments across 7 psychological dimensions. In Stage II, we introduce TheraAgent, which operationalizes TheraJudge’s evaluations through a coordinated refinement process with specialized Critic, Coach, and Therapist roles that translate evaluative signals into targeted response revisions.
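As a rough illustration of the Stage II idea, the sketch below wires a judge and three roles into a refinement loop. It is not the released TheraAgent implementation. The function names, the 1-5 score scale, the `threshold` and `max_rounds` parameters, and the toy role functions are all assumptions made for this example; the abstract only states that Critic, Coach, and Therapist roles turn TheraJudge's evaluations into targeted revisions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Scores = Dict[str, float]  # dimension name -> score (1-5 scale is an assumption)


@dataclass
class RefinementResult:
    response: str
    scores: Scores
    rounds: int


def refine(
    user_message: str,
    draft: str,
    judge: Callable[[str, str], Scores],
    critic: Callable[[str, str, Scores], str],
    coach: Callable[[str, Scores], str],
    therapist: Callable[[str, str, str], str],
    threshold: float = 4.0,  # hypothetical acceptance threshold
    max_rounds: int = 3,     # hypothetical revision budget
) -> RefinementResult:
    """Revise a draft until every judged dimension reaches the threshold."""
    response = draft
    scores = judge(user_message, response)
    rounds = 0
    while rounds < max_rounds and min(scores.values()) < threshold:
        # Only the weak dimensions are passed on, so revisions stay targeted.
        weak = {dim: s for dim, s in scores.items() if s < threshold}
        critique = critic(user_message, response, weak)     # what is wrong
        plan = coach(critique, weak)                        # how to fix it
        response = therapist(user_message, response, plan)  # revised reply
        scores = judge(user_message, response)              # re-evaluate
        rounds += 1
    return RefinementResult(response, scores, rounds)


# Toy stand-ins for the model-backed roles (purely illustrative).
def toy_judge(msg: str, resp: str) -> Scores:
    empathy = 5.0 if "that sounds" in resp.lower() else 2.0
    return {"Safety": 5.0, "Empathy": empathy}


def toy_critic(msg: str, resp: str, weak: Scores) -> str:
    return "Weak dimensions: " + ", ".join(sorted(weak))


def toy_coach(critique: str, weak: Scores) -> str:
    return "Acknowledge the user's feelings before giving advice."


def toy_therapist(msg: str, resp: str, plan: str) -> str:
    return "That sounds really hard. " + resp


result = refine(
    "I can't sleep lately.",
    "Try going to bed earlier.",
    toy_judge, toy_critic, toy_coach, toy_therapist,
)
print(result.rounds, result.scores, result.response)
```

In this sketch the judge acts as the stopping criterion as well as the source of feedback, which is one way to read "evaluation as an actionable control signal"; a real system would replace the toy functions with model calls.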
Empirically, TheraJudge achieves strong agreement with clinician ratings (intraclass correlation coefficient, ICC = 0.87–0.95), surpassing supervised baselines and strong closed-source judges, particularly on critical dimensions such as Safety, Relevance, and Empathy. Acting on these evaluations, TheraAgent yields a +0.43 improvement in human-rated therapeutic quality (on a 5-point scale) under blind evaluation, with 96% clinician inter-rater reliability.
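For readers unfamiliar with the agreement metric, the following is a minimal sketch of one common ICC variant, ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater). The abstract does not say which ICC form was used, so the choice of variant is an assumption, as are the function name and the toy ratings.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for an n x k table: n rated items (rows), k raters (columns)."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-item mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy example: each row is one response, columns are (judge, clinician) scores.
toy_ratings = [[4, 4], [2, 3], [5, 5], [3, 3], [1, 2]]
print(round(icc_2_1(toy_ratings), 3))
```

Because this variant measures absolute agreement, a judge that ranks responses correctly but scores them systematically higher or lower than clinicians is penalized, which a plain correlation would not capture.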
Low-quality responses (rated $\leq 3$) improve by +2.45 points with a 94% recovery rate, demonstrating targeted correction of unsafe outputs. Overall, our results indicate that effective alignment of mental-health LLMs stems from acting on human-aligned evaluation, rather than relying solely on stronger generation. We release code at this https URL.