Evaluating the Utility of Personal Health Records in Personalized Health AI

Abstract: Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health, but the information in these records is complex, which can hinder insight. In this study, we assess the potential of a large language model (LLM; Gemini 3.0 Flash) to provide helpful answers to user health queries when given clinical data from PHRs as context.

A total of 2,257 user queries were drawn from three distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945).

Gemini responses were generated under three conditions: (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; and (3) with full, extensive clinical notes. For evaluation, we used an existing rating framework (SHARP) and developed a new framework targeting specific error modes in interpreting PHRs. Autoraters evaluated the full set and clinicians rated a subset (n=95); both sets of raters had access to the full PHR context.
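To make the three context conditions concrete, here is a minimal sketch of how the context for each one might be assembled. The `PHR` class, its field names, the condition labels, and the prompt layout are all hypothetical illustrations, not the study's actual pipeline; in particular, whether the full-notes condition also includes the basic summary is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PHR:
    """Hypothetical, simplified record; real PHRs are far richer."""
    demographics: str
    conditions: List[str]
    medications: List[str]
    clinical_notes: List[str] = field(default_factory=list)


def build_context(phr: PHR, condition: str) -> str:
    """Return the PHR text to place before the query for one condition."""
    if condition == "none":
        return ""  # condition (1): no PHR context
    summary = (
        f"Demographics: {phr.demographics}\n"
        f"Conditions: {', '.join(phr.conditions)}\n"
        f"Medications: {', '.join(phr.medications)}"
    )
    if condition == "summary":
        return summary  # condition (2): basic summary only
    if condition == "full":
        # Assumption: condition (3) includes the summary plus all notes.
        return summary + "\n\nClinical notes:\n" + "\n".join(phr.clinical_notes)
    raise ValueError(f"unknown condition: {condition}")


def build_prompt(phr: PHR, query: str, condition: str) -> str:
    """Combine the optional PHR context with the user query."""
    context = build_context(phr, condition)
    return f"{context}\n\nQuestion: {query}" if context else f"Question: {query}"


# Synthetic record for illustration only, not data from the study.
example = PHR(
    demographics="54-year-old female",
    conditions=["hypertension"],
    medications=["lisinopril"],
    clinical_notes=["2023-01-05: BP 150/95."],
)
prompts = {c: build_prompt(example, "Is my blood pressure okay?", c)
           for c in ("none", "summary", "full")}
```

Holding the query fixed and varying only the context keeps the three responses comparable, which is what a paired comparison across conditions requires.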

We see significant improvements in the helpfulness of answers to all question types when PHR data is provided (p < 0.001, paired t-test). We also observe potential gains in the safety, accuracy, relevance, and personalization of answers. Our PHR evaluation framework further identifies gaps in the LLM's understanding of particular aspects of complex PHRs, such as temporal disorientation and rare but meaningful confabulations.
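For readers less familiar with the paired t-test mentioned above, the following sketch computes the test statistic from per-query score differences using only the Python standard library. The function name and the ratings are made up for illustration and are not data from the study; in practice a p-value would be obtained from the t distribution, for example with `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for a paired t-test on two equal-length lists."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have the same length")
    n = len(baseline)
    if n < 2:
        raise ValueError("need at least two pairs")
    # The test operates on within-pair differences (same query, two conditions).
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Made-up helpfulness ratings for five queries, NOT data from the study.
no_phr = [2, 3, 3, 2, 4]
with_phr = [3, 4, 3, 4, 5]
t_stat, dof = paired_t_statistic(no_phr, with_phr)
```

Pairing matters here because each query is answered both with and without PHR context, so the per-query difference removes variation in how hard individual questions are.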

These results suggest that PHR data has the potential to help people with a wide range of needs, and they provide a framework for monitoring gaps in LLM answers based on PHR context. This study motivates further work to assess and realize the potential benefits to users of understanding their health records.
