Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Abstract: Educational dialogue is a valuable but sensitive resource for research. The same transcripts that capture authentic learning often contain personally identifiable information (PII) entangled with curricular content: “Riemann” may refer to a real student or to a mathematical concept.

Existing approaches force a tradeoff between governance and accuracy. Commercial large language models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve data governance but over-redact curricular terms.

We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using the surrounding dialogue and the speaker role.
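The two-stage flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the regex proposers stand in for the two lightweight encoders and the deterministic rules, `toy_reviewer` stands in for the context-aware reviewer model, and names such as `Span`, `union_propose`, and `deidentify` are hypothetical rather than taken from the framework itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: str


# A proposer maps an utterance to candidate spans (hypothetical interface).
Proposer = Callable[[str], Iterable[Span]]


def regex_proposer(pattern: str) -> Proposer:
    """Build a rule-based proposer; a stand-in for an encoder or a rule set."""
    compiled = re.compile(pattern)

    def propose(utterance: str) -> List[Span]:
        return [Span(m.start(), m.end(), m.group()) for m in compiled.finditer(utterance)]

    return propose


def union_propose(utterance: str, proposers: Iterable[Proposer]) -> List[Span]:
    """Recall-first step: keep every span suggested by any proposer."""
    candidates = set()
    for propose in proposers:
        candidates.update(propose(utterance))
    return sorted(candidates, key=lambda s: (s.start, s.end))


def toy_reviewer(span: Span, context: str, speaker_role: str) -> bool:
    """Return True to redact. A keyword heuristic standing in for a local model;
    speaker_role is accepted but unused in this toy version."""
    window = context[max(0, span.start - 30): span.end + 30].lower()
    curricular_cues = ("sum", "integral", "hypothesis", "theorem")
    return not any(cue in window for cue in curricular_cues)


def deidentify(utterance: str, speaker_role: str,
               proposers: Iterable[Proposer],
               reviewer: Callable[[Span, str, str], bool]) -> str:
    """Apply a binary Redact/Keep decision to each candidate span."""
    out, cursor = [], 0
    for span in union_propose(utterance, proposers):
        if span.start < cursor:
            continue  # skip candidates overlapping an already handled span
        out.append(utterance[cursor:span.start])
        out.append("[REDACTED]" if reviewer(span, utterance, speaker_role) else span.text)
        cursor = span.end
    out.append(utterance[cursor:])
    return "".join(out)


# Illustrative proposers: capitalized tokens and a simple phone-number rule.
PROPOSERS = [regex_proposer(r"\b[A-Z][a-z]+\b"), regex_proposer(r"\b\d{3}-\d{4}\b")]
```

In this sketch the proposer deliberately over-generates ("Riemann" is always a candidate), and only the reviewer decides whether the surrounding context marks it as curricular or personal.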

We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop.
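For readers unfamiliar with the metric, macro F1 is the unweighted mean of per-class F1 scores. The sketch below shows the arithmetic on a toy example; the Redact/Keep label set and the helper names are assumptions for illustration, since the abstract does not state which classes the reported scores average over.

```python
from typing import List, Sequence


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy data (hypothetical): one Redact candidate is wrongly kept.
gold = ["Redact", "Redact", "Keep", "Keep"]
pred = ["Redact", "Keep", "Keep", "Keep"]
score = macro_f1(gold, pred)  # (2/3 + 4/5) / 2 = 11/15
```

Because each class contributes equally, a system that over-redacts curricular terms is penalized even when true PII is rare.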

On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1, versus 0.19 to 0.25 for the smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.