CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety
Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression.
While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem.
To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent.
CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance.
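The detect-then-rewrite flow described above can be illustrated with a minimal sketch. Nothing here is taken from the CR4T implementation: the keyword lists, the `Assessment` type, and the `detect`, `template_rewriter`, and `guard` functions are all hypothetical stand-ins. A real system would presumably use a trained classifier for detection and an LLM call for the domain-conditioned rewrite. The sketch only shows the selective control flow, in which acceptable responses pass through untouched and only unsafe or refusal-style responses are rewritten.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical keyword lists; a real detector would be a trained classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")
DOMAIN_TERMS = {
    "self_harm": ("hurt myself", "self-harm"),
    "substances": ("get drunk", "vape"),
}
# Hypothetical phrases treated as signs of risk-amplifying content.
UNSAFE_TERMS = ("here's how to", "step 1")


@dataclass
class Assessment:
    label: str             # "acceptable", "unsafe", or "refusal"
    domain: Optional[str]  # risk domain inferred from the prompt, if any


def detect(prompt: str, response: str) -> Assessment:
    """Toy stand-in for lightweight risk detection."""
    p, r = prompt.lower(), response.lower()
    domain = next(
        (d for d, terms in DOMAIN_TERMS.items() if any(t in p for t in terms)),
        None,
    )
    if any(marker in r for marker in REFUSAL_MARKERS):
        return Assessment("refusal", domain)
    if domain is not None and any(term in r for term in UNSAFE_TERMS):
        return Assessment("unsafe", domain)
    return Assessment("acceptable", domain)


def template_rewriter(prompt: str, response: str, a: Assessment) -> str:
    """Placeholder for a domain-conditioned rewrite (an LLM call in practice)."""
    topic = a.domain or "general"
    return f"[rewritten for teens | domain={topic}] supportive guidance goes here"


Rewriter = Callable[[str, str, Assessment], str]


def guard(prompt: str, response: str, rewriter: Rewriter = template_rewriter) -> str:
    """Rewrite only flagged responses; leave acceptable ones unchanged."""
    assessment = detect(prompt, response)
    if assessment.label == "acceptable":
        return response
    return rewriter(prompt, response, assessment)
```

Passing the rewriter in as a parameter keeps the sketch model-agnostic in the same spirit as the framework: the guard logic does not depend on which model produced the response or which one performs the rewrite.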
Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.