CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression.

While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem.

To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent.

CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance.
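
The detect-then-rewrite flow described above can be illustrated with a minimal sketch. This is not the CR4T implementation: the keyword lists, the domain names, the `DOMAIN_GUIDANCE` table, and the functions `detect`, `build_rewrite_prompt`, and `safeguard` are all hypothetical, and the keyword detector merely stands in for whatever lightweight risk detector is actually used. The sketch only shows the selective structure: acceptable outputs pass through untouched, while unsafe or refusal-style outputs are sent to a rewriter with domain-specific instructions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker lists; a real detector would likely be a trained classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")
RISK_MARKERS = {
    "body_image": ("skip meals", "crash diet"),
    "substances": ("get drunk", "vape"),
}

# Hypothetical per-domain rewriting guidance (the "domain condition").
DOMAIN_GUIDANCE = {
    "body_image": "Remove harmful dieting advice; suggest healthy habits and trusted adults.",
    "substances": "Remove usage instructions; explain risks in age-appropriate terms.",
    None: "Replace the flat refusal with safe, constructive guidance.",
}


@dataclass
class Detection:
    label: str                    # "acceptable", "refusal", or "unsafe"
    domain: Optional[str] = None  # risk domain, set only for unsafe outputs


def detect(response: str) -> Detection:
    """Lightweight risk detection: flag unsafe or refusal-style outputs."""
    text = response.lower()
    for domain, markers in RISK_MARKERS.items():
        if any(marker in text for marker in markers):
            return Detection("unsafe", domain)
    if any(marker in text for marker in REFUSAL_MARKERS):
        return Detection("refusal")
    return Detection("acceptable")


def build_rewrite_prompt(prompt: str, response: str, det: Detection) -> str:
    """Assemble a domain-conditioned critique-and-revise instruction."""
    return (
        f"Issue: {det.label} (domain: {det.domain})\n"
        f"Guidance: {DOMAIN_GUIDANCE[det.domain]}\n"
        f"User message: {prompt}\n"
        f"Original response: {response}\n"
        "Rewrite the response for a teenage reader, keeping any benign intent."
    )


def safeguard(prompt: str, response: str, rewriter: Callable[[str], str]) -> str:
    """Rewrite only flagged outputs; acceptable ones pass through unchanged."""
    det = detect(response)
    if det.label == "acceptable":
        return response
    return rewriter(build_rewrite_prompt(prompt, response, det))
```

Here `rewriter` is any callable that maps an instruction string to revised text, for example a wrapper around an LLM call. Passing it in as a parameter keeps the sketch independent of a particular model, in the spirit of the model-agnostic design the abstract describes.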

Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.