POLARIS: Guiding Small Models to Write Long Stories
Small open-weight models struggle with long-form creative writing: their generated stories either fall far short of the requested length or degrade significantly in quality as length increases, especially when compared to frontier models.
We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group.
By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B.
Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely.
A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B.
Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both.
More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
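
To make the role of the human-reference anchor concrete, the following is a minimal sketch of group-relative advantages when a reference story is added to a GRPO group. It is not the paper's implementation. It assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation) and assumes the human reference receives a scalar reward on the same scale as the judge scores; the abstract specifies neither. The function name `group_advantages` and the `eps` parameter are hypothetical.

```python
import statistics


def group_advantages(sample_rewards, reference_reward, eps=1e-6):
    """Group-normalized advantages for one prompt (illustrative sketch).

    sample_rewards: judge scores for the policy's sampled stories.
    reference_reward: score given to the human-written reference.
        ASSUMPTION: the abstract does not say how this value is set.
    Returns (advantages for the samples, advantage for the reference).
    """
    # The reference joins the group, so it shifts the mean and the spread.
    rewards = list(sample_rewards) + [reference_reward]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # ASSUMPTION: standard GRPO-style normalization; eps avoids division by zero.
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return advantages[:-1], advantages[-1]


if __name__ == "__main__":
    sample_adv, ref_adv = group_advantages([2.0, 4.0, 3.0], reference_reward=7.0)
    print("sample advantages:", sample_adv)
    print("reference advantage:", ref_adv)
```

Under these assumptions, a reference that scores above every sample receives the largest positive advantage in the group and lowers the advantages of the sampled stories relative to a group without it, which is one way to read "high-reward anchor".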