The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Abstract: Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while also solving the task.

The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce the constraint tax, a measurement protocol that isolates the loss in answer accuracy and executable accuracy caused by structured-output constraints while holding the model, the task distribution, and the problem instances fixed.
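
As a rough illustration of the paired comparison this protocol implies, the sketch below computes a tax as the accuracy difference between unconstrained and constrained runs of the same model on the same problem instances. The function names and the choice of a simple difference of accuracies are assumptions made for illustration; the abstract does not give a formula.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of instances marked correct."""
    if not correct:
        raise ValueError("need at least one instance")
    return sum(correct) / len(correct)


def constraint_tax(
    free_correct: Sequence[bool],
    constrained_correct: Sequence[bool],
) -> float:
    """Accuracy lost when the constraint is applied (hypothetical definition).

    Both sequences must cover the same problem instances in the same order,
    produced by the same model; only the decoding mode differs.
    A positive value means the constraint cost accuracy.
    """
    if len(free_correct) != len(constrained_correct):
        raise ValueError("runs must be paired on identical instances")
    return accuracy(free_correct) - accuracy(constrained_correct)


# Toy data, not results from the paper.
free = [True, True, False, True]
constrained = [True, False, False, False]
tax = constraint_tax(free, constrained)
```

Pairing on identical instances is what allows a difference to be attributed to the constraint rather than to a shift in the problem sample.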

Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and raises the rate of wrong-valid-schema outputs (schema-valid but incorrect) from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same tool-call schema enforced as a hard constraint, even though both modes are 100.0% schema-valid. The error is semantic, not structural.

We also show that models at the 3B boundary still pay a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
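
A minimal sketch of what separate reporting could look like is given below. The `Generation` record, its field names, and the choice to compute every rate over all generations (with executable accuracy restricted to generations that have an executable check) are assumptions for illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Generation:
    """One model output with its per-metric judgments (hypothetical record)."""
    schema_valid: bool                         # output parses and matches the schema
    answer_correct: bool                       # extracted answer matches the reference
    executable_correct: Optional[bool] = None  # None when there is no tool call to run


def report(gens: Sequence[Generation]) -> Dict[str, Optional[float]]:
    """Return the four metrics separately instead of one blended score."""
    n = len(gens)
    if n == 0:
        raise ValueError("need at least one generation")
    runnable = [g for g in gens if g.executable_correct is not None]
    return {
        "schema_validity": sum(g.schema_valid for g in gens) / n,
        "answer_accuracy": sum(g.answer_correct for g in gens) / n,
        # Schema-valid outputs whose answer is wrong: the silent failure mode.
        "wrong_valid_schema_rate": sum(
            g.schema_valid and not g.answer_correct for g in gens
        ) / n,
        "executable_accuracy": (
            sum(bool(g.executable_correct) for g in runnable) / len(runnable)
            if runnable else None
        ),
    }


# Toy data, not results from the paper.
gens = [
    Generation(True, True, True),
    Generation(True, False, False),
    Generation(False, True),
    Generation(True, False, False),
]
metrics = report(gens)
```

Keeping the wrong-valid-schema rate as its own field makes the failure visible in cases where schema validity alone would look perfect.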
