The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Abstract: Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity-hardware compatibility but have limited capacity to satisfy schemas while also solving the task.

The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce the \emph{constraint tax}, a measurement protocol that isolates the loss in answer accuracy and executable accuracy caused by structured-output constraints, holding the model, the task distribution, and the problem instances fixed.

Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural.
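
Read as paired differences on the same problem instances, the figures above imply the following tax sizes. The symbol $\Delta$ and its subscripts are notation introduced here for illustration only; they are not taken from the paper's own definitions.

```latex
\begin{align*}
% Aggregate answer accuracy: baseline minus hard answer-only schema decoding
\Delta_{\text{answer}} &= 19.7\% - 11.0\% = 8.7 \text{ points} \\
% Calendar tool-call task, Qwen2.5-1.5B: prompt-only JSON minus hard tool-call schema
\Delta_{\text{exec}} &= 91.5\% - 48.0\% = 43.5 \text{ points}
\end{align*}
```

In the calendar case schema validity is 100.0% in both modes, so the entire 43.5-point gap appears among outputs that already pass validation.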

We also show that models at the 3B boundary still pay a direct-schema tax, and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
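
As a minimal sketch of what separate reporting could look like, the Python below scores a batch of generations on the four metrics and takes a paired difference between two decoding modes. The `Generation` record, the `report` and `constraint_tax` helpers, and the choice of all generations as the denominator for the wrong-valid-schema rate are assumptions made for illustration; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Generation:
    """One scored model output (hypothetical record layout)."""
    schema_valid: bool                         # output parses and matches the schema
    answer_correct: bool                       # extracted answer matches the reference
    executable_correct: Optional[bool] = None  # None when the task has no executable check


def report(records: Sequence[Generation]) -> dict:
    """Return the four rates separately instead of one blended score."""
    n = len(records)
    if n == 0:
        raise ValueError("no generations to score")
    executable = [r for r in records if r.executable_correct is not None]
    return {
        "schema_validity": sum(r.schema_valid for r in records) / n,
        "answer_accuracy": sum(r.answer_correct for r in records) / n,
        "executable_accuracy": (
            sum(bool(r.executable_correct) for r in executable) / len(executable)
            if executable else None
        ),
        # Assumed definition: schema-valid but wrong, over all generations.
        "wrong_valid_schema_rate": sum(
            r.schema_valid and not r.answer_correct for r in records
        ) / n,
    }


def constraint_tax(free: Sequence[Generation],
                   constrained: Sequence[Generation],
                   metric: str = "answer_accuracy") -> float:
    """Accuracy lost under constraints, on the same model and problem instances."""
    return report(free)[metric] - report(constrained)[metric]


# Toy data, not results from the paper: four instances scored in two modes.
free = [Generation(True, True), Generation(False, True),
        Generation(True, False), Generation(False, False)]
hard = [Generation(True, True), Generation(True, False),
        Generation(True, False), Generation(True, False)]
```

On this toy data the hard mode reaches full schema validity while its answer accuracy falls and its wrong-valid-schema rate rises, which is the pattern a single validity number would hide.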
