Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

更少的上下文，更强的智能体：面向长周期工具使用型 LLM 智能体的高效上下文工程

Abstract: Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools.

摘要： 作为企业工作流自主智能体部署的大型语言模型面临一个关键挑战：来自企业系统的冗长工具响应可能导致上下文溢出、状态陈旧错误以及高昂的推理成本。我们通过使用模型上下文协议（Model Context Protocol）工具，研究了 Microsoft Dynamics 365 Finance and Operations 中自动化费用明细处理的问题。

We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison.

我们在包含 50 项任务的酒店费用基准测试中评估了四种 GPT-5 配置：无用户模型、完整对话历史、修剪至最后 5 对工具调用/响应的上下文，以及结合自动摘要的修剪方案。结果取自 5 次独立运行的平均值，在上下文工程对比中保持用户模型不变。

The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours.

无用户模型的基准测试仅实现了 8.0% 的完整明细处理。保留完整上下文将完成率提高到了 71.0%，但每个基准测试消耗了 1,480,996 个 Token，耗时 14.56 小时。修剪至最后 5 次工具调用将完成率提高到了 79.0%，同时将 Token 使用量减少至 535,274，运行时间缩短至 5.39 小时。增加摘要功能取得了最佳结果：91.6% 的完整明细处理率和 99.64% 的平均金额明细准确率，且仅消耗 553,374 个 Token，耗时 5.79 小时。

We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.

我们进一步报告了置信区间、效应量分析、修剪与摘要窗口的敏感性分析、故障分析、分为三类的五种费用类型的结果，以及使用 Claude Sonnet 4.5 进行的跨模型验证。这些结果表明，对于此类企业工具使用工作流，与保留完整历史记录相比，选择性保留近期工具交互并结合精简摘要，可以同时提高可靠性和效率。