Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Abstract: Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools.
We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison.
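The pruning and pruning-plus-summarization configurations can be illustrated with a minimal sketch. The abstract does not give the implementation, so everything here is an assumption made for illustration: the `Message` type, the `pair_id` field linking a tool call to its response, the `prune_context` function, and the caller-supplied `summarize` callback are all hypothetical names, and the placement of the summary after the leading system messages is one plausible choice rather than the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Message:
    role: str                      # "system", "user", "assistant", or "tool"
    content: str
    pair_id: Optional[int] = None  # hypothetical: shared by a tool call and its response


def prune_context(
    messages: List[Message],
    keep_pairs: int = 5,
    summarize: Optional[Callable[[List[Message]], str]] = None,
) -> List[Message]:
    """Keep non-tool messages plus the last `keep_pairs` tool call/response pairs.

    If `summarize` is given, the dropped tool messages are condensed into a
    single summary message instead of being discarded silently.
    """
    # Collect tool pair ids in order of first appearance.
    pair_ids: List[int] = []
    for m in messages:
        if m.pair_id is not None and m.pair_id not in pair_ids:
            pair_ids.append(m.pair_id)

    # Guard keep_pairs == 0: a slice of [-0:] would keep every pair.
    kept = set(pair_ids[-keep_pairs:]) if keep_pairs > 0 else set()

    dropped = [m for m in messages if m.pair_id is not None and m.pair_id not in kept]
    retained = [m for m in messages if m.pair_id is None or m.pair_id in kept]

    if summarize is not None and dropped:
        summary = Message("system", "Summary of earlier tool activity: " + summarize(dropped))
        # Place the summary right after any leading system messages (assumed placement).
        idx = 0
        while idx < len(retained) and retained[idx].role == "system":
            idx += 1
        retained.insert(idx, summary)

    return retained
```

Calling `prune_context(history)` corresponds to the pruning-only configuration, and passing a `summarize` callback (for example, one that asks a model for a short recap of the dropped messages) corresponds to the pruning-with-summarization configuration, under the assumptions stated above.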
The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours.
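The relative savings implied by these figures can be checked with a short calculation. The sketch below uses only the numbers reported above; the dictionary keys and the helper names `reduction_pct` and `compare` are illustrative labels, not terms from the study.

```python
# Figures reported in the abstract (tokens and hours per benchmark, completion in %).
RESULTS = {
    "full_history": {"tokens": 1_480_996, "hours": 14.56, "completion": 71.0},
    "prune_last_5": {"tokens": 535_274, "hours": 5.39, "completion": 79.0},
    "prune_plus_summary": {"tokens": 553_374, "hours": 5.79, "completion": 91.6},
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def compare(name: str, baseline: str = "full_history") -> dict:
    """Compare one configuration against the baseline configuration."""
    base, cfg = RESULTS[baseline], RESULTS[name]
    return {
        "token_reduction_pct": round(reduction_pct(base["tokens"], cfg["tokens"]), 1),
        "runtime_reduction_pct": round(reduction_pct(base["hours"], cfg["hours"]), 1),
        "completion_gain_points": round(cfg["completion"] - base["completion"], 1),
    }


if __name__ == "__main__":
    for name in ("prune_last_5", "prune_plus_summary"):
        print(name, compare(name))
```

Relative to full-history retention, pruning alone uses roughly 64% fewer tokens and 63% less runtime, while pruning with summarization uses roughly 63% fewer tokens and 60% less runtime and gains 20.6 percentage points in complete itemization.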
We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.