Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades.

The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement.

Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics.
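
To make the idea of typed controls and policy validation concrete, the following is a minimal sketch and not the DX Terminal Pro implementation. Every name in it (`VaultControls`, `ProposedAction`, `validate`, the per-trade fraction limit, and the token allow-list) is a hypothetical chosen for illustration; the abstract does not specify the actual control schema or validation rules.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VaultControls:
    """Hypothetical typed controls a user might set on a vault."""
    max_trade_fraction: float       # largest share of the balance allowed per buy
    allowed_tokens: FrozenSet[str]  # token symbols the agent may trade


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical structured action emitted by the model."""
    side: str          # "buy" or "sell"
    token: str
    amount_eth: float


def validate(action: ProposedAction,
             controls: VaultControls,
             balance_eth: float) -> Optional[str]:
    """Return None if the action passes policy, else a rejection reason."""
    if action.side not in ("buy", "sell"):
        return "unknown side"
    if action.token not in controls.allowed_tokens:
        return "token not allowed"
    if action.amount_eth <= 0:
        return "non-positive amount"
    # Assumed rule: a single buy may not exceed a fixed fraction of the balance.
    limit = controls.max_trade_fraction * balance_eth
    if action.side == "buy" and action.amount_eth > limit:
        return "exceeds per-trade limit"
    return None


controls = VaultControls(max_trade_fraction=0.25,
                         allowed_tokens=frozenset({"AAA"}))
ok = validate(ProposedAction("buy", "AAA", 1.0), controls, balance_eth=10.0)
too_big = validate(ProposedAction("buy", "AAA", 3.0), controls, balance_eth=10.0)
```

In this sketch the validator sits between the model's output and submission, so a malformed or out-of-policy action is rejected with a reason instead of reaching settlement. This is one plausible reading of "policy-valid submitted transactions".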

Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
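
The reported before/after figures can be read as either percentage-point or relative changes. The short sketch below computes both from the numbers stated above. The helper names are illustrative, and the derived values are simple arithmetic on the reported percentages, not additional results from the study.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Change in percentage points (after minus before)."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, as a fraction."""
    return (after_pct - before_pct) / before_pct


# Fabricated sell rules: 57% -> 3%
fabricated_pp = point_change(57.0, 3.0)        # -54 percentage points
fabricated_rel = relative_change(57.0, 3.0)    # about -0.947, a ~95% relative drop

# Capital deployment: 42.9% -> 78.0%
deploy_pp = point_change(42.9, 78.0)           # about +35.1 percentage points
deploy_rel = relative_change(42.9, 78.0)       # about +0.818, a ~82% relative rise
```

The fee-led figure is reported only as "below 10%", so the 22.5-point drop from 32.5% to 10% is a lower bound on that reduction and not an exact value.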
