CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation
Abstract: Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation.
This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components: the recovery mechanism. CAX-Agent organizes execution into three layers (LLM service, agent harness, and solver backend) and adds a recovery ladder that escalates from deterministic rule patching, through model-driven regeneration, to context enrichment and finally human intervention.
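The escalation order of the recovery ladder can be illustrated with a minimal Python sketch. It is not the CAX-Agent implementation: the function names (`run_with_recovery`, `toy_solve`, `rule_patch`), the toy solver check, and the single patch rule are all hypothetical and exist only to show how rungs might be tried in order before falling back to a human.

```python
from typing import Callable, List, Optional, Tuple

# A rung receives the failing script and the error text and returns a
# repaired script, or None if it has nothing to offer. (Hypothetical type.)
Rung = Callable[[str, str], Optional[str]]
Solver = Callable[[str], Optional[str]]  # returns an error message, or None on success


def run_with_recovery(script: str, solve: Solver,
                      rungs: List[Tuple[str, Rung]]) -> Tuple[str, str]:
    """Try each rung in order; return (name of the rung that succeeded, final script)."""
    error = solve(script)
    if error is None:
        return "no_recovery_needed", script
    for name, rung in rungs:
        candidate = rung(script, error)
        if candidate is None:
            continue  # this rung cannot help, escalate to the next one
        new_error = solve(candidate)
        if new_error is None:
            return name, candidate
        # Keep the latest attempt so the next rung sees the newest failure.
        script, error = candidate, new_error
    return "human_intervention", script


def toy_solve(script: str) -> Optional[str]:
    """Toy stand-in for a solver backend: rejects SOLVE without /SOLU."""
    lines = [ln.strip().upper() for ln in script.splitlines()]
    if "SOLVE" in lines and "/SOLU" not in lines:
        return "SOLVE issued outside the solution processor"
    return None


def rule_patch(script: str, error: str) -> Optional[str]:
    """Deterministic rule: enter the solution processor before the first SOLVE."""
    if "solution processor" not in error:
        return None
    return script.replace("SOLVE", "/SOLU\nSOLVE", 1)


broken = "/PREP7\nFINISH\nSOLVE\nFINISH"
rung_used, fixed = run_with_recovery(broken, toy_solve, [("rule_patch", rule_patch)])
```

In this sketch, model-driven regeneration and context enrichment would be further `(name, rung)` entries appended to the list, so the escalation order is simply the list order.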
We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen’s kappa = 0.84, 96 percent of score pairs within one point).
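For readers unfamiliar with the agreement statistics, the following Python sketch shows one standard way to compute a quadratic weighted Cohen's kappa and the share of score pairs within one point. The function names and the example ratings are illustrative assumptions, not the paper's data or code, and the 0 to 4 score range is assumed.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Quadratic weighted Cohen's kappa for two raters on an integer scale."""
    k = max_score - min_score + 1
    n = len(a)
    # Observed joint counts of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    return 1.0 - num / den


def share_within_one(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of score pairs that differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


# Hypothetical ratings on an assumed 0-4 scale, for illustration only.
rater_1 = [4, 3, 4, 2, 0, 1, 4, 3]
rater_2 = [4, 3, 3, 2, 0, 2, 4, 1]
kappa = quadratic_weighted_kappa(rater_1, rater_2, 0, 4)
close = share_within_one(rater_1, rater_2)
```

The quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which is why this variant suits ordinal scores such as task-completion ratings.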
The model_only strategy achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84). It outperforms rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00), with values listed in the same metric order and large effect sizes (Cliff’s delta = 0.81 to 0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
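As a reminder of what the reported effect size measures, Cliff's delta can be computed with the short Python sketch below. The function name and the sample scores are hypothetical and are not drawn from the benchmark.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical per-run scores for two strategies, for illustration only.
scores_a = [2, 3, 4]
scores_b = [1, 2, 3]
delta = cliffs_delta(scores_a, scores_b)
```

The statistic ranges from -1 to 1; a value of 1 means every score in the first group exceeds every score in the second, and 0 means neither group tends to score higher.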