CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

Abstract: Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation.

This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components, the recovery mechanism. CAX-Agent organizes execution into three layers (LLM service, agent harness, and solver backend) and uses a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and, finally, human intervention.
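The escalation order of the recovery ladder can be illustrated with a minimal Python sketch. It is not the CAX-Agent implementation: `run_with_ladder`, `toy_execute`, `rule_patch`, and the rung names are hypothetical, the "solver" is a stand-in that only rejects an empty script or a misspelled `FINISH` command, and the model-driven and context-enrichment rungs are passed in as plain callables rather than real LLM calls.

```python
from typing import Callable, List, Optional, Tuple

ExecResult = Tuple[bool, str]                 # (succeeded, error message)
Rung = Callable[[str, str], Optional[str]]    # (script, error) -> revised script or None


def toy_execute(script: str) -> ExecResult:
    """Hypothetical stand-in for the solver backend (not real MAPDL behavior)."""
    if not script.strip():
        return False, "empty script"
    if "FINSH" in script:
        return False, "unknown command FINSH"
    return True, ""


def rule_patch(script: str, error: str) -> Optional[str]:
    """Deterministic rung: fix one known typo, otherwise decline."""
    if "FINSH" in error:
        return script.replace("FINSH", "FINISH")
    return None


def run_with_ladder(
    script: str,
    execute: Callable[[str], ExecResult],
    rungs: List[Tuple[str, Rung]],
) -> Tuple[Optional[str], List[str]]:
    """Execute the script; on failure, try each rung in order.

    Returns the working script (or None) and the trail of stages used.
    The trail ends with "human_intervention" when no rung succeeds.
    """
    trail = ["initial"]
    ok, error = execute(script)
    if ok:
        return script, trail
    for name, rung in rungs:
        revised = rung(script, error)
        if revised is None:
            continue  # this rung cannot help; escalate to the next one
        trail.append(name)
        script = revised
        ok, error = execute(script)
        if ok:
            return script, trail
    trail.append("human_intervention")
    return None, trail
```

In this sketch a cheap deterministic rung is always tried before a more expensive one, and the returned trail records which stage resolved the failure, which is the kind of information a zero-intervention rate would be computed from.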

We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks, running each case three times under each strategy (50 × 3 × 3 = 450 case-runs in total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen’s kappa = 0.84, with 96% of score pairs within one point of each other).
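For readers unfamiliar with the agreement statistics, the following sketch shows how a quadratic weighted Cohen's kappa and a within-one-point agreement fraction can be computed from two raters' scores. It assumes integer scores in `0 .. n_categories - 1`; the paper's actual rating scale and scoring scripts are not given here, and the function names are illustrative.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], n_categories: int) -> float:
    """Cohen's kappa with weights w_ij = (i - j)^2 / (k - 1)^2.

    Computed as 1 - sum(w * observed) / sum(w * expected), where the expected
    counts come from the product of the two raters' marginal distributions.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("rating lists must be non-empty and of equal length")
    k = n_categories
    if k < 2:
        raise ValueError("need at least two categories")

    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - num / den


def within_one_fraction(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of score pairs that differ by at most one point."""
    if len(a) == 0 or len(a) != len(b):
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)
```

The quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which is why this variant is commonly paired with ordinal scores like the ones described above.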

The model_only strategy achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84). On the same four metrics, in the same order, rule_only reaches 0.7733, 3.17/4, 7.03/10, and 0.00, and no_recovery reaches 0.6933, 2.74/4, 5.60/10, and 0.00; the differences correspond to large effect sizes (Cliff’s delta between 0.81 and 0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
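Cliff's delta, the effect size reported above, compares every score in one group with every score in the other. A minimal sketch of the standard definition follows; the function name and the sample values in the usage note are illustrative and are not the paper's data.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys)).

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    ties contribute zero.
    """
    if len(xs) == 0 or len(ys) == 0:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```

For example, `cliffs_delta([2, 3, 4], [1, 3])` compares six pairs, of which four favor the first group, one favors the second, and one is tied, giving (4 - 1) / 6 = 0.5. Because the statistic depends only on the ordering of scores, it suits bounded ordinal ratings without assuming normality.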