Explainable Causal Reinforcement Learning for Planetary Geology Survey Missions with Embodied Agent Feedback Loops

Introduction: A Personal Journey into Autonomous Planetary Science

It was 3 AM, and I was staring at a terminal window filled with telemetry data from a simulated Mars rover. The reinforcement learning (RL) agent I had trained overnight had just completed its 10,000th episode of navigating treacherous terrain, collecting rock samples, and avoiding hazards. But something was wrong: the agent had learned to “cheat” by exploiting a bug in the physics simulator, driving straight through a cliff to reach a high-value geological target faster. This was more than a simulator bug; it exposed a fundamental problem with deploying RL on real planetary missions, where mistakes can cost billions of dollars and, potentially, lives.

This moment sparked my deep dive into explainable causal reinforcement learning (XC-RL) for planetary geology survey missions. Over the past 18 months, I’ve been experimenting with combining causal inference, reinforcement learning, and embodied agent feedback loops to create systems that not only learn optimal policies but also explain why they make decisions and understand the causal structure of their environment. In this article, I’ll share what I’ve learned from building, breaking, and rebuilding these systems, from the theoretical foundations to practical code implementations.

Technical Background: The Convergence of Causality and Reinforcement Learning

Why Planetary Geology Needs More Than Traditional RL

Traditional RL agents operate on correlations: they learn that taking action A in state S leads to reward R with some probability. In planetary geology surveys, correlation is not enough. Consider a rover deciding whether to collect a basalt sample from a crater rim. The agent might learn that samples from crater rims yield high-value geological data, but it does not capture the causal mechanism: the impact event created the rim and, in doing so, exposed ancient bedrock. Without that causal understanding, the agent can fail when it encounters a formation that looks similar but is geologically distinct.
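
The crater-rim failure mode can be made concrete with a small simulation. This is only a sketch under stated assumptions: the names `sample_site` and `value_rate`, the probabilities 0.9 and 0.1, and the simplification that an impact always exposes bedrock are all invented for illustration and do not come from any mission data. It compares a correlational rule ("rim-like means valuable") with a rule tied to the mechanism ("exposed bedrock means valuable") before and after the mix of formations changes.

```python
import random

def sample_site(rng, p_impact):
    """Draw one rim-like site. Value depends only on the hidden mechanism."""
    impact = rng.random() < p_impact      # was the rim created by an impact?
    exposed_bedrock = impact              # assumption: impacts expose bedrock
    high_value = rng.random() < (0.9 if exposed_bedrock else 0.1)
    return {"rim_like": True,
            "exposed_bedrock": exposed_bedrock,
            "high_value": high_value}

def value_rate(sites, key):
    """Fraction of high-value sites among those where `key` is true."""
    chosen = [s for s in sites if s[key]]
    return sum(s["high_value"] for s in chosen) / len(chosen)

rng = random.Random(0)
# Training region: most rim-like formations really are impact rims.
train = [sample_site(rng, p_impact=0.9) for _ in range(5000)]
# Shifted region: most rim-like formations are look-alikes.
shifted = [sample_site(rng, p_impact=0.1) for _ in range(5000)]

# Correlational rule: appearance predicts value only while the mix holds.
corr_train = value_rate(train, "rim_like")
corr_shift = value_rate(shifted, "rim_like")

# Mechanism-based rule: value given exposed bedrock stays roughly constant.
causal_train = value_rate(train, "exposed_bedrock")
causal_shift = value_rate(shifted, "exposed_bedrock")

print(f"rim-like -> value:        train={corr_train:.2f} shifted={corr_shift:.2f}")
print(f"exposed bedrock -> value: train={causal_train:.2f} shifted={causal_shift:.2f}")
```

In this toy setup the appearance-based estimate drops sharply in the shifted region, while the estimate conditioned on the mechanism barely moves. That gap is the kind of robustness the causal approach is meant to provide.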

My exploration of this problem began when I was studying the autonomous navigation system of the Mars 2020 Perseverance rover. Perseverance combines visual odometry, terrain classification, and path planning, but it does not reason about causal relationships between geological features. This limitation became clear when I simulated a scenario in which a rover encountered a hematite-rich outcrop near a dried riverbed. A traditional RL agent would learn to associate “hematite + riverbed = high scientific value,” but it could not represent why: the hematite likely formed through aqueous processes, which points to past water activity.

The Causal Reinforcement Learning Framework

By studying Judea Pearl’s causal inference framework and combining it with modern deep RL, I developed a three-tier architecture for explainable causal RL:

Causal Discovery Layer: Learns the causal graph of the environment from observational and interventional data.
Causal Policy Layer: Uses the causal graph to make decisions that are robust to distribution shifts.
Explanation Layer: Generates human-readable explanations of decisions using counterfactual reasoning.

(Code implementation omitted for brevity, focusing on the core logic of the CausalRLAgent class.)
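
Since the original listing is not reproduced here, the following is a minimal sketch of how three such tiers might fit together. It is not the CausalRLAgent implementation. Every class and method name (`CausalGraph`, `observe_intervention`, `should_sample`, `explain`), the 0.2 threshold, and the boolean features are hypothetical. A simple mean-difference test stands in for real causal discovery, and a rule over the graph's parent set stands in for a learned policy.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CausalGraph:
    """Directed graph stored as effect -> set of direct causes."""
    parents: dict = field(default_factory=dict)

    def add_edge(self, cause, effect):
        self.parents.setdefault(effect, set()).add(cause)

    def causes_of(self, effect):
        return self.parents.get(effect, set())

class CausalDiscoveryLayer:
    """Tier 1: add cause -> effect when an intervention shifts the effect's mean."""
    def __init__(self, threshold=0.2):
        self.graph = CausalGraph()
        self.threshold = threshold  # assumed cut-off, not a tuned value

    def observe_intervention(self, cause, effect, effect_when_on, effect_when_off):
        if abs(mean(effect_when_on) - mean(effect_when_off)) > self.threshold:
            self.graph.add_edge(cause, effect)

class CausalPolicyLayer:
    """Tier 2: decide using only features the graph marks as causes of the target."""
    def __init__(self, graph, target="science_value"):
        self.graph, self.target = graph, target

    def should_sample(self, features):
        causes = self.graph.causes_of(self.target)
        return any(features.get(name, False) for name in causes)

class ExplanationLayer:
    """Tier 3: counterfactual explanation by flipping one causal feature at a time."""
    def __init__(self, policy):
        self.policy = policy

    def explain(self, features):
        decision = self.policy.should_sample(features)
        reasons = []
        for name in sorted(self.policy.graph.causes_of(self.policy.target)):
            flipped = {**features, name: not features.get(name, False)}
            if self.policy.should_sample(flipped) != decision:
                reasons.append(f"if {name} were {flipped[name]}, the decision would change")
        verb = "sample" if decision else "skip"
        detail = "; ".join(reasons) or "no single-feature flip changes it"
        return f"Decision: {verb}. {detail}."

discovery = CausalDiscoveryLayer()
# Made-up simulator outcomes: science value with bedrock exposed vs. not exposed.
discovery.observe_intervention("exposed_bedrock", "science_value",
                               [0.9, 0.8, 1.0], [0.1, 0.2, 0.0])
# Rim-like appearance alone does not move science value, so no edge is added.
discovery.observe_intervention("rim_like", "science_value",
                               [0.5, 0.4, 0.6], [0.5, 0.5, 0.5])

policy = CausalPolicyLayer(discovery.graph)
explainer = ExplanationLayer(policy)
site = {"rim_like": True, "exposed_bedrock": False}  # a look-alike formation
decision = policy.should_sample(site)
explanation = explainer.explain(site)
print(explanation)
```

The design choice worth noting is that the policy reads only the parents of `science_value` in the graph. A feature such as `rim_like`, which never earned an edge, cannot influence the decision however strongly it correlates with reward in the training data.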

Implementation Details: Building the Embodied Agent Feedback Loop

The Feedback Loop Architecture

During my research, I realized that the key to making causal RL work for planetary missions is the feedback loop between the agent’s actions and its causal model. When a rover collects a sample and finds that it is not what the agent expected, that information should update both the policy and the causal graph.
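
One way to picture this double update is the sketch below. It rests on assumptions rather than on the system described here: the `FeedbackAgent` class, its `feedback` method, the shared step size `lr`, and the use of a simple moving-average update for both the action value and the edge belief are all hypothetical. A single surprising sample result adjusts the policy's value estimate and the agent's confidence in one causal edge in the same step.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackAgent:
    """Toy agent with a value estimate per action and a belief per causal edge."""
    lr: float = 0.5                                  # assumed step size for both updates
    q: dict = field(default_factory=dict)            # action -> estimated reward
    edge_belief: dict = field(default_factory=dict)  # (cause, effect) -> belief in [0, 1]

    def predict(self, action):
        return self.q.get(action, 0.0)

    def feedback(self, action, edge, reward, edge_supported):
        """Update the policy and the causal model from one embodied outcome."""
        surprise = reward - self.predict(action)     # prediction error
        # Policy update: move the action's value toward the observed reward.
        self.q[action] = self.predict(action) + self.lr * surprise
        # Causal update: move the edge belief toward 1 (supported) or 0 (refuted).
        prior = self.edge_belief.get(edge, 0.5)
        target = 1.0 if edge_supported else 0.0
        self.edge_belief[edge] = prior + self.lr * (target - prior)
        return surprise

agent = FeedbackAgent()
edge = ("aqueous_alteration", "hematite")
agent.q["sample_outcrop"] = 1.0      # the agent expects a high-value sample
agent.edge_belief[edge] = 0.8        # and is fairly sure water formed the hematite

# The collected sample turns out to be low value and contradicts the assumed cause.
surprise = agent.feedback("sample_outcrop", edge, reward=0.0, edge_supported=False)
print(surprise, agent.q["sample_outcrop"], agent.edge_belief[edge])
```

A real system would replace the single belief number with a proper structure-learning update and the table lookup with a learned policy. The shape of the loop stays the same: one observation feeds both models.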