Adaptive Geodesic Conformal Prediction for Egocentric Camera Pose Estimation
Abstract: Egocentric pose estimation for Augmented Reality (AR) and assistive devices requires not only accurate predictions but also uncertainty regions with coverage guarantees.
Conformal prediction (CP) provides such guarantees without retraining, but we show that standard CP with a single fixed threshold attains the nominal 90% overall coverage while covering only ~60% of the hardest 25% of frames (Q4): a gap of roughly 30 percentage points in conditional coverage, consistent across 12 participants, 3 predictors, and 3 prediction horizons (108 evaluations) on EPIC-Fields.
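The fixed-threshold baseline referred to here is the standard split-conformal recipe: calibrate a single quantile of nonconformity scores and apply it to every test frame. A minimal sketch, using synthetic scores rather than the paper's setup:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity
    scores, as in standard split conformal prediction."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

# Illustration with synthetic scores (not EPIC-Fields data):
rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=1000)
tau = conformal_threshold(cal_scores, alpha=0.1)

test_scores = rng.exponential(size=1000)
# Marginal coverage lands near the 0.90 target on average,
# but nothing constrains coverage on the hardest frames.
coverage = (test_scores <= tau).mean()
```

The guarantee this procedure gives is marginal over all frames, which is exactly why conditional coverage on the hard Q4 subset can fall far below the nominal level.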
We further show that a geodesic SE(3) nonconformity score identifies physically harder frames than Euclidean scoring: the geodesic and Euclidean Q4 sets overlap by only 15-26%, and geodesic Q4 frames exhibit 2-3x higher ground-truth camera displacement.
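A common way to build such a score combines the SO(3) geodesic angle of the relative rotation with the translation error; the weight `lam` below is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def geodesic_se3_score(R_pred, t_pred, R_gt, t_gt, lam=1.0):
    """Illustrative geodesic SE(3) nonconformity score: SO(3) geodesic
    angle of the relative rotation plus lam-weighted translation error.
    (lam is an assumed trade-off weight, not taken from the paper.)"""
    R_rel = R_pred.T @ R_gt
    # Rotation geodesic: angle of R_rel, via trace formula on SO(3)
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)  # radians, in [0, pi]
    t_err = np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))
    return theta + lam * t_err
```

With identity rotations the score reduces to the translation norm; a 90-degree rotation alone contributes pi/2, so rotation error is penalized on the manifold rather than through coordinate differences.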
To close the coverage gap, we propose DINOv2-Bridge adaptive CP: a two-stage difficulty estimator trained on a single source participant that transfers across participants without requiring any images at test time, improving Q4 coverage from ~0.75 to ~0.93 while maintaining overall coverage at the 90% target.
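One standard adaptive-CP recipe consistent with this description normalizes each score by a predicted difficulty before calibrating, so harder frames receive wider regions. The difficulty values below are synthetic stand-ins for what a learned estimator such as DINOv2-Bridge would output:

```python
import numpy as np

def adaptive_threshold(cal_scores, cal_difficulty, alpha=0.1):
    """Difficulty-normalized split conformal calibration: threshold the
    ratio score / difficulty instead of the raw score."""
    norm = np.asarray(cal_scores) / np.asarray(cal_difficulty)
    n = len(norm)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(norm, q, method="higher")

# Synthetic heteroscedastic scores: error scale tracks difficulty
rng = np.random.default_rng(1)
diff_cal = rng.uniform(0.5, 2.0, size=2000)
cal_scores = diff_cal * rng.exponential(size=2000)
tau = adaptive_threshold(cal_scores, diff_cal, alpha=0.1)

diff_test = rng.uniform(0.5, 2.0, size=2000)
test_scores = diff_test * rng.exponential(size=2000)
# Per-frame region radius tau * difficulty covers ~90% overall,
# and, unlike a fixed threshold, also adapts on hard frames.
coverage = (test_scores <= tau * diff_test).mean()
```

Because the threshold is set on normalized scores, the per-frame region radius `tau * difficulty` expands exactly where the estimator predicts large error, which is the mechanism by which Q4 coverage can be lifted without inflating easy frames.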