Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

Abstract: Recognizing physiological stress and emotion is important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of three deep learning architectures, Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformers, on the WESAD dataset for multimodal affect recognition from wrist and chest sensor signals.

We perform ablation studies that assess the contribution of each modality by training models on wrist-only and chest-only inputs. For the multimodal setting, we apply early fusion at the sensor level, concatenating wrist and chest signals before feeding them into each model. In addition, we implement a late-fusion ensemble strategy that combines the predictions of all three architectures trained on the multimodal input.
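
The two fusion stages can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `early_fusion` and `late_fusion`, the window shapes, and the probability values are hypothetical. The abstract does not state how the ensemble combines predictions, so the sketch assumes simple averaging of class probabilities (soft voting), and it assumes wrist and chest windows have already been resampled to a common length.

```python
import numpy as np


def early_fusion(wrist, chest):
    """Concatenate wrist and chest windows along the channel axis.

    Assumption: both arrays have shape (n_windows, n_timesteps, n_channels)
    and were already resampled to the same number of timesteps.
    """
    if wrist.shape[:2] != chest.shape[:2]:
        raise ValueError("wrist and chest must share window and time dimensions")
    return np.concatenate([wrist, chest], axis=-1)


def late_fusion(prob_list):
    """Average per-model class probabilities, then pick the argmax class.

    Assumption: soft voting with equal weights; the source does not specify
    the combination rule.
    """
    stacked = np.stack(prob_list, axis=0)   # (n_models, n_windows, n_classes)
    mean_probs = stacked.mean(axis=0)       # (n_windows, n_classes)
    return mean_probs.argmax(axis=1), mean_probs


# Illustrative random windows: 4 windows, 60 timesteps each.
rng = np.random.default_rng(0)
wrist = rng.normal(size=(4, 60, 6))   # hypothetical 6 wrist channels
chest = rng.normal(size=(4, 60, 8))   # hypothetical 8 chest channels
fused = early_fusion(wrist, chest)    # shape (4, 60, 14)

# Hypothetical softmax outputs of the three models for two windows.
lstm_p = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
tcn_p = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
transformer_p = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])
labels, probs = late_fusion([lstm_p, tcn_p, transformer_p])
```

In this toy case the second window shows why averaging can differ from any single model: the LSTM alone would pick class 1, while the averaged probabilities favor class 2.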

Our results show that, among the individual models, Transformers consistently achieve the highest accuracy in multimodal settings, while TCNs perform best in the wrist-only configuration. The late-fusion ensemble yields the highest overall accuracy (98.91 ± 0.13%) and macro-F1 score (98.56 ± 0.17%). These findings demonstrate the effectiveness of both sensor-level fusion and ensemble-based fusion for building robust physiological emotion recognition systems.
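
For readers less familiar with the reported metrics, the following sketch shows how accuracy and macro-F1 are commonly computed and summarized as mean ± standard deviation. The labels and per-run scores are made up for illustration and are not the paper's data. The abstract does not say what the ± is taken over, so the sketch assumes repeated runs and uses the sample standard deviation (`ddof=1`).

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted label matches the true label."""
    return float(np.mean(y_true == y_pred))


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores, F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions scores 0 here.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# Toy labels for a three-class problem (illustrative only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, n_classes=3)

# Hypothetical scores from three repeated runs, summarized as mean and std.
run_scores = np.array([0.90, 0.92, 0.94])
mean_score = float(run_scores.mean())
std_score = float(run_scores.std(ddof=1))
print(f"accuracy={acc:.4f}, macro-F1={f1:.4f}, runs={mean_score:.2f} ± {std_score:.2f}")
```

Because macro-F1 weights every class equally, it falls below accuracy whenever errors concentrate in one class, as with class 1 in the toy labels.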
