Abstract
Bodily behavior, including gestures and fine-grained movements, not only reflects human emotions but also serves as a versatile cue for enhancing emotional intelligence and creating responsive technologies. In this work, we explore the efficacy of multiview, multimodal cues for explainable prediction of bodily behavior. This paper proposes an attention fusion method that combines, via a transformer-based approach, features extracted from (1) multiview videos, termed “RGB”, (2) their multiview Discrete Cosine Transform representations, termed “DCT”, and (3) three-stream skeleton features, termed “Skeleton”. We evaluate our approach on the diverse BBSI [1] and Drive&Act [2] datasets. Empirical results confirm that the RGB, DCT, and Skeleton features enable the discovery of multiple class-specific behaviors, resulting in explainable predictions. Our key findings are: (a) multimodal approaches outperform their unimodal counterparts in categorizing bodily behavior classes; (b) both unimodal and multimodal approaches achieve efficient class predictions and plausible explanations; and (c) empirical results confirm the superiority of our approach over state-of-the-art methods on both datasets. Our implementation code is available at: https://github.com/surbhimadan92/MAGIC_TBR_Extended
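To make the fusion idea concrete, here is a minimal sketch of transformer-based attention fusion over three pre-extracted modality features (RGB, DCT, Skeleton). This is not the authors' implementation (that is available in the linked repository); the module names, feature dimension, class count, and mean-pooling choice are illustrative assumptions only.

```python
# Minimal sketch of cross-modal attention fusion (PyTorch).
# Assumes pre-extracted per-modality feature vectors of shape (batch, feat_dim);
# dimensions and the number of behavior classes below are assumptions, not the paper's values.
import torch
import torch.nn as nn


class AttentionFusionClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=14, num_heads=8, num_layers=2):
        super().__init__()
        # One learnable projection per modality so all streams share a common dimension.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(feat_dim, feat_dim) for name in ("rgb", "dct", "skeleton")}
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        # Transformer encoder lets the three modality tokens attend to each other.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, dct, skeleton):
        # Stack the projected modality embeddings as a 3-token sequence: (batch, 3, feat_dim).
        tokens = torch.stack(
            [self.proj["rgb"](rgb), self.proj["dct"](dct), self.proj["skeleton"](skeleton)],
            dim=1,
        )
        fused = self.encoder(tokens)   # cross-modal attention fusion
        pooled = fused.mean(dim=1)     # average-pool the fused modality tokens
        return self.classifier(pooled)  # behavior-class logits


# Example usage with random features for a batch of 4 clips.
model = AttentionFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 14])
```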
| Original language | English |
| --- | --- |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Affective Computing |
| DOIs | |
| Publication status | Published - 6 Mar 2025 |