An Investigation of Video Vision Transformers for Depression Severity Estimation from Facial Video Data

Ghazal Bargshady, Roland Goecke

Research output: A Conference proceeding or a Chapter in Book › Conference contribution › peer-review


Recognising depression from facial expressions and movements in video data using machine learning models has gained considerable attention in recent years, and researchers have explored various approaches for detecting depression-related patterns in facial video data. Recently, Video Vision Transformers have emerged as a powerful deep learning architecture for analysing sequential data such as video. While vision transformers have primarily gained attention in computer vision tasks involving images, their application to video analysis tasks, such as the recognition of depression or the estimation of depression severity from facial video data, is an active area of research. In this paper, two different vision transformer architectures are used to capture spatio-temporal facial information relevant to estimating the severity of depression and, thus, to provide valuable insights for depression analysis. The models are trained and evaluated on the AVEC2013 and AVEC2014 datasets. The results indicate that the fine-tuned vision transformers can outperform earlier deep learning models in visual depression analysis, achieving a Root Mean Square Error (RMSE) of 5.73 for the vision transformer and 5.39 for the video vision transformer.
Original language: English
Title of host publication: Image and Video Technology - 11th Pacific-Rim Symposium, PSIVT 2023, Proceedings
Subtitle of host publication: 11th Pacific-Rim Symposium, PSIVT 2023, Auckland, New Zealand, November 22–24, 2023, Proceedings
Editors: Wei Qi Yan, Minh Nguyen, Parma Nand, Xuejun Li
Place of Publication: Singapore
Number of pages: 10
ISBN (Electronic): 9789819703760
ISBN (Print): 9789819703753
Publication status: Published - 2024
Event: 11th Pacific-Rim Symposium on Image and Video Technology (PSIVT 2023) - Auckland University of Technology, Auckland, New Zealand
Duration: 22 Nov 2023 – 24 Nov 2023

Publication series

Name: Lecture Notes in Computer Science (LNCS)
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 11th Pacific-Rim Symposium on Image and Video Technology (PSIVT 2023)
Abbreviated title: PSIVT 2023
Country/Territory: New Zealand

