TY - GEN
T1 - A Robust Spatio-temporal Face Modelling Approach Using 3D Multimodal Fusion for Biometric Security Applications
AU - Chetty, Girija
AU - Wagner, Michael
PY - 2008
AB - In this paper, we propose a robust spatio-temporal face modelling approach based on a multilevel fusion strategy involving cascaded, hybrid multimodal fusion of audio-lip-face motion, correlation and depth features for biometric security applications. The proposed approach combines the information from different audio-video-based modules, namely the audio-lip motion module, the audio-lip correlation module and the 2D+3D motion-depth fusion module, and performs a hybrid cascaded fusion in an automatic, unsupervised and adaptive manner by adapting to the local performance of each module. This is done by taking the output-score-based reliability estimates (confidence measures) of each module into account. The module weightings are determined automatically such that the reliability measure of the combined scores is maximised. To test the robustness of the proposed approach, the audio and visual speech (mouth) modalities are degraded to emulate various levels of train/test mismatch, employing additive white Gaussian noise for the audio and JPEG compression for the video signals. The results show improved fusion performance over a range of tested levels of audio and video degradation, compared to the individual module performances. Experiments on the 3D stereovision database AVOZES show that, at severe levels of audio and video mismatch, the audio, mouth, 3D face and tri-module (audio-lip motion, correlation and depth) fusion EERs were 42.9%, 32%, 15% and 7.3% respectively for the biometric identity verification scenario.
DO - 10.1117/12.778631
M3 - Conference contribution
VL - 6944
T3 - Proceedings of SPIE
SP - 1
EP - 5
BT - Proceedings of the SPIE, Biometric Technology for Human Identification V
A2 - Kumar, B.V.K. Vijaya
A2 - Prabhakar, Salil
A2 - Ross, Arun
PB - SPIE
CY - Washington
T2 - SPIE Conference on Biometric Technology for Human Identification V
Y2 - 18 March 2008 through 19 March 2008
ER -