In this paper we propose a robust audio-visual speech-and-speaker recognition system with liveness checks based on audio-visual fusion of audio, lip motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For visual features, a new tensor-based representation of lip motion features, extracted from an intensity and depth subspace of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, allowed a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with recognition accuracies of 70-90% for the digit recognition task and EERs of 5% and 3% for the speaker verification and liveness check tasks respectively.
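The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below shows one simple way to estimate it from two lists of verification scores. It is an illustration only, not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means a more likely genuine claim, and the example scores are all assumptions made here.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by scanning every observed score as a threshold.

    Assumption: a trial is accepted when its score is >= the threshold,
    so higher scores favour the claimed (genuine) speaker.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Made-up scores for illustration; these are not results from the paper.
genuine_scores = [0.9, 0.8, 0.7, 0.6, 0.4]
impostor_scores = [0.5, 0.3, 0.2, 0.1, 0.05]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

With these toy scores, one impostor in five is accepted and one genuine trial in five is rejected at the selected threshold, so the two error rates meet. On real data the two rates rarely match exactly, which is why the sketch averages them at the closest point.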
|Title of host publication||Proceedings of Interspeech 2008 Conference Incorporating SST 2008|
|Editors||Janet Fletcher, Deborah Loakes, Roland Goecke, Denis Burnham, Michael Wagner|
|Place of Publication||Australia|
|Publisher||International Speech Communication Association|
|Number of pages||4|
|Publication status||Published - 2008|
|Event||Interspeech 2008 - Brisbane, Australia|
|Period||22/09/08 → 26/09/08|
Chetty, G., & Wagner, M. (2008). Audio-Visual Multilevel Fusion for Speech and Speaker Recognition. In J. Fletcher, D. Loakes, R. Goecke, D. Burnham, & M. Wagner (Eds.), Proceedings of Interspeech 2008 Conference Incorporating SST 2008 (pp. 379-382). International Speech Communication Association.