Abstract
In this paper we propose a robust audio-visual speech-and-speaker recognition system with liveness checks based on audio-visual fusion of audio, lip motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For visual features, a new tensor-based representation of lip motion features, extracted from an intensity and depth subspace of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, allowed a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with recognition accuracies of 70-90% for the digit recognition task and equal error rates (EERs) of 5% and 3% for the speaker verification and liveness check tasks, respectively.
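The EER figures quoted above refer to the operating point at which the false acceptance rate equals the false rejection rate. The abstract does not describe how the EERs were computed, so the following is only a minimal sketch of one common way to estimate an EER from verification scores. The function name `equal_error_rate` and the score lists are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of verification scores.

    Assumption: higher scores mean "more likely the claimed speaker",
    and a trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    best = None
    # Every observed score is a candidate decision threshold.
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Illustrative scores only (not data from the paper).
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finitely many trials the FAR and FRR curves rarely cross exactly, so this sketch reports the average of the two rates at the closest point; interpolating between neighbouring thresholds is an equally reasonable choice.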
Original language | English |
---|---|
Title of host publication | Proceedings of Interspeech 2008 Conference Incorporating SST 2008 |
Editors | Janet Fletcher, Deborah Loakes, Roland Goecke, Denis Burnham, Michael Wagner |
Place of Publication | Australia |
Publisher | International Speech Communication Association |
Pages | 379-382 |
Number of pages | 4 |
ISBN (Print) | 9781615673780 |
Publication status | Published - 2008 |
Event | Interspeech 2008 - Brisbane, Australia; Duration: 22 Sept 2008 → 26 Sept 2008 |
Conference
Conference | Interspeech 2008 |
---|---|
Country/Territory | Australia |
City | Brisbane |
Period | 22/09/08 → 26/09/08 |