Abstract
In this paper we propose a robust audio-visual speech-and-speaker recognition system with liveness checks based on audio-visual fusion of audio-lip motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For visual features, a new tensor-based representation of lip motion features, extracted from an intensity and depth subspace of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, yields a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with recognition accuracies of 70-90% for the digit recognition task and EERs of 5% and 3% for the speaker verification and liveness check tasks, respectively.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of Interspeech 2008 Conference Incorporating SST 2008 |
| Editors | Janet Fletcher, Deborah Loakes, Roland Goecke, Denis Burnham, Michael Wagner |
| Place of Publication | Australia |
| Publisher | International Speech Communication Association |
| Pages | 379-382 |
| Number of pages | 4 |
| ISBN (Print) | 9781615673780 |
| Publication status | Published - 2008 |
| Event | Interspeech 2008 - Brisbane, Australia. Duration: 22 Sept 2008 → 26 Sept 2008 |
Conference
| Conference | Interspeech 2008 |
|---|---|
| Country/Territory | Australia |
| City | Brisbane |
| Period | 22/09/08 → 26/09/08 |