Audio-Visual Multilevel Fusion for Speech and Speaker Recognition

Girija Chetty, Michael Wagner

Research output: A Conference proceeding or a Chapter in Book; Conference contribution; peer-review


In this paper we propose a robust audio-visual speech and speaker recognition system with liveness checks based on audio-visual fusion of audio-lip motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For visual features, a new tensor-based representation of lip motion features, extracted from an intensity and depth subspace of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, allowed a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with recognition accuracies of 70-90% for the digit recognition task and EERs of 5% and 3% for the speaker verification and liveness check tasks respectively.
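The abstract reports equal error rates (EERs) for the verification and liveness tasks. As a rough illustration of what that metric measures, the sketch below estimates an EER from two lists of match scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the score values are all assumptions made for this example.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from similarity scores (assumed: higher = more likely genuine).

    Sweeps every observed score as a decision threshold and returns the
    operating point where the false acceptance rate (FAR) and the false
    rejection rate (FRR) are closest, as (eer, threshold).
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine users rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Made-up scores, for illustration only (not data from the paper).
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.0%} at threshold {thr}")
```

With only a handful of scores the estimate is coarse; on a real evaluation set the FAR and FRR curves are usually interpolated around the crossing point.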
Original language: English
Title of host publication: Proceedings of Interspeech 2008 Conference Incorporating SST 2008
Editors: Janet Fletcher, Deborah Loakes, Roland Goecke, Denis Burnham, Michael Wagner
Place of Publication: Australia
Publisher: International Speech Communication Association
Number of pages: 4
ISBN (Print): 9781615673780
Publication status: Published - 2008
Event: Interspeech 2008 - Brisbane, Australia
Duration: 22 Sept 2008 to 26 Sept 2008


Conference: Interspeech 2008

