Audio-Visual Multilevel Fusion for Speech and Speaker Recognition

Girija Chetty, Michael Wagner

    Research output: Conference contribution (conference proceeding / chapter in book)

    Abstract

    In this paper we propose a robust audio-visual speech-and-speaker recognition system with liveness checks, based on audio-visual fusion of audio, lip-motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For the visual features, a new tensor-based representation of lip-motion features, extracted from the intensity and depth subspaces of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a support vector machine for speech (digit) recognition and then a Gaussian mixture model for speaker verification with liveness checks, allowed a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with recognition accuracies of 70-90% for the digit recognition task, and EERs of 5% and 3% for the speaker verification and liveness check tasks respectively.
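    The two-stage "multilevel fusion" paradigm in the abstract — an SVM for digit recognition followed by a per-speaker GMM for verification — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: synthetic 20-dimensional vectors stand in for the fused audio/lip-motion/depth features, scikit-learn's `SVC` and `GaussianMixture` stand in for the paper's classifiers, and the acceptance threshold is arbitrary.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    def fused_features(n, offset):
        """Stand-in for n fused audio + lip-motion/depth feature vectors."""
        return rng.normal(loc=offset, scale=1.0, size=(n, 20))

    # Stage 1: SVM digit classifier over the fused feature vectors
    # (two synthetic "digit" classes here for brevity).
    X = np.vstack([fused_features(50, 0.0), fused_features(50, 3.0)])
    y = np.array([0] * 50 + [1] * 50)
    digit_svm = SVC(kernel="rbf").fit(X, y)

    # Stage 2: per-speaker GMM; a claim is accepted when the average
    # log-likelihood of the claimant's features exceeds a threshold.
    speaker_gmm = GaussianMixture(n_components=2, random_state=0)
    speaker_gmm.fit(fused_features(200, 0.0))  # enrol "speaker A"

    genuine_score = speaker_gmm.score(fused_features(10, 0.0))
    impostor_score = speaker_gmm.score(fused_features(10, 3.0))
    THRESHOLD = -60.0  # illustrative; the paper tunes this to reach its EERs
    accept = genuine_score > THRESHOLD
    reject = impostor_score > THRESHOLD
    ```

    A genuine claimant, drawn from the enrolled distribution, scores far higher under the speaker's GMM than an impostor drawn from a shifted distribution, which is the basis of the verification decision.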
    Original language: English
    Title of host publication: Proceedings of Interspeech 2008 Conference Incorporating SST 2008
    Editors: Janet Fletcher, Deborah Loakes, Roland Goecke, Denis Burnham, Michael Wagner
    Place of publication: Australia
    Publisher: International Speech Communication Association
    Pages: 379-382
    Number of pages: 4
    ISBN (Print): 9781615673780
    Publication status: Published - 2008
    Event: Interspeech 2008 - Brisbane, Australia
    Duration: 22 Sep 2008 - 26 Sep 2008

    Conference

    Conference: Interspeech 2008
    Country: Australia
    City: Brisbane
    Period: 22/09/08 - 26/09/08


    Cite this

    Chetty, G., & Wagner, M. (2008). Audio-Visual Multilevel Fusion for Speech and Speaker Recognition. In J. Fletcher, D. Loakes, R. Goecke, D. Burnham, & M. Wagner (Eds.), Proceedings of Interspeech 2008 Conference Incorporating SST 2008 (pp. 379-382). Australia: International Speech Communication Association.


