3D Lip Tracking and Co-inertia Analysis for Improved Robustness of Audio-Video Automatic Speech Recognition

Roland Goecke

    Research output: A Conference proceeding or a Chapter in Book › Conference contribution › peer-review

    5 Citations (Scopus)

    Abstract

    Multimodality is a key issue in robust human-computer interaction. The joint use of audio and video speech variables has been shown to improve the performance of automatic speech recognition (ASR) systems. However, robust methods, in particular for the real-time extraction of video speech features, are still an open research area. This paper addresses the robustness of audio-video (AV) ASR systems by exploring a real-time 3D lip tracking algorithm based on stereo vision and by investigating how learned statistical relationships between the sets of audio and video speech variables can be employed in AV ASR systems. The 3D lip tracking algorithm combines colour information from each camera's images with knowledge about the structure of the mouth region for different degrees of mouth openness. By using a calibrated stereo camera system, 3D coordinates of facial features can be recovered, so that the visual speech variable measurements become independent of the head pose. Multivariate statistical analyses enable the analysis of relationships between sets of variables. Co-inertia analysis is a relatively new method and has not yet been widely used in auditory-visual speech processing (AVSP) research. Its advantage is its superior numerical stability compared to other multivariate methods when the sample size is small. Initial results are presented, which show how 3D video speech information and learned statistical relationships between audio and video speech variables can improve the performance of AV ASR systems.
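
The abstract names co-inertia analysis without describing it, so the following is a minimal sketch of the method in its simplest form, not the paper's implementation. It assumes uniform row weights and identity column metrics, in which case the co-inertia axes are the singular vectors of the cross-covariance matrix between the two variable sets. The function name `coinertia`, the variable names, and the synthetic "audio" and "video" data are all hypothetical. Because no covariance matrix is inverted, the computation stays well defined when there are fewer samples than variables, which is consistent with the stability advantage the abstract mentions.

```python
import numpy as np


def coinertia(X, Y, n_axes=2):
    """Co-inertia analysis in its simplest form (assumed here:
    uniform row weights, identity column metrics)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]

    # Centre each variable so that cross-products are covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Cross-covariance matrix between the two variable sets (p x q).
    C = Xc.T @ Yc / n

    # The singular vectors give pairs of axes whose projections
    # have maximal covariance; no matrix inversion is required.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U, s, V = U[:, :n_axes], s[:n_axes], Vt[:n_axes].T

    return {
        "audio_axes": U,                 # p x n_axes
        "video_axes": V,                 # q x n_axes
        "covariances": s,                # covariance along each axis pair
        "audio_scores": Xc @ U,          # samples projected on audio axes
        "video_scores": Yc @ V,          # samples projected on video axes
        "total_coinertia": float(np.sum(C ** 2)),
    }


# Synthetic stand-in data: 30 samples sharing one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
audio = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(30, 5))
video = latent @ rng.normal(size=(1, 3)) + 0.1 * rng.normal(size=(30, 3))

result = coinertia(audio, video)
```

In this sketch, `result["covariances"][0]` is the covariance between the first pair of score vectors, and `result["total_coinertia"]` is the sum of all squared cross-covariances between the two sets.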

    Original language: English
    Title of host publication: Proceedings of the Auditory-Visual Speech Processing Workshop
    Editors: Eric Vatikiotis-Bateson
    Place of Publication: Canada
    Publisher: ISCA
    Pages: 109-114
    Number of pages: 6
    Publication status: Published - 2005
    Event: AVSP 2005 - Vancouver, Canada
    Duration: 24 Jul 2005 to 27 Jul 2005

    Conference

    Conference: AVSP 2005
    Country/Territory: Canada
    City: Vancouver
    Period: 24/07/05 to 27/07/05
