Audio-Video Automatic Speech Recognition: An example of improved performance through multimodal sensor input

    Research output: A Conference proceeding or a Chapter in Book - Conference contribution

    Abstract

    One of the advantages of multimodal HCI technology is the performance improvement that can be gained over conventional single-modality technology by employing complementary sensors in different modalities. Such complementary information is particularly useful in practical, real-world applications where performance must be robust against all kinds of noise. An example is the domain of automatic speech recognition (ASR). Traditionally, ASR systems use information from the audio modality only, and their performance drops quickly in the presence of acoustic noise. However, it has been shown that incorporating additional visual speech information from the video modality improves performance significantly, so that audio-video (AV) ASR systems can be employed where audio-only ASR systems would fail, thus opening new application areas for ASR technology. In this paper, a non-intrusive (no artificial markers), real-time 3D lip tracking system is presented, as well as its application to AV ASR. The multivariate statistical technique of co-inertia analysis is also presented, which offers improved numerical stability over other multivariate analyses even for small sample sizes.
    Original language: English
    Title of host publication: Proceedings of the NICTA-HCSNet Multimodal User Interaction Workshop
    Editors: Fang Chen, Julian Epps
    Place of Publication: Australia
    Publisher: Australian Computer Society
    Pages: 25-32
    Number of pages: 8
    ISBN (Print): 1445-1336
    Publication status: Published - 2005
    Event: MMUI2005 - Sydney, Australia
    Duration: 13 Sep 2005 to 14 Sep 2005

    Conference

    Conference: MMUI2005
    Country: Australia
    City: Sydney
    Period: 13/09/05 to 14/09/05


  • Cite this

    Goecke, R. (2005). Audio-Video Automatic Speech Recognition: An example of improved performance through multimodal sensor input. In F. Chen, & J. Epps (Eds.), Proceedings of the NICTA-HCSNet Multimodal User Interaction Workshop (pp. 25-32). Australian Computer Society. https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV57Goecke.pdf