Audio-Video Automatic Speech Recognition: An example of improved performance through multimodal sensor input

    Research output: Conference contribution (conference proceeding or chapter in book)

    Abstract

    One of the advantages of multimodal HCI technology is the performance improvement that can be gained over conventional single-modality technology by employing complementary sensors in different modalities. Such information is particularly useful in practical, real-world applications where the application's performance must be robust against all kinds of noise. An example is the domain of automatic speech recognition (ASR). Traditionally, ASR systems use information only from the audio modality, and in the presence of acoustic noise their performance drops quickly. However, it has been shown that incorporating additional visual speech information from the video modality improves performance significantly, so that AV ASR systems can be employed in application areas where audio-only ASR systems would fail, thus opening new application areas for ASR technology. In this paper, a non-intrusive (no artificial markers), real-time 3D lip tracking system is presented, as well as its application to AV ASR. The multivariate statistical analysis 'co-inertia analysis' is also presented, which offers improved numerical stability over other multivariate analyses, even for small sample sizes.
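    The co-inertia analysis mentioned in the abstract pairs two feature sets measured on the same samples (here, audio and visual speech features) by finding axes that maximise the covariance between their projections, which amounts to an SVD of the cross-covariance matrix. The sketch below illustrates that core computation under that assumption; the random feature matrices and dimensions are purely illustrative, not the paper's actual data.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    audio = rng.standard_normal((n, 6))   # stand-in for audio features (e.g. MFCC-like)
    video = rng.standard_normal((n, 4))   # stand-in for visual lip features

    # Centre each modality's feature set
    Xa = audio - audio.mean(axis=0)
    Xv = video - video.mean(axis=0)

    # Cross-covariance matrix between the two modalities
    C = Xa.T @ Xv / n

    # SVD of the cross-covariance yields paired co-inertia axes;
    # the singular values measure the shared covariance captured by each pair
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Project each modality onto its first k co-inertia axes
    k = 2
    audio_scores = Xa @ U[:, :k]
    video_scores = Xv @ Vt[:k].T
    ```

    Because the solution only requires an SVD of a small cross-covariance matrix (no inversion of possibly ill-conditioned within-set covariance matrices, as in canonical correlation analysis), it remains numerically stable even when the sample size is small relative to the feature dimensions, which is the stability advantage the abstract refers to.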
    Original language: English
    Title of host publication: Proceedings of the NICTA-HCSNet Multimodal User Interaction Workshop
    Editors: Fang Chen, Julian Epps
    Place of publication: Australia
    Publisher: Australian Computer Society
    Pages: 25-32
    Number of pages: 8
    ISBN (Print): 1445-1336
    Publication status: Published - 2005
    Event: MMUI2005 - Sydney, Australia
    Duration: 13 Sep 2005 - 14 Sep 2005

    Conference

    Conference: MMUI2005
    Country: Australia
    City: Sydney
    Period: 13/09/05 - 14/09/05

    Fingerprint

    Speech recognition
    Sensors
    Information use
    Convergence of numerical methods
    Human computer interaction
    Acoustic noise

    Cite this

    Goecke, R. (2005). Audio-Video Automatic Speech Recognition: An example of improved performance through multimodal sensor input. In F. Chen, & J. Epps (Eds.), Proceedings of the NICTA-HCSNet Multimodal User Interaction Workshop (pp. 25-32). Australia: Australian Computer Society.