Noisy audio feature enhancement using audio-visual speech data

Roland Goecke, Gerasimos Potamianos, Chalapathy Neti

    Research output: A Conference proceeding or a Chapter in Book › Conference contribution

    22 Citations (Scopus)

    Abstract

    We investigate improving automatic speech recognition (ASR) in noisy conditions by enhancing noisy audio features using visual speech captured from the speaker's face. The enhancement is achieved by applying a linear filter to the concatenated vector of noisy audio and visual features, obtained by mean square error estimation of the clean audio features in a training stage. The performance of the enhanced audio features is evaluated on two ASR tasks: a connected digits task and speaker-independent, large-vocabulary, continuous speech recognition. In both cases, and at sufficiently low signal-to-noise ratios (SNRs), ASR trained on the enhanced audio features significantly outperforms ASR trained on the noisy audio, achieving, for example, a 46% relative reduction in word error rate on the digits task at -3.5 dB SNR. However, the method fails to capture the full visual modality benefit to ASR, as demonstrated by its comparison to discriminant audio-visual feature fusion introduced in previous work.
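
    As a rough illustration of the linear enhancement step described in the abstract, the sketch below (Python/NumPy, not taken from the paper; the function names, feature dimensions, and bias handling are assumptions) fits a single linear filter mapping concatenated noisy audio-visual feature vectors to clean audio features by least squares, which minimises the mean squared error over the training frames, and then applies that filter to new noisy features.

    import numpy as np

    def train_linear_enhancer(noisy_av_feats, clean_audio_feats):
        # noisy_av_feats:    (T, d_av) concatenated noisy audio + visual features
        # clean_audio_feats: (T, d_a)  time-aligned clean audio features (training targets)
        # Append a constant 1 to each frame so the linear filter includes a bias term.
        X = np.hstack([noisy_av_feats, np.ones((noisy_av_feats.shape[0], 1))])
        # Least-squares solution of X @ W ~= clean_audio_feats, i.e. the filter
        # minimising the mean squared error over the training frames.
        W, _, _, _ = np.linalg.lstsq(X, clean_audio_feats, rcond=None)
        return W

    def enhance_features(noisy_av_feats, W):
        # Apply the trained filter to new concatenated noisy audio-visual frames.
        X = np.hstack([noisy_av_feats, np.ones((noisy_av_feats.shape[0], 1))])
        return X @ W

    # Toy usage with random data standing in for real features (illustrative only).
    rng = np.random.default_rng(0)
    noisy_av = rng.standard_normal((1000, 101))   # e.g. 60 audio + 41 visual dimensions
    clean_a = rng.standard_normal((1000, 60))     # e.g. 60-dimensional clean audio features
    W = train_linear_enhancer(noisy_av, clean_a)
    enhanced = enhance_features(noisy_av, W)      # (1000, 60) enhanced audio features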
    Original language: English
    Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
    Publisher: IEEE
    Pages: 2025-2028
    Number of pages: 4
    Volume: 2
    ISBN (Print): 0-7803-7402-9
    DOI: https://doi.org/10.1109/ICASSP.2002.5745030
    Publication status: Published - 2002
    Event: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2002 - Salt Lake City, United States
    Duration: 12 May 2002 - 17 May 2002

    Publication series

    Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
    Volume: 2

    Conference

    Conference: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2002
    Abbreviated title: ICASSP 2002
    Country: United States
    City: Salt Lake City
    Period: 12/05/02 - 17/05/02


    Cite this

    Goecke, R., Potamianos, G., & Neti, C. (2002). Noisy audio feature enhancement using audio-visual speech data. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Vol. 2, pp. 2025-2028). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2). IEEE. https://doi.org/10.1109/ICASSP.2002.5745030