Noisy audio feature enhancement using audio-visual speech data

Roland Goecke, Gerasimos Potamianos, Chalapathy Neti

Research output: A Conference proceeding or a Chapter in Book › Conference contribution › peer-review

30 Citations (Scopus)
66 Downloads (Pure)

Abstract

We investigate improving automatic speech recognition (ASR) in noisy conditions by enhancing noisy audio features using visual speech captured from the speaker's face. The enhancement is achieved by applying a linear filter to the concatenated vector of noisy audio and visual features; the filter is obtained by mean square error estimation of the clean audio features in a training stage. The performance of the enhanced audio features is evaluated on two ASR tasks: a connected-digits task and speaker-independent, large-vocabulary, continuous speech recognition. In both cases, and at sufficiently low signal-to-noise ratios (SNRs), ASR trained on the enhanced audio features significantly outperforms ASR trained on the noisy audio, achieving, for example, a 46% relative reduction in word error rate on the digits task at -3.5 dB SNR. However, the method fails to capture the full benefit of the visual modality to ASR, as demonstrated by its comparison to the discriminant audio-visual feature fusion introduced in previous work.
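The enhancement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the added bias term, the feature dimensions, and the synthetic data are all assumptions made for illustration. It fits a linear map from the concatenated noisy-audio and visual features to the clean audio features by ordinary least squares, which minimizes the mean square error on the training frames.

```python
import numpy as np


def _design_matrix(noisy_audio, visual):
    """Concatenate the two feature streams frame by frame and append a bias column.

    The bias column is an assumption of this sketch; the abstract only
    mentions a linear filter.
    """
    joint = np.hstack([noisy_audio, visual])             # (T, Da + Dv)
    ones = np.ones((joint.shape[0], 1))
    return np.hstack([joint, ones])                      # (T, Da + Dv + 1)


def fit_linear_enhancer(noisy_audio, visual, clean_audio):
    """Return weights W minimizing ||design @ W - clean_audio||^2 over training frames."""
    design = _design_matrix(noisy_audio, visual)
    weights, *_ = np.linalg.lstsq(design, clean_audio, rcond=None)
    return weights                                       # (Da + Dv + 1, Da)


def enhance(weights, noisy_audio, visual):
    """Apply the trained linear filter to estimate clean audio features."""
    return _design_matrix(noisy_audio, visual) @ weights


# Synthetic stand-in data (hypothetical sizes: 4 audio dims, 3 visual dims).
rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 4))
visual = clean @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(1000, 3))
noisy = clean + rng.normal(size=(1000, 4))               # additive noise on audio only

weights = fit_linear_enhancer(noisy, visual, clean)
enhanced = enhance(weights, noisy, visual)

print("MSE of noisy features:   ", np.mean((noisy - clean) ** 2))
print("MSE of enhanced features:", np.mean((enhanced - clean) ** 2))
```

On the training frames the enhanced features cannot have a higher mean square error than the noisy ones, because passing the noisy audio through unchanged is one of the linear maps the least-squares fit can choose. How much this helps recognition is a separate question, which is what the ASR experiments in the paper evaluate.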
Original language: English
Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher: IEEE
Pages: 2025-2028
Number of pages: 4
Volume: 2
ISBN (Print): 0-7803-7402-9
DOIs
Publication status: Published - 2002
Event: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2002 - Salt Lake City, United States
Duration: 12 May 2002 → 17 May 2002

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2

Conference

Conference: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2002
Abbreviated title: ICASSP 2002
Country/Territory: United States
City: Salt Lake City
Period: 12/05/02 → 17/05/02
