TY - GEN
T1 - Noisy audio feature enhancement using audio-visual speech data
AU - Goecke, Roland
AU - Potamianos, Gerasimos
AU - Neti, Chalapathy
PY - 2002
Y1 - 2002
N2 - We investigate improving automatic speech recognition (ASR) in noisy conditions by enhancing noisy audio features using visual speech captured from the speaker's face. The enhancement is achieved by applying a linear filter to the concatenated vector of noisy audio and visual features, obtained by mean square error estimation of the clean audio features in a training stage. The performance of the enhanced audio features is evaluated on two ASR tasks: a connected digits task and speaker-independent, large-vocabulary, continuous speech recognition. In both cases, at sufficiently low signal-to-noise ratios (SNRs), ASR trained on the enhanced audio features significantly outperforms ASR trained on the noisy audio, achieving, for example, a 46% relative reduction in word error rate on the digits task at -3.5 dB SNR. However, the method fails to capture the full benefit of the visual modality to ASR, as demonstrated by its comparison to discriminant audio-visual feature fusion introduced in previous work.
AB - We investigate improving automatic speech recognition (ASR) in noisy conditions by enhancing noisy audio features using visual speech captured from the speaker's face. The enhancement is achieved by applying a linear filter to the concatenated vector of noisy audio and visual features, obtained by mean square error estimation of the clean audio features in a training stage. The performance of the enhanced audio features is evaluated on two ASR tasks: a connected digits task and speaker-independent, large-vocabulary, continuous speech recognition. In both cases, at sufficiently low signal-to-noise ratios (SNRs), ASR trained on the enhanced audio features significantly outperforms ASR trained on the noisy audio, achieving, for example, a 46% relative reduction in word error rate on the digits task at -3.5 dB SNR. However, the method fails to capture the full benefit of the visual modality to ASR, as demonstrated by its comparison to discriminant audio-visual feature fusion introduced in previous work.
KW - audio-visual speech
KW - noisy audio
KW - feature enhancement
U2 - 10.1109/ICASSP.2002.5745030
DO - 10.1109/ICASSP.2002.5745030
M3 - Conference contribution
SN - 0-7803-7402-9
VL - 2
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 2025
EP - 2028
BT - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
PB - IEEE
T2 - 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2002
Y2 - 12 May 2002 through 17 May 2002
ER -