TY - JOUR
T1 - StethoSpeech
T2 - Speech Generation Through a Clinical Stethoscope Attached to the Skin
AU - Shah, Neil
AU - Sahipjohn, Neha
AU - Tambrahalli, Vishal
AU - Subramanian, Ramanathan
AU - Gandhi, Vineet
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/9/9
Y1 - 2024/9/9
N2 - We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. This innovation is designed to improve social interactions for those with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech requires neither (a) paired speech data for recorded vibrations nor (b) a specialized device for recording vibrations, as it works with an off-the-shelf clinical stethoscope. The novelty of our framework lies in the overall design, the simulation of ground-truth speech, and a sequence-to-sequence translation network that operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and on our proposed StethoText, a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. We also demonstrate that it generalizes to speakers not encountered during training and remains effective in challenging, noisy environments.
AB - We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. This innovation is designed to improve social interactions for those with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech requires neither (a) paired speech data for recorded vibrations nor (b) a specialized device for recording vibrations, as it works with an off-the-shelf clinical stethoscope. The novelty of our framework lies in the overall design, the simulation of ground-truth speech, and a sequence-to-sequence translation network that operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and on our proposed StethoText, a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. We also demonstrate that it generalizes to speakers not encountered during training and remains effective in challenging, noisy environments.
KW - artificial learning
KW - HuBERT
KW - NAM-to-speech conversion
KW - self-supervised learning
KW - silent speech
KW - StethoSpeech
KW - zero-pair setting
UR - http://www.scopus.com/inward/record.url?scp=85203657938&partnerID=8YFLogxK
U2 - 10.1145/3678515
DO - 10.1145/3678515
M3 - Article
AN - SCOPUS:85203657938
SN - 2474-9567
VL - 8
SP - 1
EP - 21
JO - Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
JF - Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
IS - 3
ER -