Ultrasound-based Articulatory-to-Acoustic Mapping for Silent Speech Interface Applications
Abstract:
Articulatory-to-Acoustic Mapping methods can synthesize speech from articulatory recordings, i.e., from recordings of the movement of the speech organs, such as the tongue. The target application is the 'Silent Speech Interface' (SSI): the user articulates silently (mouths the words) while a device records the articulatory movement, and the computer generates audible speech from it. We experimented with four Hungarian speakers and tested 2D Ultrasound Tongue Imaging (UTI), which records the movement of the tongue at roughly 100 fps. Using UTI input, we trained several types of deep neural networks (feedforward, convolutional, recurrent, and autoencoder-based) to synthesize speech with a traditional vocoder. Next, we compared the results with a continuous vocoder, i.e., one with a continuous F0 model. In addition, we tested the applicability of the WaveGlow neural vocoder. We evaluated the synthesized sentences using objective measures (e.g., Mel-Cepstral Distortion) and subjective listening tests. We found that the combination of the neural vocoder with recurrent (LSTM) or convolutional (3D-CNN) networks achieves the best results. Such an SSI system can be useful for the speech impaired (e.g., after laryngectomy) and for scenarios where audible speech is not feasible but information still has to be transmitted by the speaker (e.g., extremely noisy environments).
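
To make the pipeline concrete, the sketch below (in PyTorch; not the exact architecture or hyperparameters used in this work) illustrates the articulatory-to-acoustic mapping step: a small 3D-CNN over short ultrasound frame sequences followed by an LSTM regresses per-frame vocoder parameters, and a helper computes the standard Mel-Cepstral Distortion between reference and synthesized mel-cepstral frames. The input resolution, layer sizes, and the 25-coefficient target dimensionality are illustrative assumptions.

# Minimal sketch of ultrasound-to-vocoder-parameter mapping (assumed shapes/sizes).
import numpy as np
import torch
import torch.nn as nn

class UTIToVocoderParams(nn.Module):
    def __init__(self, n_targets: int = 25):
        super().__init__()
        # 3D convolutions over (time, height, width) of the ultrasound video
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space to 4x4
        )
        self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=256, batch_first=True)
        self.out = nn.Linear(256, n_targets)

    def forward(self, x):
        # x: (batch, 1, frames, height, width) ultrasound sequence
        f = self.cnn(x)                          # (batch, 32, frames, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)  # (batch, frames, 32*4*4)
        h, _ = self.lstm(f)                      # (batch, frames, 256)
        return self.out(h)                       # per-frame vocoder parameters

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """Frame-averaged Mel-Cepstral Distortion (dB) between reference and
    synthesized mel-cepstral frames of shape (frames, coeffs); the 0th
    (energy) coefficient is conventionally excluded."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Example: 16 ultrasound frames of 64x128 pixels -> 16 frames of 25 coefficients
model = UTIToVocoderParams()
dummy = torch.randn(2, 1, 16, 64, 128)
print(model(dummy).shape)  # torch.Size([2, 16, 25])

In a full system, the predicted per-frame parameters would then be passed to the chosen vocoder (traditional, continuous-F0, or a neural vocoder such as WaveGlow) to generate the waveform.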