In general, speech signal processing involves processing electrical and/or electronic signals for recognition or synthesis of speech. Speech synthesis includes production of speech from text, and text-to-speech (TTS) systems provide an alternative to conventional computer-to-human visual output devices like computer monitors or displays. Conversely, speech recognition includes translation of speech into text, and automatic speech recognition (ASR) systems provide an alternative to conventional human-to-computer tactile input devices such as keyboards or keypads.
TTS and ASR technologies may be combined to provide a user with hands-free audible interaction with a system. For example, a telematics system in a vehicle may receive text messages, e-mails, tweets, or the like, and use TTS technology to present them in audible form for a driver. The system may then receive a verbal response from the driver and use ASR technology to convert that response either to machine-readable form for carrying out vehicle control, or to textual form for reply as a text message, e-mail, tweet, or the like.
One problem encountered with TTS technology, however, is that synthesized speech can sound undesirably artificial, unlike a natural human voice. For example, TTS-synthesized speech can have poor prosodic characteristics, such as intonation, pronunciation, stress, articulation rate, tone, and naturalness. Poor prosody can confuse or disappoint a TTS user and, thus, result in incomplete interaction with the user. To improve TTS quality, one solution includes collection and use of significantly more recorded voice data, and another solution includes development of more sophisticated TTS processing algorithms. Both solutions, however, are time-consuming and costly.