1. Field of the Invention
The present invention relates to synthesizing audible speech from textual content. More specifically, the invention relates to the training and application of prosody models for speech synthesis.
2. Description of the Related Art
Text-to-speech (TTS) synthesis systems generate audible speech from text. These systems typically attempt to generate natural-sounding speech, that is, speech that sounds as if a person had uttered it. For high-quality synthesized speech, natural, variable prosody is important. Without it, speech, especially speech of long duration, sounds flat and artificial. Furthermore, a single style of prosody will not provide the variability that is common in human speech. Different circumstances often suggest different prosody. For example, a newscaster reporting breaking news will typically speak with different prosodic characteristics than a speaker at a corporate meeting. However, typical TTS systems do not provide for rich prosody styles. As a result, the speech these systems generate lacks the additional natural variation, emphasis, and color that richer prosody could provide.