Conventional text-to-speech (TTS) techniques use a single voice font. This voice font is trained with a recording corpus obtained from one voice talent. The resulting voice font strongly corresponds to the prosody and characteristics used by the voice talent when recording the corpus. Accordingly, when being recorded, the voice talent must use the same style and emotion that are desired in the TTS voice.
As the use of TTS becomes more prevalent, the flexibility of the TTS voice becomes increasingly important in various application scenarios. For example, an interactive application utilizing TTS to communicate with the user should provide the user with the ability to select from multiple voice personalities that are able to express rich emotion types and speaking styles. As TTS applications become more conversational and personal, the ability of the TTS application to adapt the speech style and/or the emotion of the speech of a single voice to match the conversational content is also desirable.
Obtaining recordings that cover a variety of emotions and styles is costly even for a single voice. Obtaining the desired variety of recordings for multiple voices is not only costly but impracticable. Attempts to transplant an emotion or speaking style from one recording/voice font to other voice fonts using conventional voice adaptation techniques have resulted in poor-quality voice fonts that fail to convey the desired emotion and/or style, and have highlighted the close relationship between the original recording and the emotion and/or style used by the voice talent. It is with respect to these and other considerations that the present invention has been made. Although relatively specific problems have been discussed, it should be understood that the embodiments disclosed herein should not be limited to solving the specific problems identified in the background.