Field
The disclosure relates to techniques for text-to-speech conversion with emotional content.
Background
Computer speech synthesis is an increasingly common human interface feature found in modern computing devices. In many applications, the emotional impression conveyed by the synthesized speech is important to the overall user experience. The perceived emotional content of speech may be affected by such factors as the rhythm and prosody of the synthesized speech.
Text-to-speech techniques commonly ignore the emotional content of synthesized speech altogether, generating only emotionally “neutral” renditions of a given script. Alternatively, text-to-speech techniques may utilize a separate voice model for each emotion type, which incurs relatively high costs for storing in memory the separate voice models corresponding to the many emotion types. Such techniques are also inflexible when it comes to generating speech with emotional content for which no voice model is readily available.
Accordingly, it would be desirable to provide novel and efficient techniques for text-to-speech conversion with emotional content.