A text-to-speech engine is a software program that generates speech from inputted text. A text-to-speech engine may be useful in applications that use synthesized speech, such as a wireless communication device that reads incoming text messages, a global positioning system (GPS) that provides voice directional guidance, or other portable electronic devices that present information as audio speech.
Many text-to-speech engines use Hidden Markov Model (HMM) based text-to-speech synthesis. A variety of contextual factors may affect the quality of synthesized of human speech. For instance, parameters such as spectrum, pitch and duration may interact with one another during speech synthesis. Thus, important contextual factors for speech synthesis may include, but are not limited to, phone identity, stress, accent, position. In HMM-based speech synthesis, the label of the HMMs may be composed of a combination of these contextual factors. Moreover, conventional HMM-based speech synthesis also uses a universal Maximum Likelihood (ML) criterion during both training and synthesis. The ML criterion is capable of estimating statistical parameters of the HMMs. The ML criterion may also impose a static-dynamic parameter constraint during speech synthesis, which may help to generate a smooth parametric trajectory that yields highly intelligible speech.
However, speech synthesized using conventional HMM-based approaches may be overly smooth, as ML parameter estimation after decision tree-based tying usually leads to highly averaged HMM parameters. Thus, speech synthesized using the conventional HMM-based approaches may become blurred and muffled. In other words, the quality of the synthesized speech may be degraded.