Prosody prediction in a text-to-speech (TTS) system has a great influence on the naturalness of the synthesized speech. Current TTS systems adopt either a corpus-based (optimal unit selection) approach or an HMM-based statistical approach. In general, the HMM-based approach achieves more consistent results than the corpus-based one. Moreover, speech models trained using HMMs are usually small in size, e.g. 3 MB. With these advantages over the corpus-based approach, the HMM-based approach has recently become popular. Nevertheless, this approach suffers from an over-smoothing problem in the generation of prosody. Some documents have disclosed a global variance method to ameliorate the problem. They indeed obtained positive results; however, this method shows no auditory preference if only the fundamental frequency (F0) is considered without prosody or spectrum.
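The over-smoothing effect and the idea of restoring variance can be illustrated with a minimal sketch. It is not the global variance method of the cited documents: the moving average merely stands in for statistical over-smoothing, the rescaling is a simple post-hoc variance match, and the F0 values and the function names `moving_average` and `rescale_variance` are assumptions made for illustration.

```python
from statistics import mean, pvariance


def moving_average(contour, width=3):
    """Smooth a contour with a centered moving average.

    Used here only as a stand-in for the over-smoothing of a statistical model.
    """
    half = width // 2
    smoothed = []
    for i in range(len(contour)):
        window = contour[max(0, i - half): i + half + 1]
        smoothed.append(mean(window))
    return smoothed


def rescale_variance(contour, target_variance):
    """Scale deviations from the mean so the contour reaches the target variance."""
    mu = mean(contour)
    var = pvariance(contour)
    if var == 0:
        return list(contour)  # a flat contour cannot be rescaled
    gain = (target_variance / var) ** 0.5
    return [mu + gain * (x - mu) for x in contour]


# Made-up F0 values in Hz, for illustration only.
natural_f0 = [180.0, 220.0, 170.0, 240.0, 160.0, 230.0, 175.0]
smoothed_f0 = moving_average(natural_f0)
restored_f0 = rescale_variance(smoothed_f0, pvariance(natural_f0))
```

The smoothed contour has a much smaller variance than the natural one, and the rescaled contour recovers the natural variance while keeping the mean of the smoothed contour. Matching variance alone does not recover the original contour shape.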
Recent documents have disclosed some methods to enhance the expressive capability of TTS. These methods usually require considerable effort to collect corpora in various speaking styles. In addition, they need many post-processing tasks, e.g. phonetic labeling and segmentation checking. In other words, the construction of a prosody-rich TTS system is quite time-consuming. As a consequence, some documents have proposed providing TTS systems with diverse prosody information via additional tools. For example, a tool-based system could provide users with a plurality of ways to modify prosody, e.g. a GUI with which users adjust the pitch contour and re-synthesize speech according to the new pitch information, or a markup language used to alter the prosody. However, most people do not know how to revise pitch contours correctly through a GUI tool. Similarly, few people are familiar with the usage of XML tags. Therefore, such tool-based systems are inconvenient to use in practice.
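To make the markup-based option concrete, the following sketch builds a small XML fragment that requests a prosody change. The text above does not name a specific markup language; the `speak` and `prosody` elements with `pitch` and `rate` attributes are assumed here in the style of W3C SSML, and the helper name `wrap_with_prosody` is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_with_prosody(text, pitch="+10%", rate="slow"):
    """Return an SSML-style string asking a synthesizer to alter pitch and rate.

    Element and attribute names follow SSML conventions (an assumption).
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "rate": rate})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


markup = wrap_with_prosody("Hello there.")
```

Even for this short sentence the user has to know the element names and the allowed attribute values, which illustrates why such tags are unfamiliar to most people.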
Several patents regarding TTS have also been published. They cover, for instance, monitoring TTS output quality to effect control of barge-in, controlling reading speed in a TTS system, a Mandarin prosody transformation system, concatenation-based Mandarin TTS with prosody control, and a TTS prosody prediction method and speech synthesis system.
For example, FIG. 1 shows a Mandarin prosody transformation system 100, which uses a prosody analysis unit 130 to receive a source speech and the corresponding text. Prosody information can be extracted by the prosody analysis unit, which is composed of a hierarchical decomposition module 131, a prosody transformation function selection module 132 and a prosody transformation module 133. Finally, the prosody information is sent to the speech synthesis module 150 to generate the synthesized speech.
FIG. 2 shows a speech synthesis system and method, disclosed in a document describing a TTS system with foreign language capabilities. The system first analyzes input text data 200 by applying a language analysis module 204 to obtain language information 204a. Next, the language information is passed to a prosody prediction module 209 to generate the prosody information 209a. Then a speech-unit selection module 208 selects a sequence of speech segments that best matches the linguistic and prosody information. Finally, a speech synthesis module 210 is used to synthesize speech 211.
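The data flow of FIG. 2 can be sketched as four chained stages. Every function, the `Unit` type, the cost weighting and the inventory values below are hypothetical placeholders rather than the disclosed modules: the analysis treats each letter as one phone, the prosody predictor emits a declining F0 with a fixed duration, and the selection uses only a simple target cost (a practical unit-selection system would typically also weigh how well adjacent units join).

```python
from dataclasses import dataclass


@dataclass
class Unit:
    phone: str
    f0: float        # Hz
    duration: float  # seconds


def analyze_language(text):
    """Toy language analysis: each alphabetic character becomes one phone."""
    return [ch for ch in text.lower() if ch.isalpha()]


def predict_prosody(phones):
    """Toy prosody prediction: linearly declining F0, fixed duration."""
    return [(200.0 - 10.0 * i, 0.1) for i in range(len(phones))]


def select_units(phones, prosody, inventory):
    """Pick, per phone, the inventory unit closest to the predicted prosody."""
    selected = []
    for phone, (f0, duration) in zip(phones, prosody):
        candidates = [u for u in inventory if u.phone == phone]
        if not candidates:
            raise LookupError(f"no unit for phone {phone!r}")
        # Assumed target cost: F0 distance plus weighted duration distance.
        best = min(
            candidates,
            key=lambda u: abs(u.f0 - f0) + 100.0 * abs(u.duration - duration),
        )
        selected.append(best)
    return selected


def synthesize(units):
    """Placeholder synthesis: describe the unit sequence instead of audio."""
    return [(u.phone, u.f0, u.duration) for u in units]


# Made-up inventory for illustration.
inventory = [
    Unit("h", 210.0, 0.10),
    Unit("h", 150.0, 0.10),
    Unit("i", 185.0, 0.12),
    Unit("i", 120.0, 0.10),
]
phones = analyze_language("Hi")
prosody = predict_prosody(phones)
speech = synthesize(select_units(phones, prosody, inventory))
```

For the input "Hi", the selection picks the "h" unit at 210 Hz and the "i" unit at 185 Hz, since these lie closest to the predicted targets of 200 Hz and 190 Hz.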