When a plurality of Chinese characters form words or phrases that are pronounced consecutively, they affect one another and form relatively separate and complete prosodic blocks, whose prosodic characteristics play a very important role in the naturalness of the speech. Different combinations of prosodic blocks usually form different tunes, which give a person's pronunciation different tones. Generally speaking, the main prosodic units in Chinese speech include prosodic words, prosodic phrases and intonational phrases. The prosody of the Chinese language has a layered structure, and this layered prosodic structure forms the rhythm (prosody) of Chinese speech. The boundary of a prosodic unit usually corresponds to a stop, a change in fundamental frequency or a change in the audio duration of the syllable at the prosodic boundary. Prosody is an important factor affecting the naturalness and intelligibility of synthesized speech. In a speech synthesis system, the prosodic structure provides very important information to the prosodic parameter prediction model, which controls the mode of pronunciation of the system by predicting such parameters as the fundamental frequency, the audio duration (duration) and the stop, so as to achieve the corresponding prosodic effect for the prosodic units at each level in the synthesized speech and thereby render the pronunciation natural and melodious.
With the ever deeper development of linguistic processing, people need not only to learn more about the prosodic structure of natural speech, but also to find a method for predicting the prosodic structure from text, so as to enhance the naturalness of synthesized speech or the accuracy of speech recognition more effectively, and at the same time to deepen the understanding of natural languages.
The prosodic word denotes a group of syllables that are pronounced consecutively in an audio stream; the pronunciations of these syllables are very closely related, and no stop is perceived between them. The prosodic word is the lowest-level element in the layered structure of the prosody, and there is usually a perceptible stop at the boundary of the prosodic word. In other words, there is no perceptible stop inside the prosodic word; a stop appears only at its boundary. Not all prosodic word boundaries have stops in actual speech. A perceptible stop at the boundary of a prosodic word is acceptable, but any perceptible stop inside a prosodic word renders the speech either hard to understand or unnatural. Consequently, a good prosodic word forming module is of great significance for enhancing the naturalness of the synthesized speech.
There have been many published dissertations and patents in the prior art, such as those presented below, relating to the studies on the prosodic word forming module and the enhancement of the naturalness of the synthesized speech:
U.S. Pat. No. 6,996,529 (Minnis; Stephen; Feb. 7, 2006, Speech synthesis with prosodic phrase boundary information);
U.S. Pat. No. 6,173,262 (Hirschberg; Julia; Jan. 9, 2001, Text-to-speech system with automatically trained phrasing rules);
U.S. Pat. No. 6,003,005 (Hirschberg; Julia; Dec. 14, 1999, Text-to-speech system and a method and apparatus for training the same based upon intonational feature annotations of input text);
U.S. Pat. No. 5,850,629 (Holm; Frode; Pearson; Steve; Dec. 15, 1998, User interface controller for text-to-speech synthesizer);
U.S. Pat. No. 6,978,239 (Chu; Min; Peng; Hu; Dec. 20, 2005, Method and apparatus for speech synthesis without prosody modification);
Document, Shih, C. L., “The Prosodic Domain of Tone Sandhi in Mandarin Chinese”, PhD Dissertation, UC San Diego, 1986;
Document, Chu M. and Qian Y., “Locating boundaries for prosodic constituents in unrestricted Mandarin texts”, Journal of Computational Linguistics and Chinese Language Processing, 6(1), 61-82, 2001;
Document, Dong H., Tao J. and Xu B., “Prosodic word prediction using the lexical information”, International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, 2005;
Document, Shao Y., Han, J., Liu T. and Zhao Y., “Prosodic word boundaries prediction for Mandarin text-to-speech”, International Symposium on Tonal Aspects of Languages with Emphasis on Tone Languages, 159-162, Beijing, 2004;
Document, Dong M., Lua K. T. and Li H., “A probabilistic approach to prosodic word prediction for Mandarin Chinese TTS”, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 2005;
Document, Qin Shi and XiJun Ma, “Statistic prosody structure prediction”, International Conference of the IEEE 2002 Workshop on Speech Synthesis, Santa Monica, Calif., 2002; and
Document, Ying, Z., and Shi, X., “An RNN-based algorithm to detect prosodic phrase for Chinese TTS”, International Conference on Acoustic, Speech and Signal Processing, 2001.
The contents of these patents and documents are incorporated herein as prior art documents of the present application for invention.
In general, a Chinese speech synthesis system consists of three modules, namely a text analyzing module, a prosody parameter predicting module and a backend synthesizing module. The Chinese text analyzing module performs word segmentation, part of speech annotation, phonetic notation, prosodic structure prediction, etc. The first step is word segmentation, because, unlike texts in other languages such as English, a Chinese text has no spaces to separate its words. Word segmentation is generally based on part of speech analysis, so its result reflects a certain syntactic structure but differs slightly from the prosodic structure. The purpose of prosodic structure prediction is to find an effective method of mapping the contents of the text to a prosodic structure, so as to construct a prediction model from the text to the prosodic characteristics (such as the stop and the tune) that guides the subsequent generation of prosody parameters.
Many studies show that prosodic words differ greatly from lexical words. One reason is that the forming of prosodic words is based not only on the meanings of the words but also on the prosodic requirements of the speech. A prosodic word can contain more than one lexical word, and can also be a part of a relatively long lexical word. The word segmentation module and the part of speech annotating module perform word segmentation and the corresponding part of speech annotation on the natural language text based on lexical knowledge.
The following sample sentence describes two processing steps of the text analyzing module, namely word segmentation/part of speech annotation and prosodic structure prediction. As shown in FIG. 1:
A text is input as:  (once at an extramural activity in which we and the pupils of other schools climbed the Fragrance Hill, no one of us lagged behind, as all climbed to the hilltop by leaps and bounds)”.
The words are divided and the parts of speech are annotated as: /v -/m /q , /w /r /p /f /Ng /v /v /v /ns, /w /r /u   -/m /q /v /u, /w /o /d /v /v /u /n  /w”.
The prosodic structure is as: /v -/m /q|∥/r /c/f /Ng∥/v /v|/v /ns|∥/r /u|/n∥ /v -/m /q|/v /u|∥/o∥ /d /v /v /u| /n|∥”.
The “|” indicates the boundary of a prosodic word, the “∥” indicates the boundary of a prosodic phrase, and the “|∥” indicates the boundary of an intonational phrase. A boundary of a prosodic phrase or of an intonational phrase is necessarily also a boundary of a prosodic word. The task of the prosodic word forming module is to determine the boundaries of the prosodic words on the basis of the word segmentation and the part of speech annotation. In addition, prosodic word forming is also the foundation for the prediction of prosodic units of higher levels, such as the prediction of prosodic phrases. Consequently, the quality of the prosodic word forming is of very great significance to the naturalness of the synthesized speech.
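By way of illustration only, the boundary notation described above may be read programmatically as in the following minimal sketch. The function name parse_prosodic_structure, the level labels and the placeholder tokens (w1/v and so on) are assumptions introduced for this example; they are not part of the sample sentence of FIG. 1. Every marker closes a prosodic word, which reflects the statement that a boundary of a higher-level unit is also a boundary of a prosodic word.

```python
import re

# Boundary markers as defined in the text. The longest marker ("|∥") is
# listed first in the pattern so it is not mistaken for a bare "|".
LEVELS = {
    "|∥": "intonational_phrase",
    "∥": "prosodic_phrase",
    "|": "prosodic_word",
}
MARKER_RE = re.compile(r"(\|∥|∥|\|)")


def parse_prosodic_structure(annotated):
    """Split an annotated string into prosodic words.

    Returns a list of (tokens, level) pairs, where `tokens` are the
    word/POS items of one prosodic word and `level` is the highest
    boundary that closes it (None if the text ends without a marker).
    """
    units = []
    # With a capturing group, re.split yields: text, marker, text, marker, ...
    parts = MARKER_RE.split(annotated)
    for i in range(0, len(parts), 2):
        tokens = parts[i].split()
        if not tokens:
            continue  # empty text after a trailing marker
        marker = parts[i + 1] if i + 1 < len(parts) else None
        units.append((tokens, LEVELS.get(marker)))
    return units


# Hypothetical placeholder tokens, not the sentence of FIG. 1.
sample = "w1/v w2/m w3/q|∥w4/r w5/p∥w6/v|w7/ns|∥"
units = parse_prosodic_structure(sample)
```

In this sketch each returned pair corresponds to one prosodic word, so the number of pairs equals the number of prosodic words in the annotated string.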
Several methods have been proposed in the prior art for predicting the boundaries of Chinese prosodic words, such as the Classification and Regression Tree (CART) method, the rule-driven approach, the statistical approach and the recurrent neural network (RNN) method. Part of speech (POS) and word length information are widely employed in these methods.
Generally speaking, the prediction of prosodic word boundaries in the state of the art cannot be said to be very precise. Boundary prediction errors are usually classified into two types: the insertion error, in which a boundary is predicted where no prosodic word boundary exists, and the deletion error, in which an actual prosodic word boundary is not predicted. As discussed above, not all prosodic word boundaries have stops in actual speech. A perceptible stop at the boundary of a prosodic word is acceptable, but any perceptible stop inside a prosodic word renders the speech either hard to understand or unnatural. Therefore, insertion errors produced by the prosodic word forming module do great harm to the synthesized speech, whereas deletion errors do far less harm. For instance, the word segmentation result of the last portion of the aforementioned sample sentence,  (climbed to . . . by leaps and bounds)”, is  (as shown in FIG. 1), in which the words , ,  and  are all single-character words. They should be combined to form a complete prosodic word,  (climbed to . . . )”. If they are not combined at the level of the prosodic word, this section of the synthesized speech will sound very unnatural: the words are perceived as if they were pronounced one by one, with stops between them. This is so because the prosody predicting model (fundamental frequency prediction and audio duration prediction) is very sensitive to whether the current syllable is at the boundary of a prosodic word or inside it. Conversely, if  is taken as a prosodic word, its fundamental frequency curve will sound very natural, since the fundamental frequency predicting model takes more concerted pronunciation into consideration.
Additionally, the audio duration model does not protract the audio durations of the first three syllables , , and , because the boundaries of all three syllables are then of the type internal to the prosodic word.
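The two error types discussed above can be made concrete with the following minimal sketch, which compares predicted boundaries against reference boundaries. The function name boundary_errors and the representation of a boundary as the index of the word after which it falls are assumptions made for this illustration; the prior art methods cited above are not necessarily evaluated in this way.

```python
def boundary_errors(reference, predicted):
    """Compare predicted prosodic word boundaries with reference ones.

    Each boundary is given as the index of the word after which it falls
    (an assumed representation). Returns (insertions, deletions):
    insertions are predicted boundaries absent from the reference,
    deletions are reference boundaries that were not predicted.
    """
    ref, pred = set(reference), set(predicted)
    insertions = sorted(pred - ref)
    deletions = sorted(ref - pred)
    return insertions, deletions


# Hypothetical case: four single-character words (indices 0-3) form one
# prosodic word, followed by a noun (index 4) that forms another.
reference = [3, 4]

# Treating every segmented word as its own prosodic word places
# boundaries inside the first prosodic word (insertion errors).
word_by_word = [0, 1, 2, 3, 4]
ins_a, del_a = boundary_errors(reference, word_by_word)

# Merging all five words into one unit misses the boundary after
# index 3 (a deletion error), which the text regards as less harmful.
merged = [4]
ins_b, del_b = boundary_errors(reference, merged)
```

Because the text treats insertion errors as far more harmful than deletion errors, an evaluation built on such counts might reasonably weight the two lists differently; the choice of weights is left open here.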