In general, a text-to-speech (TTS) system has three key modules: text analysis, prosodic modeling, and speech synthesis. One of the important stages in turning unmarked text into speech is the assignment of appropriate phrase break boundaries. These boundaries are important to later modules, including accent assignment, duration control, and pause insertion. A number of algorithms have been proposed for this task, ranging from the simple to the complex. They require different kinds of information, such as part-of-speech (POS) tags, syntax, and even semantic understanding of the text. These requirements come at different costs, so the difficulty of obtaining particular input features must be traded off against the accuracy of the model.
Some languages, such as Chinese and Japanese, do not put spaces between words. The first step of text analysis for such languages is therefore word segmentation. Because syntactic parsing is difficult for these languages, most conventional TTS systems only segment the text into words during text analysis, and the resulting words, owing to their intrinsic properties, average about 1.6 syllables in length. If no higher-level linguistic information is available, such as prosodic words, prosodic phrases, and intonational phrases, a small pause is then inserted after every word, that is, about every 1.6 syllables on average, during speech synthesis. As a result, the speech is not fluent enough. Native speakers tend to group words into phrases whose boundaries are marked by duration and intonational cues. Many phonological rules are constrained to operate only within such phrases, usually termed prosodic phrases. Prosodic phrases help the TTS system produce more fluent speech, and the prosodic structure of the sentence also helps improve the intelligibility and naturalness of the speech. Placing phrase boundaries correctly is therefore very important for a natural-sounding TTS system. Once the correct prosodic phrases have been detected from the text, a high-quality prosodic model can be created and the acoustic parameters, which include pitch, energy, and duration, can be provided for speech synthesis.
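As an illustration of the word segmentation step described above, the following is a minimal sketch of greedy forward maximum matching, one common dictionary-based technique. It is not taken from any particular TTS system: the function name `segment`, the `max_len` parameter, and the toy lexicon are assumptions made for this example, and each Chinese character is counted as one syllable.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative only).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would use a large dictionary.
LEXICON = {"我们", "喜欢", "语音", "合成", "技术"}

words = segment("我们喜欢语音合成的技术", LEXICON)

# Each character is treated as one syllable for this estimate.
average_syllables = sum(len(w) for w in words) / len(words)

print(words)
print(round(average_syllables, 2))
```

With no higher-level prosodic grouping, a synthesizer that pauses after each of these short words would pause six times in an eleven-syllable sentence, which is the fluency problem described above.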
Many methods have been introduced to extract prosodic phrase boundaries from English text, such as statistical models, classification and regression trees (CART), finite state automata (FSA), Markov models (MM), and so on. Some approaches use linguistic information to parse the text and then map the syntactic structure to a prosodic structure; others use POS tags to extract prosodic phrases from the text. However, these methods tend to offer limited quality and to require complex procedures to accomplish their goals. It is desirable to have an improved method and system for detecting prosodic phrase breaks.
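To make the POS-based family of approaches mentioned above concrete, the following is a deliberately simple sketch of a rule that places a phrase break wherever a content word is followed by a function word. It is a simple heuristic shown for illustration only, not any of the specific methods named above; the tag set `FUNCTION_TAGS` and the function names `predict_breaks` and `render` are assumptions made for this example.

```python
# Hypothetical coarse tag set; real taggers use richer inventories.
FUNCTION_TAGS = {"DET", "PREP", "CONJ", "PRON", "AUX"}


def predict_breaks(tagged):
    """Return the indices of tokens that are followed by a phrase break.

    Rule (illustrative): break after a content word when the next
    word is a function word.
    """
    breaks = []
    for i in range(len(tagged) - 1):
        _, tag = tagged[i]
        _, next_tag = tagged[i + 1]
        if tag not in FUNCTION_TAGS and next_tag in FUNCTION_TAGS:
            breaks.append(i)
    return breaks


def render(tagged, breaks):
    """Join the words, marking each predicted break with '|'."""
    out = []
    for i, (word, _) in enumerate(tagged):
        out.append(word)
        if i in breaks:
            out.append("|")
    return " ".join(out)


sentence = [
    ("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"),
    ("across", "PREP"), ("the", "DET"), ("field", "NOUN"),
    ("and", "CONJ"), ("barked", "VERB"),
]

breaks = predict_breaks(sentence)
print(render(sentence, breaks))
```

A rule this coarse needs only POS tags, so its inputs are cheap to obtain, but it ignores syntax and phrase length, which illustrates the trade-off between input cost and accuracy.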