The goal of a text-to-speech (TTS) synthesis system is to make a computer speak in a natural voice, as a person does. When a person reads a sentence naturally, pauses occur not only at punctuation marks (e.g., periods and commas), which inherently indicate pauses, but also at some locations without punctuation. Thus, to achieve synthesized voice of higher quality, a voice synthesis system should be able to decide automatically which locations without punctuation also require a pause. This requires performing prosody parsing on the text as a front-end process, which helps improve the quality of voice synthesis.
A rule-learning based method for predicting Chinese prosodic structure is proposed in “Rule-learning based prosodic structure prediction”, ZHAO Sheng, et al., Journal of Chinese Information Processing, Vol. 16, No. 5, 2002.9, pp. 30-37. The method extracts linguistic features and two-level prosodic structure tags from a manually prosody-labeled corpus, establishes an example database, and then automatically induces rules for prosodic phrase prediction from the examples by using rule-learning algorithms.
However, the foregoing method requires a large corpus that has been prosody parsed in advance. Performing prosody parsing on a corpus is arduous work, and the quality of the result is hard to control.
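To make the dependence on a hand-labeled corpus concrete, the general shape of the rule-learning approach described above can be sketched as follows. This is only an illustrative sketch, not the algorithm of Zhao et al.: the toy corpus, the part-of-speech feature pair, the `B`/`N` break tags, and the majority-vote rule induction are all assumptions introduced here, standing in for the richer features, two-level tags, and rule-learning algorithms of the cited method.

```python
from collections import Counter, defaultdict

# Hypothetical manually labeled corpus (an assumption for illustration).
# Each sentence is a list of (word, part_of_speech, tag) triples, where the
# tag is "B" if a prosodic break follows the word and "N" otherwise.
CORPUS = [
    [("we", "PRON", "N"), ("visited", "VERB", "N"), ("the", "DET", "N"),
     ("museum", "NOUN", "B"), ("in", "PREP", "N"), ("the", "DET", "N"),
     ("city", "NOUN", "N")],
    [("she", "PRON", "N"), ("read", "VERB", "N"), ("a", "DET", "N"),
     ("book", "NOUN", "B"), ("on", "PREP", "N"), ("the", "DET", "N"),
     ("train", "NOUN", "N")],
]


def extract_examples(corpus):
    """Build the example database: one (features, tag) pair per word boundary.

    The features are the parts of speech to the left and right of the
    boundary; the tag is the break label attached to the left word.
    """
    examples = []
    for sentence in corpus:
        for (_, left_pos, tag), (_, right_pos, _) in zip(sentence, sentence[1:]):
            examples.append(((left_pos, right_pos), tag))
    return examples


def induce_rules(examples, min_count=2):
    """Induce one rule per feature pattern by majority vote.

    This simple vote is a stand-in for a real rule-learning algorithm.
    Patterns seen fewer than min_count times yield no rule.
    """
    counts = defaultdict(Counter)
    for features, tag in examples:
        counts[features][tag] += 1
    rules = {}
    for features, tag_counts in counts.items():
        if sum(tag_counts.values()) >= min_count:
            rules[features] = tag_counts.most_common(1)[0][0]
    return rules


def predict_breaks(pos_tags, rules, default="N"):
    """Apply the induced rules to each boundary of an unlabeled sentence."""
    return [rules.get(pair, default) for pair in zip(pos_tags, pos_tags[1:])]


examples = extract_examples(CORPUS)
rules = induce_rules(examples)
print(predict_breaks(["PRON", "VERB", "DET", "NOUN", "PREP", "DET", "NOUN"], rules))
# prints ['N', 'N', 'N', 'B', 'N', 'N']
```

Every rule in this sketch comes directly from the break tags in `CORPUS`, so both the coverage and the correctness of the induced rules are bounded by how much text has been labeled by hand and how consistently it was labeled.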