The artificial generation of speech by a machine is called speech synthesis. Speech synthesis is an important component of human-machine speech communication. Speech synthesis technology allows a machine to speak like a person and to convert information represented or stored in other forms into speech, so that people can easily obtain that information by listening.
Currently, a great deal of research is directed to text-to-speech (TTS) systems. In such a system, the text to be synthesized is input and processed by a text analyzer contained in the system, which outputs pronunciation description characters that include phonetic notation at the segment level and rhythm notation at the super-segment level. The text analyzer first divides the text to be synthesized into words, each with attribute labels and a pronunciation taken from a pronunciation dictionary. It then determines, for each word and each syllable, the linguistic and rhythm attributes of the target speech, such as sentence structure, tone, pauses, and word distance, according to semantic rules and phonetic rules. Thereafter, the pronunciation description characters are input to a synthesizer contained in the system, which outputs the synthesized speech.
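The dictionary-based first step of such a text analyzer can be illustrated with a minimal sketch. The segmentation strategy used here (forward maximum matching), the toy lexicon `LEXICON`, and the function names `segment` and `heteronyms` are assumptions made for illustration only; they are not taken from any particular system described above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pronunciation dictionary: word -> candidate pinyin readings.
LEXICON: Dict[str, List[str]] = {
    "银行": ["yin2 hang2"],
    "行": ["xing2", "hang2"],  # a heteronym: more than one reading
    "不": ["bu4"],
}


def segment(text: str, lexicon: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Split text into words by forward maximum matching (an assumed strategy)
    and attach each word's candidate readings from the dictionary."""
    max_len = max(len(word) for word in lexicon)
    result: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                result.append((word, lexicon[word]))
                break
        else:
            # Unknown character: emit it with no known reading.
            size = 1
            result.append((text[i], []))
        i += size
    return result


def heteronyms(words: List[Tuple[str, List[str]]]) -> List[str]:
    """Return the words whose dictionary entry lists several readings."""
    return [word for word, readings in words if len(readings) > 1]


words = segment("银行不行", LEXICON)
ambiguous = heteronyms(words)
```

In this sketch the dictionary alone cannot decide between the two readings of the final character; that decision is left to the later semantic and phonetic rules.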
In the art, acoustic models based on the Hidden Markov Model (HMM) have been widely used in speech synthesis technology, because they allow the synthesized speech to be easily modified and transformed. Speech synthesis is generally divided into a model training part and a synthesizing part. In the model training stage, a statistical model is trained on the acoustic parameters of the respective speech units in a speech database together with their label attributes, such as the corresponding segment, rhythm and the like. These labels originate from linguistic and acoustic knowledge, and the context features composed of them describe the corresponding speech attributes (such as tone, part of speech and the like). In the training stage of the HMM acoustic model, the model parameters are estimated by statistical computation over these speech unit parameters.
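The idea of estimating model parameters by statistical computation over labeled speech units can be shown with a deliberately simplified sketch. It fits a single one-dimensional Gaussian (mean and variance) per context label, which is a stand-in for, not an implementation of, full HMM training. The label strings, the sample values, and the function name `estimate_gaussians` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance
from typing import Dict, Iterable, List, Tuple


def estimate_gaussians(
    units: Iterable[Tuple[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Group acoustic values by context label and return (mean, variance) per label.

    Simplification: one scalar parameter per speech unit and one Gaussian per
    label, instead of multi-state HMMs over parameter vectors.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for label, value in units:
        grouped[label].append(value)
    return {
        label: (fmean(values), pvariance(values))
        for label, values in grouped.items()
    }


# Hypothetical training data: (context label, acoustic parameter value).
training_units = [
    ("a|tone=1", 1.0),
    ("a|tone=1", 3.0),
    ("a|tone=4", 5.0),
]
models = estimate_gaussians(training_units)
```

Note that the label seen only once yields a variance of zero, a small-scale illustration of the data sparsity that arises when many context combinations each have few training samples.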
In the art, because there are a great many context combinations with many variations, a tree-based clustering method using decision trees is generally used to handle these variations. A decision tree may cluster candidate primitives having similar context features and similar acoustic features into one category, thereby effectively avoiding data sparsity and reducing the number of models. A question set is the set of questions used to construct the decision tree; the question selected when a node is split is bound to that node and decides which primitives fall into the same leaf node. The clustering procedure draws on the predefined question set: each non-leaf node of the decision tree is bound to a “Yes/No” question, every candidate primitive entering at the root node must answer the question bound to the current node, and it proceeds into the left or right branch depending on the answer. Thus, syllables or phonemes having the same or similar context features end up in the same leaf node of the decision tree, and the model corresponding to that node may be an HMM, or a state of an HMM, described by model parameters. Clustering is also a learning procedure that allows new cases encountered in synthesis to be handled, so that the best-matching model can be found for them. The HMM model and the decision tree are obtained by training on and clustering the training data.
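The routing of a primitive through "Yes/No" questions down to a leaf can be sketched as follows. The tree here is hand-built rather than learned by clustering, and the feature names (`tone`, `pos`), the leaf identifiers, and the names `Node`, `Leaf` and `find_leaf` are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Context = Dict[str, str]  # context features of one primitive, e.g. tone, part of speech


@dataclass
class Leaf:
    model_id: str  # hypothetical identifier of the shared HMM or HMM state


@dataclass
class Node:
    question: Callable[[Context], bool]  # the "Yes/No" question bound to this node
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_leaf(tree: Union[Node, Leaf], context: Context) -> Leaf:
    """Answer the question at each node and follow the matching branch to a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
    return node


# Hand-built example tree; a real tree would be grown from training data.
tree = Node(
    question=lambda c: c.get("tone") == "4",
    yes=Leaf("state_A"),
    no=Node(
        question=lambda c: c.get("pos") == "noun",
        yes=Leaf("state_B"),
        no=Leaf("state_C"),
    ),
)

leaf = find_leaf(tree, {"tone": "2", "pos": "noun"})
```

Because every possible context answers each question with yes or no, a context combination never seen in training still reaches some leaf, which is how the tree handles new cases at synthesis time.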
In the synthesizing stage, the context feature labels of a heteronym are obtained by a text analyzer and a context label generator. For each context feature label, the corresponding acoustic parameters (such as the state sequence of the HMM acoustic model) are found in the trained decision tree. Then, the corresponding speech parameters are obtained by applying a parameter generation algorithm to the model parameters, and the speech is synthesized by the synthesizer.
The goal of a speech synthesis system is to synthesize intelligible and natural speech. However, it is difficult to guarantee accurate pronunciation in Chinese speech synthesis systems, because the pronunciation of a heteronym is often determined by its meaning, and semantic comprehension is a challenging task. This dependency results in less than satisfactory accuracy in heteronym pronunciation prediction. In the art, even when the predicted pronunciation is uncertain, a speech synthesis system generally still outputs a single definite pronunciation for the heteronym.
In Chinese, different pronunciations represent different meanings. If the speech synthesis system provides the wrong pronunciation, the listener may get an ambiguous meaning, which is undesirable. Thus, for speech synthesis systems applied in daily life, work and scientific research (such as car navigation, automatic voice service, broadcasting, human robot animation, etc.), obviously erroneous heteronym pronunciation results in an unsatisfactory user experience. Accordingly, in the field of speech synthesis, there is a need for improved methods and systems for heteronym speech synthesis.