Text-to-speech (TTS) conversion refers to technology that uses natural language processing algorithms, running on a computer, to convert text into natural-sounding speech. TTS facilitates user interaction with the computer, thereby improving the flexibility of the application system.
A typical TTS system, as shown in FIG. 1, comprises a text analysis unit 101, a prosody prediction unit 102 and a speech synthesis unit 103. The text analysis unit 101 is responsible for parsing the input plain text into rich text with descriptive prosody annotations such as pronunciations, stresses, phrase boundaries and pauses. The prosody prediction unit 102 is responsible for predicting the phonetic representation of prosody, such as the values of pitch, duration and energy of each synthesis segment, according to the result of the text analysis. The speech synthesis unit 103 is responsible for generating intelligible speech as the physical realization of the semantic and prosodic information implicitly contained in the plain text.
For example, performing TTS on a sample text proceeds as follows. The example uses Chinese-language text, but of course the present invention may be applied to any language. First the text is input into the text analysis unit 101, so that the pronunciation of each character and the phrase boundaries are identified. The pronunciations, written in pinyin with tone numbers, are as follows:
zhe4 shi4 yi2 ge4 zhuan1 li4 shen1 qing3
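The tone-numbered pinyin annotation in the example above can be split into syllable and tone pairs for later processing. The following is a minimal sketch only: the function name `parse_pinyin` is hypothetical, and it assumes lowercase ASCII syllables followed by a single tone digit from 1 to 5 (a common convention, not something the text specifies).

```python
import re

def parse_pinyin(annotation):
    """Split tone-numbered pinyin into (syllable, tone) pairs.

    Assumption: each token is lowercase letters followed by one digit 1-5.
    """
    pairs = []
    for token in annotation.split():
        match = re.fullmatch(r"([a-z]+)([1-5])", token)
        if match is None:
            raise ValueError(f"not tone-numbered pinyin: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs

# The annotation produced by the text analysis step in the example.
syllables = parse_pinyin("zhe4 shi4 yi2 ge4 zhuan1 li4 shen1 qing3")
```

Each pair could then carry further annotations, such as phrase boundaries, into the prosody prediction step.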
With the above text analysis, the prosody prediction unit 102 performs prosody prediction on the characters in the text. Then, the speech synthesis unit 103 produces the voice corresponding to said text based on the predicted prosody information. In current TTS technologies, statistics-based approaches that rely on distance definitions are an important trend. In these approaches, text analysis and prosody prediction models are trained from a large labeled corpus, and speech synthesis is always based on selection among multiple candidates for each synthesis segment. A general framework for corpus-based TTS is shown in FIG. 2.
In statistics-based approaches, especially in prosody prediction and inventory-based selection, many difficult problems involve defining the distance between a sample and a given cluster. Even when complex contexts are used to cluster the data, the data within almost every cluster are so widely dispersed, and the clusters overlap so heavily, that it is difficult to evaluate whether a sample belongs to a given cluster.
There are some classical distance definitions used in current TTS, such as the weighted Euclidean distance and the Mahalanobis distance. The Euclidean distance uses an average of the sample points as the representative point of the cluster, and it is often difficult to choose the most appropriate value for that point. Moreover, the relationship among different dimensions may be ignored or poorly modeled by pre-given knowledge. A problem with the Mahalanobis distance is its poor capability to model complex distributions.
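For reference, the two classical distances named above can be sketched as follows. This is an illustrative sketch, not the formulation of any particular TTS system: the function names are hypothetical, the Mahalanobis version is restricted to two dimensions so that the covariance matrix can be inverted in closed form, and the sample cluster is made up.

```python
import math

def weighted_euclidean(x, center, weights):
    """Weighted Euclidean distance from sample x to a cluster center."""
    return math.sqrt(sum(w * (a - c) ** 2
                         for a, c, w in zip(x, center, weights)))

def mean_vector(samples):
    """Component-wise mean of a list of equal-length points."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def covariance_2d(samples, mean):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    n = len(samples)
    sxx = sum((s[0] - mean[0]) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - mean[1]) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mean[0]) * (s[1] - mean[1]) for s in samples) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance in 2-D: sqrt(d^T S^-1 d), with d = x - mean."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2  # must be non-zero for S to be invertible
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical cluster of 2-D feature points (for illustration only).
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center = mean_vector(cluster)
cov = covariance_2d(cluster, center)
```

The weights of the Euclidean distance must be supplied from outside, which is where pre-given knowledge enters, whereas the Mahalanobis distance estimates the relationship among dimensions from the covariance but summarizes the cluster by a single mean and covariance.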
FIG. 3 is a histogram of the duration distribution, as a log distribution, of the samples in a cluster in a TTS corpus. As shown in FIG. 3, the data are so dispersed that the mean-value approach of the Euclidean distance is not able to model the distribution, and the Mahalanobis distance is also ill-suited to a refined model because the distribution is not normal.
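The difficulty with dispersed data can be illustrated with a small numerical sketch. The durations below are invented for illustration and are not taken from FIG. 3; the two helper functions are hypothetical. In one dimension, the Mahalanobis distance to a cluster reduces to the absolute deviation from the mean divided by the standard deviation.

```python
import statistics

# Hypothetical segment durations (ms) forming two separated groups.
durations = [50, 55, 60, 65, 200, 210, 220, 230]

mean = statistics.mean(durations)
stdev = statistics.stdev(durations)

def distance_to_mean(x):
    """1-D Mahalanobis distance: deviation from the mean in stdev units."""
    return abs(x - mean) / stdev

def distance_to_nearest_sample(x):
    """Distance from x to the closest observed duration."""
    return min(abs(x - d) for d in durations)

gap_point = 136      # lies between the two groups, far from every sample
typical_point = 60   # coincides with an observed sample
```

Under these assumptions, `distance_to_mean` ranks `gap_point` as closer to the cluster than `typical_point`, even though no observed sample lies near `gap_point`. A mean-based distance therefore misrepresents membership when the data do not concentrate around the mean.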