Speech is the most convenient way for humans to communicate with each other. With the development of speech technology, speech has become the most convenient interface between humans and machines/computers. The speech technology mainly includes speech recognition and text-to-speech (TTS) technologies.
The existing TTS systems, such as formant and small-corpus concatenative TTS systems, deliver speech with a quality that is unacceptable to most listeners. Recent development in large-corpus concatenative TTS systems makes synthesized speech more acceptable, enabling human-machine interactive systems to have wider applications. With the improvement of the TTS systems' quality, various human-machine interactive systems, such as e-mail readers, news readers, in-car information systems, etc., have become feasible.
However, with the wider and wider application of various human-machine interactive systems, people hope to have the speech output quality of these human-machine interactive systems further improved through research on TTS systems.
Generally, a general-purpose TTS system tries to mimic human speech with speech units at a very low level, such as phone, syllable, etc. Choosing such small speech units is actually a compromise between the TTS system's quality and flexibility. Generally speaking, the TTS system that uses small speech units like phones or syllables may deal with any text content with a relatively reasonable number of joining points, so it has good flexibility, while the TTS system using big speech units like words, phrases, etc. may improve quality because of a relatively small number of joining points between the speech units, but the drawback of this TTS system is that the big speech units would cause difficulties in dealing with “out of vocabulary (OOV)” cases, that is, the TTS system using big speech units has poor flexibility.
As to the application of the synthesized speech, it may be found that some applications have a very narrow use domain, for instance, a weather-forecast IVR (interactive voice responding) system, a stock quoting IVR system, a flight-information querying IVR system, etc. These applications highly depend on their use domains and have a very limited number of synthesizing patterns. In such cases, the TTS system has an opportunity to take advantages of the big speech units like word/phrase so as to avoid too many joining points and can mimic speech with high quality.
In the prior art, there are many TTS systems based on the word/phrase splicing technology. The U.S. Pat. No. 6,266,637 assigned to the same assignee of the present invention discloses a TTS system based on the word/phrase splicing technology. Such a TTS system splices all the words or phrases together to construct a remarkably natural speech. When such a TTS system based on the word/phrase splicing technology cannot find corresponding words or phrases in its dictionaries, it will use the general-purpose TTS system to generate the synthesized speech corresponding to the words or phrases. Although the TTS system with word/phrase splicing technology may search for word or phrase segments from different speeches, it cannot guarantee the continuity and naturalness of the synthesized speech.
It is well known that, as compared with the synthesized speech based on the word/phrase splicing technology, human speech is the most natural voice. There is a lot of syntactic and semantic information embedded in human speech in a completely natural way. When researchers continuously improve the general-purpose TTS systems, they also acknowledge that there is no perfect substitute for pre-recorded human speech. Thus, in order to further improve the quality of the synthesized speech, in some specific application domains, the bigger speech units, such as sentences, should be fully used, so as to guarantee the continuity and naturalness of the synthesized speech. However, up to now, there is still not any technical solution that directly utilizes such bigger speech units to generate synthesized speech with high quality.