1. Field of the Invention
The present invention generally relates to a high-efficiency method for text script generation, and more particularly to a method for efficiently generating a scalable and controllable text script in the design of corpus-based text-to-speech (TTS) systems.
2. Description of Prior Art
With improvements in computer hardware, concatenative speech synthesis based on a large corpus has become a feasible way to generate general-purpose speech. Corpus-based TTS has become the major trend because the resulting speech sounds more natural than that produced by parameter-driven production models. The key issues for this approach include a well-designed and well-recorded corpus, manual or automatic labeling of segmental and prosodic information, the choice of synthesis unit types, and the selection of speech segments for each unit type.
We previously built a synthesizer by directly recording the 411 syllable types one syllable at a time. Recording in this way makes segmentation easier, avoids the co-articulation problem, and usually yields a more stationary waveform and steadier prosody. However, we found that synthetic speech produced from segments extracted from single-syllable recordings sounds unnatural, and we also believe that such segments are not suitable for multiple-segment unit selection. This is because neither natural prosody nor contextual information can be utilized in a single-syllable recording system.
Conventionally, there are two approaches to text script generation. One emphasizes the diversity of unit types in the inventory. The other maximizes the probability that the unit type of an input case can be found in the inventory. The first approach tries to select text rich in phonetic and prosodic features. The text script is usually selected from more than one corpus in order to cover various kinds of contextual combinations. Sentences purposely designed by linguists are also used. Fully automatic methods, for example the greedy algorithm, are widely used in some applications as well. The disadvantage of this approach is that it produces a large text script, which is costly both in building a TTS system and in the storage the system requires.
The second approach represents the recent trend of using a very large corpus. A weighted greedy algorithm is used to select a subset corpus from a large raw text corpus. The weights can be applied in two ways: as the occurrence frequencies of unit types or as the reciprocals of those frequencies. A list of necessary unit vectors is built first by sorting the unit vectors by occurrence rate and keeping in the list the high-occurrence-rate ones whose accumulated frequency is larger than a specified number. With the weighted greedy algorithm, the sentence with the highest sum of weights is selected first, and the units that occur in it are then deleted from the list of necessary unit vectors. The occurrence rates of the unit types in the large corpus are taken into account in text script generation so as to maximize the probability of hitting the same unit type in synthesis. Since there is a risk of missing some core unit types, one remedy is to fill in a sufficient number of each core unit type in the list. This partly fixes the problem, but the algorithm is still neither precisely controllable nor flexibly scalable. One cannot decide when to stop the procedure other than at the end of the experiment, and must passively accept the resulting hit rate, covering rate, and text script size.
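For illustration only, the prior-art weighted greedy selection described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure of any particular published system: each sentence is assumed to be already converted into a list of unit-type labels, the "specified number" is interpreted as a cumulative relative-frequency threshold, and the names `necessary_units`, `weighted_greedy`, `coverage`, and `mode` are hypothetical.

```python
from collections import Counter


def necessary_units(corpus, coverage=0.9):
    """Build the list of necessary unit types.

    Assumption: unit types are sorted by occurrence count and kept, most
    frequent first, until their accumulated relative frequency reaches
    `coverage` (one possible reading of the "specified number").
    """
    counts = Counter(unit for sentence in corpus for unit in sentence)
    total = sum(counts.values())
    needed, accumulated = {}, 0
    for unit, count in counts.most_common():
        needed[unit] = count
        accumulated += count
        if accumulated / total >= coverage:
            break
    return needed


def weighted_greedy(corpus, needed, mode="frequency"):
    """Select sentences by highest sum of weights over still-needed units.

    mode="frequency"  -> weight is the occurrence count of the unit type
    mode="reciprocal" -> weight is the reciprocal of that count
    Returns the selected sentence indices and the unit types left uncovered.
    """
    if mode == "frequency":
        weight = {unit: float(count) for unit, count in needed.items()}
    else:
        weight = {unit: 1.0 / count for unit, count in needed.items()}

    remaining = set(needed)            # unit types not yet covered
    candidates = list(range(len(corpus)))
    script = []

    def score(index):
        # Each needed unit type counts once per sentence.
        return sum(weight[u] for u in set(corpus[index]) & remaining)

    while remaining and candidates:
        best = max(candidates, key=score)
        if score(best) == 0:           # no sentence covers anything new
            break
        script.append(best)
        remaining -= set(corpus[best])  # delete covered units from the list
        candidates.remove(best)
    return script, remaining


# Toy corpus: each sentence is a list of hypothetical unit-type labels.
corpus = [["a", "b"], ["a", "c", "d"], ["b"], ["e"]]
needed = necessary_units(corpus, coverage=1.0)
script, uncovered = weighted_greedy(corpus, needed, mode="reciprocal")
```

Note that the loop stops only when the list of necessary units is exhausted or no sentence adds coverage, so the size of the resulting script is an outcome of the run rather than a parameter the designer sets in advance.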
In view of the foregoing, we have invented an integrated new method for generating a text script in corpus-based TTS design that delivers better performance and overcomes the disadvantages mentioned above.