Synthesis units drawn from a large corpus have become a practical way to generate general-purpose speech sounds in text-to-speech (TTS) systems. Corpus-based TTS has become the major trend because the resulting speech sounds more natural than that produced by parameter-driven production models. The key issues for this approach may include a well-designed and well-recorded corpus, manual or automatic labeling of segmental and prosodic information, the selection of synthesis unit types, and the selection of the speech segments for each unit type.
Features for defining unit types may include context-independent features, context-dependent features, or both. FIG. 1 shows exemplary features for defining unit types. In FIG. 1, for example, context-independent features may include the phonetic syllable and the prosodic tone. Context-dependent features may include the phonetic left/right phone and the prosodic left/right tone.
Any unit type may be specified by a feature vector consisting of several feature dimensions. The feature vector holding the features of the unit itself is called the Unit Vector (UV). The Context Vector (CV), on the other hand, consists of text information of a unit. A context-dependent unit may therefore be specified by a Contextual Unit Vector (CUV), which is the concatenation of the UV and the CV. FIG. 2 illustrates how the size of the feature vector space depends on the resolution of each feature dimension, based on FIG. 1. In FIG. 2, three exemplary unit classes, CU2, CU3, and CU4, are used.
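As an illustration only, the relationship between the UV, the CV, and the CUV, and the way the feature vector space grows with the resolution of each dimension, may be sketched as follows. The field names follow the exemplary features of FIG. 1. The class names, the helper `space_size`, and all resolution values other than the 413 syllable types are hypothetical assumptions chosen for the example and are not taken from FIG. 2.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class UnitVector:
    """Context-independent features of the unit itself (UV)."""
    syllable: str
    tone: int


@dataclass(frozen=True)
class ContextVector:
    """Context-dependent features of the unit (CV)."""
    left_phone: str
    right_phone: str
    left_tone: int
    right_tone: int


@dataclass(frozen=True)
class ContextualUnitVector:
    """Concatenation of a UV and a CV (CUV)."""
    uv: UnitVector
    cv: ContextVector


def space_size(resolutions):
    """Number of distinct vectors: the product of the per-dimension resolutions."""
    return prod(resolutions.values())


# Assumed resolutions for illustration; only the syllable count is from the text.
UV_RESOLUTION = {"syllable": 413, "tone": 5}
CV_RESOLUTION = {"left_phone": 40, "right_phone": 40, "left_tone": 5, "right_tone": 5}

uv_size = space_size(UV_RESOLUTION)
cuv_size = space_size({**UV_RESOLUTION, **CV_RESOLUTION})

# A single hypothetical context-dependent unit type.
example_cuv = ContextualUnitVector(
    UnitVector("ma", 3),
    ContextVector("a", "n", 1, 4),
)
```

Under these assumed resolutions, adding the four context dimensions multiplies the number of possible unit types by 40 × 40 × 5 × 5, which shows why a coarser resolution per dimension may be needed to keep the inventory tractable.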
A typical method of building a synthesizer is to record 413 syllable types directly, one syllable at a time. This may make segmentation easier, may avoid the co-articulation problem, and usually yields a more stationary waveform and steadier prosody. However, synthetic speech produced from speech segments extracted from single-syllable recordings has been found to sound unnatural, and such speech segments are also believed to be unsuitable for multiple-segment unit selection. This is because neither natural prosody nor contextual information can be utilized in a single-syllable recording system. Therefore, how to select a well-designed text script for speech recording may be one of the key factors for TTS systems.
There are generally two approaches to text script generation. One emphasizes the diversity of unit types in the inventory. The other pursues the probability that the unit type of an input case is found in the inventory. The first approach tries to select text that is rich in phonetic and prosodic features. The text script is usually selected from more than one corpus in order to find various kinds of contextual combinations. Sentences purposely designed by linguists are sometimes used as well. Fully automatic methods, for example a greedy algorithm, are also broadly used in some applications. This approach may produce a large text script that is costly both for building a TTS system and for the storage requirement of the system.
The second approach represents the recent trend of using a very large corpus. A weighted greedy algorithm is used to select a subset corpus from a large raw text corpus. The weights may be applied in two ways: as the occurrence frequencies of unit types or as the reciprocals of those frequencies. A list of necessary unit vectors is built first by sorting the unit vectors by occurrence rate and keeping in the list the high-occurrence-rate ones whose accumulated frequency is larger than a specified number. With the weighted greedy algorithm, the sentence with the highest sum of weights is selected first, and the units that occur in it are then deleted from the list of necessary unit vectors. The occurrence rates of the unit types in the large corpus are taken into account in text script generation so as to maximize the probability of hitting the same unit type in synthesis. Since there is a risk of missing some core unit types, one remedy is to add a sufficient number of each core unit type to the list. This partly fixes the problem, but the algorithm may still not be precisely controllable or flexibly scalable. One cannot decide when to stop the procedure before the end of the experiment, and must passively accept the resulting hit rate, covering rate, and text script size.
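The weighted greedy selection described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: each sentence is assumed to be already converted to a list of unit types, the function names and the `coverage` parameter are hypothetical, and the list of necessary units is built under one possible reading of the text, namely that the most frequent unit types are kept until their accumulated frequency reaches a specified share of all unit instances.

```python
from collections import Counter


def build_necessary_units(corpus, coverage=0.9):
    """Keep the most frequent unit types until they account for the
    requested share of all unit instances (an assumed interpretation)."""
    counts = Counter(unit for sentence in corpus for unit in sentence)
    total = sum(counts.values())
    necessary, accumulated = {}, 0
    for unit, n in counts.most_common():
        if accumulated >= coverage * total:
            break
        necessary[unit] = n
        accumulated += n
    return necessary


def weighted_greedy(corpus, necessary, reciprocal=False):
    """Return indices of selected sentences, best-scoring sentence first.

    The weight of a unit type is its frequency, or the reciprocal of its
    frequency when `reciprocal` is True.
    """
    weights = {u: (1.0 / n if reciprocal else float(n)) for u, n in necessary.items()}
    remaining = set(necessary)          # necessary units not yet covered
    candidates = list(range(len(corpus)))
    script = []

    def score(i):
        # Sum of weights of the still-needed unit types in sentence i.
        return sum(weights[u] for u in set(corpus[i]) & remaining)

    while remaining and candidates:
        best = max(candidates, key=score)
        if score(best) == 0:
            break                       # no sentence covers anything new
        script.append(best)
        remaining -= set(corpus[best])  # delete the units that occurred
        candidates.remove(best)
    return script


# Toy corpus: each sentence is a list of unit types.
toy_corpus = [["a", "b", "a"], ["a", "c"], ["d"], ["b", "c", "d"]]
toy_necessary = build_necessary_units(toy_corpus, coverage=1.0)
toy_script = weighted_greedy(toy_corpus, toy_necessary)
```

In this sketch the loop stops only when the list of necessary units is empty or no remaining sentence contributes a needed unit, which reflects the limitation noted above: the resulting script size and rates are known only after the run finishes.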
In other words, one approach to text script generation for a corpus-based TTS system may emphasize the diversity of unit types in the inventory, i.e., the covering rate of unit types. The other approach may pursue the probability that the unit type of an input case is found in the inventory, i.e., the hit rate of unit instances.
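The difference between the two measures may be illustrated with a short sketch. The definitions used here are assumptions made for the example: the covering rate is taken as the fraction of distinct unit types in a reference corpus that also appear in the text script, and the hit rate as the fraction of unit instances in the reference corpus whose type appears in the text script. The function names are hypothetical.

```python
def covering_rate(script_units, reference_units):
    """Fraction of distinct reference unit types present in the script."""
    if not reference_units:
        raise ValueError("reference_units must not be empty")
    inventory = set(script_units)
    reference_types = set(reference_units)
    return len(reference_types & inventory) / len(reference_types)


def hit_rate(script_units, reference_units):
    """Fraction of reference unit instances whose type is in the script."""
    if not reference_units:
        raise ValueError("reference_units must not be empty")
    inventory = set(script_units)
    hits = sum(1 for unit in reference_units if unit in inventory)
    return hits / len(reference_units)


# Toy data: "a" and "b" are frequent, "c" and "d" are rare.
reference = ["a", "a", "a", "b", "b", "c", "d"]
script = ["a", "b"]

toy_covering = covering_rate(script, reference)  # 2 of 4 types
toy_hit = hit_rate(script, reference)            # 5 of 7 instances
```

With this toy data the script covers only half of the unit types but hits five of the seven unit instances, because the covered types are the frequent ones. A script optimized for one measure may therefore score poorly on the other.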