1. Field of the Invention
The present invention relates to the field of speech synthesis and, more particularly, to a method of speech segment selection for concatenative synthesis based on a prosody-aligned distance measure.
2. Description of Related Art
Currently, concatenative speech synthesis based on a speech corpus has become the major trend because the resulting speech sounds more natural than that produced by parameter-driven production models. The key issues of the method include a well-designed and recorded speech corpus, manual or automatic labeling of segmental and prosodic information, selection or decision of synthesis unit types, and selection of the speech segments for each unit type.
Early synthesizers were built by directly recording the 411 syllable (unit segment) types in a single-syllable manner, and the Chinese speech segments were selected from these recordings. Recording in this manner makes the segmentation easier, avoids the co-articulation problem, and usually yields a more stationary waveform and steadier prosody. However, the synthetic speech produced from speech segments extracted from single-syllable recordings sounds unnatural, and speech segments of this kind are not suitable for the selection of multiple-segment units. This is because neither natural prosody nor contextual information can be utilized in a single-syllable recording system.
In order to solve the above problem, there is provided a continuous speech recording system whereby both fluent prosody and contextual information can be taken into account. However, this method requires building a large speech corpus with manual intervention, so that it is labor-intensive and prone to inconsistent results.
U.S. Pat. No. 6,173,263 discloses a method and system for performing concatenative speech synthesis using half-phonemes. In such a method, a half-phoneme is a basic synthetic unit (candidate), and a Viterbi searcher is used to determine the best match of all half-phonemes in the phoneme sequence and the cost of the connection between half-phoneme candidates. U.S. Pat. No. 5,913,193 discloses a method and system of runtime acoustic unit selection for speech synthesis. This method minimizes the spectral distortion between the boundaries of adjacent instances, thereby producing more natural-sounding speech. U.S. Pat. No. 5,715,368 discloses a speech synthesis system and method utilizing phoneme information and rhythm information. This method uses phoneme and rhythm information to create an adjunct word chain, and synthesizes speech by using the word chain and independent words. U.S. Pat. No. 6,144,939 discloses a formant-based speech synthesizer employing demi-syllable concatenation with independent cross fade in the filter parameter and source domains. In such a method, concatenation of the demi-syllable units is facilitated by a waveform cross fade mechanism and a filter parameter cross fade mechanism. The waveform cross fade mechanism is applied in the time domain to the demi-syllable source signal waveforms, and the filter parameter cross fade mechanism is applied in the frequency domain by interpolating the corresponding filter parameters of the concatenated demi-syllables.
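By way of illustration only, the general idea of searching a lattice of candidate units with a Viterbi-style dynamic program may be sketched as follows. This is a generic sketch and is not the procedure of any of the patents cited above. The function name select_units, the two cost callbacks target_cost and join_cost, and the representation of a candidate as an arbitrary Python object are all assumptions introduced for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # hypothetical candidate-unit type (e.g. a feature record)


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[List[int], float]:
    """Return (chosen candidate index per position, total cost).

    candidates[t] holds the candidate units for position t.
    target_cost(t, u) scores how well unit u matches position t.
    join_cost(a, b) scores the connection between consecutive units.
    Both cost functions are assumed to be supplied by the caller.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: cheapest cost of any path ending at candidate j of the
    # position processed so far.
    best = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor of j at t

    for t in range(1, len(candidates)):
        prev_units = candidates[t - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for u in candidates[t]:
            # Cost of reaching u through each candidate of the previous position.
            costs = [best[i] + join_cost(p, u) for i, p in enumerate(prev_units)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total
```

In this sketch the search cost grows with the number of positions multiplied by the square of the number of candidates per position, which is the usual reason a dynamic program is preferred over enumerating every candidate sequence.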
However, none of the aforesaid prior art estimates the distortion resulting from prosody modification in the synthesis phase when selecting the synthesis unit. By using the concept of embedding the synthesizer in the analysis phase, the distortion measure can be obtained objectively and corresponds closely to the actual quality of the synthetic speech.