1. Field of the Invention
This invention relates to a method and an apparatus for rule based speech synthesis, in which speech is synthesized by concatenating speech units extracted from speech data.
2. Description of Related Art
A rule based speech synthesizing apparatus that synthesizes speech by concatenating speech units extracted from speech data is already known. In this apparatus, a speech waveform is first generated, and prosody is then imparted to the generated waveform to output the synthesized speech. It is known that the synthesis unit, that is, the unit from which the speech waveform is generated, significantly affects the quality of the synthesized speech.
In particular, deterioration of the sound quality due to concatenation distortion, caused by mismatch at the junctions between synthesis units, poses a problem. Several methods have been proposed for optimizing the synthesis units so as to suppress the adverse effect of the concatenation distortion. For example, a technique called phoneme environment clustering (COC) is disclosed in Japanese Laid-Open Patent Publication S64-78300, entitled ‘Speech Synthesis Method’, while a method for selecting an optimum speech unit, with the phoneme as the smallest unit, by narrowing down the candidates according to the phoneme linkage in the use environment, is disclosed in Japanese Laid-Open Patent Publication H8-248972, entitled ‘Rule Based Speech Synthesis Apparatus’.
[Patent Publication 1]
Japanese Laid-Open Patent Publication S64-78300
[Patent Publication 2]
Japanese Laid-Open Patent Publication H8-248972
The conventional methods disclosed in the above Patent Publications 1 and 2 select, from the relatively large quantity of synthesis units contained in a speech database, a relatively small set of speech elements that statistically reduces the concatenation distortion. When rule based speech synthesis is carried out using a set of speech elements obtained by these methods, there arises a problem that the quality of the synthesized speech varies depending on the uttered contents. That is, even though the concatenation distortion is small and the synthesized speech sounds smooth when one uttered sentence is synthesized, a combination of speech elements suffering from concatenation distortion may be used when another uttered sentence is synthesized, such that the resulting synthesized speech sounds unnatural at the junctions of the speech elements.
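The drawback described above can be illustrated with a minimal sketch. The sketch is not taken from either Patent Publication. It assumes a hypothetical reduced inventory holding one unit per phoneme, toy two-dimensional feature vectors at each unit boundary, and the Euclidean distance between adjoining boundaries as a stand-in measure of concatenation distortion. All names and values are illustrative only.

```python
import math

# Hypothetical reduced inventory: one speech unit per phoneme, each with
# toy feature vectors at its start and end boundaries (assumed values).
INVENTORY = {
    "a": {"start": (1.0, 0.0), "end": (1.0, 0.2)},
    "k": {"start": (1.0, 0.1), "end": (0.2, 0.9)},
    "i": {"start": (0.3, 1.0), "end": (0.0, 1.0)},
    "s": {"start": (0.9, 0.1), "end": (0.8, 0.0)},
}


def junction_distortion(left, right):
    """Assumed distortion measure: Euclidean distance between the end
    boundary of the left unit and the start boundary of the right unit."""
    return math.dist(INVENTORY[left]["end"], INVENTORY[right]["start"])


def utterance_distortions(phonemes):
    """Distortion at every junction of a phoneme sequence."""
    return [junction_distortion(l, r) for l, r in zip(phonemes, phonemes[1:])]


# The same fixed inventory is used for two different utterances.
smooth = utterance_distortions(["a", "k", "i"])  # well-matched junctions
rough = utterance_distortions(["i", "s", "a"])   # contains a poor junction

print("worst junction, utterance 1:", round(max(smooth), 3))
print("worst junction, utterance 2:", round(max(rough), 3))
```

With these assumed values, the first utterance has only small mismatches at its junctions, whereas the second contains one junction with a large mismatch, although both are built from the same fixed set of units. This corresponds to the variation in quality with uttered contents noted above.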