1. Field of the Invention
The present invention relates generally to a speech synthesis method for text-to-speech synthesis, and more particularly to a speech synthesis method for generating a speech signal from information such as a phoneme symbol string, a pitch and a phoneme duration.
2. Description of the Related Art
A method of artificially generating a speech signal from a given text is called “text-to-speech synthesis.” Text-to-speech synthesis is generally carried out in three stages comprising a speech processor, a phoneme processor and a speech synthesis section. An input text is first subjected to morphological analysis and syntax analysis in the speech processor, and then to processing of accents and intonation in the phoneme processor. Through this processing, information such as a phoneme symbol string, a pitch and a phoneme duration is output. In the final stage, the speech synthesis section synthesizes a speech signal from the information such as the phoneme symbol string, the pitch and the phoneme duration. Thus, the speech synthesis method for use in text-to-speech synthesis is required to synthesize speech for a given phoneme symbol string with a given prosody.
According to the operational principle of a speech synthesis apparatus for speech-synthesizing a given phoneme symbol string, basic characteristic parameter units (hereinafter referred to as “synthesis units”) such as CV, CVC and VCV (V=vowel; C=consonant) are stored in a storage and selectively read out. The read-out synthesis units are connected, with their pitches and phoneme durations being controlled, whereby speech synthesis is performed. Accordingly, the stored synthesis units substantially determine the quality of the synthesized speech.
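The read-out-and-connect principle described above can be illustrated by the following minimal sketch. It is an illustration only and not part of any cited apparatus: the names `inventory`, `stretch` and `synthesize` are hypothetical, each synthesis unit is assumed to be a short list of parameter frames, duration is controlled by simple nearest-frame resampling, and pitch control is reduced to attaching a target pitch value to every output frame.

```python
def stretch(frames, target_len):
    """Resample a unit to target_len frames by nearest-index selection
    (an assumed, deliberately crude form of duration control)."""
    n = len(frames)
    return [frames[i * n // target_len] for i in range(target_len)]


def synthesize(inventory, targets):
    """Read out each requested unit, adjust its duration, attach the
    target pitch, and connect the results in order.

    targets: list of (unit_name, n_frames, pitch_hz) tuples.
    Returns a list of (frame, pitch_hz) pairs.
    """
    output = []
    for unit_name, n_frames, pitch_hz in targets:
        frames = inventory[unit_name]          # selective read-out
        for frame in stretch(frames, n_frames):
            output.append((frame, pitch_hz))   # connection with prosody
    return output


# Hypothetical stored CV units; the strings stand in for parameter frames.
inventory = {
    "ka": ["k1", "a1"],
    "ta": ["t1", "a2", "a3"],
}
targets = [("ka", 4, 120.0), ("ta", 2, 100.0)]
result = synthesize(inventory, targets)
```

In this sketch the unit "ka" is lengthened from two frames to four and the unit "ta" is shortened from three frames to two, which shows why the stored units themselves largely fix the quality of the output.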
In the prior art, the synthesis units are prepared manually, relying on the skill of the persons preparing them. In most cases, synthesis units are sifted out from speech signals by trial and error, which requires a great deal of time and labor. Jpn. Pat. Appln. KOKAI Publication No. 64-78300 (“SPEECH SYNTHESIS METHOD”) discloses a technique called “context-oriented clustering (COC)” as an example of a method of automatically and easily preparing synthesis units for use in speech synthesis.
The principle of COC will now be explained. Labels of the names of phonemes and phonetic contexts are attached to a number of speech segments. The speech segments with the labels are classified into a plurality of clusters relating to the phonetic contexts on the basis of the distance between the speech segments. The centroid of each cluster is used as a synthesis unit. The phonetic context refers to a combination of all factors constituting an environment of the speech segment. The factors are, for example, the phoneme name of the speech segment, a preceding phoneme, a subsequent phoneme, a further subsequent phoneme, a pitch period, power, the presence/absence of stress, the position from an accent nucleus, the time from a breathing spell, the speed of speech, feeling, etc. The phoneme elements of each phoneme in actual speech vary, depending on the phonetic context. Thus, if the synthesis unit of each of the clusters relating to the phonetic context is stored, natural speech can be synthesized in consideration of the influence of the phonetic context.
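The clustering principle described above can be illustrated by the following simplified sketch. It is not the procedure of the cited publication: it assumes that each labeled speech segment is a dictionary with hypothetical keys `prev`, `next` and `vec` (a feature vector), that the distance is the squared Euclidean distance, and that only a single split by one context factor is made, namely the factor that yields the smallest total within-cluster distance. The centroid of each resulting cluster serves as the synthesis unit.

```python
from collections import defaultdict


def centroid(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distortion(vectors):
    """Sum of squared Euclidean distances to the cluster centroid."""
    c = centroid(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors)


def best_context_split(segments, factors=("prev", "next")):
    """Split the segments by the context factor that minimizes the
    total within-cluster distortion; return that factor and the
    centroid (synthesis unit) of each resulting cluster."""
    best = None
    for factor in factors:
        groups = defaultdict(list)
        for seg in segments:
            groups[seg[factor]].append(seg["vec"])
        total = sum(distortion(vecs) for vecs in groups.values())
        if best is None or total < best[0]:
            best = (total, factor, groups)
    _, factor, groups = best
    return factor, {value: centroid(vecs) for value, vecs in groups.items()}


# Hypothetical segments of one phoneme in four phonetic contexts.
segments = [
    {"prev": "k", "next": "t", "vec": [1.0, 1.0]},
    {"prev": "k", "next": "s", "vec": [1.0, 3.0]},
    {"prev": "s", "next": "t", "vec": [9.0, 1.0]},
    {"prev": "s", "next": "s", "vec": [9.0, 3.0]},
]
factor, units = best_context_split(segments)
```

With this illustrative data, splitting by the preceding phoneme gives a total distortion of 4.0, whereas splitting by the subsequent phoneme gives 64.0, so the preceding phoneme is chosen as the clustering context. Note that the distance is measured only between the stored segments themselves.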
As has been described above, in text-to-speech synthesis, it is necessary to synthesize speech by altering the pitch and duration of each synthesis unit to predetermined values. Owing to the alteration of the pitch and duration, the quality of the synthesized speech becomes slightly lower than the quality of the speech signal from which the synthesis unit was sifted out.
On the other hand, in the case of COC, the clustering is performed on the basis of only the distance between speech segments. Thus, the effect of altering the pitch and duration at the time of synthesis is not considered at all. As a result, the clusters obtained by COC and the synthesis units of each cluster are not necessarily appropriate at the level of the synthesized speech obtained by actually altering the pitch and duration.
An object of the present invention is to provide a speech synthesis method capable of efficiently enhancing the quality of synthesized speech generated by text-to-speech synthesis.
Another object of the invention is to provide a speech synthesis method suitable for obtaining high-quality synthesized speech in text-to-speech synthesis.
Still another object of the invention is to provide a speech synthesis method capable of obtaining synthesized speech with less spectral distortion due to alteration of a basic frequency.