Conventionally, a variety of speech synthesis devices have been developed that analyze a text sentence and generate synthesized speech, through rule synthesis, from the speech information represented by the sentence. Documents which disclose related art include Patent Document 1 (Japanese Patent No. 2893697), Non-Patent Document 1 (Huang, Acero, Hon; “Spoken Language Processing,” Prentice Hall, pp. 689-836, 2001), Non-Patent Document 2 (Ishikawa, “Prosodic Control for Japanese Text-to-Speech Synthesis,” Technical Report of The Institute of IEICE, The Institute of Electronics, Information and Communication Engineers, Vol. 100, No. 392, pp. 27-34, 2000), Non-Patent Document 3 (Abe, “An Introduction To Speech Synthesis Units,” Technical Report of The Institute of IEICE, The Institute of Electronics, Information and Communication Engineers, Vol. 100, No. 392, pp. 35-42, 2000), and Non-Patent Document 4 (Moulines, Charpentier; “Pitch-synchronous Waveform Processing Techniques For Text-To-Speech Synthesis Using Diphones,” Speech Communication 9, pp. 435-567, 1990).
FIG. 1 is a block diagram showing an exemplary configuration of a general rule-synthesis type speech synthesis device. Referring to FIG. 1, the speech synthesis device comprises text analysis unit 20, prosodic feature generation unit 21, phoneme selection unit 22, prosodic feature control unit 23, waveform connection unit 24, and original speech waveform information storage unit 25.
Original speech waveform information storage unit 25 comprises phoneme waveform storage unit 27 which stores original speech waveforms in phoneme units, and additional information storage unit 26 which stores attribute information of each phoneme waveform. Here, the original speech waveform refers to a natural speech waveform which has been previously collected for use in the generation of synthesized speech, while the attribute information of an original speech waveform refers to phonemic information and prosodic information such as the phonemic environment in which the original speech waveform was generated, a pitch frequency, an amplitude, continuation time length information and the like. Also, an original speech waveform divided into phonemes is referred to as a “phoneme waveform.” Details on the length and unit of phonemes are described in Non-Patent Documents 1 and 3.
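As an illustration only, the relationship between a stored phoneme waveform and its attribute information could be sketched as the following data structure. The class names, field names, units, and sample values are all hypothetical and are not taken from the documents cited above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AdditionalInfo:
    """Attribute information held for one phoneme waveform (hypothetical fields)."""
    phonemic_environment: str   # e.g. neighbouring phonemes at recording time
    pitch_frequency_hz: float   # pitch frequency of the original speech
    amplitude: float            # amplitude of the original speech
    duration_ms: float          # continuation time length


@dataclass
class PhonemeWaveform:
    """One original speech waveform cut out in phoneme units."""
    phoneme: str
    samples: List[float]
    info: AdditionalInfo


# A toy storage unit: phoneme label -> candidate waveforms for that phoneme.
storage: Dict[str, List[PhonemeWaveform]] = {}

entry = PhonemeWaveform(
    phoneme="a",
    samples=[0.0, 0.1, 0.0, -0.1],
    info=AdditionalInfo("k-a-t", 120.0, 0.1, 80.0),
)
storage.setdefault(entry.phoneme, []).append(entry)
```

Keeping the attribute information next to each waveform is what would allow a selection step to compare candidates against a requested pitch, duration, and phonemic context.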
Text analysis unit 20 performs a morpheme analysis, a syntactic analysis, and analyses such as reading on an input text sentence, and supplies prosodic feature generation unit 21 and phoneme selection unit 22 with a symbol string representative of “reading” and a part of speech, conjugation, accent type and the like of phonemes as text analysis results. Prosodic feature generation unit 21 generates prosodic feature information (information related to a pitch, a time length, power and the like) of synthesized speech based on the text analysis result supplied from text analysis unit 20, and supplies the prosodic feature information to phoneme selection unit 22, prosodic feature control unit 23, and waveform connection unit 24, respectively.
Phoneme selection unit 22 selects, from the phoneme waveforms stored in original speech waveform information storage unit 25, a phoneme waveform that is highly compatible with both the text analysis result supplied from text analysis unit 20 and the prosodic feature information supplied from prosodic feature generation unit 21, and supplies prosodic feature control unit 23 with the selected phoneme waveform together with its additional information.
Prosodic feature control unit 23 generates a waveform having a prosodic feature generated by prosodic feature generation unit 21 from the phoneme waveform selected by phoneme selection unit 22, and supplies the generated waveform (phoneme waveform) to waveform connection unit 24. Waveform connection unit 24 connects the phoneme waveform supplied from prosodic feature control unit 23 to output the connected waveform as synthesized speech.
Because prosodic feature control unit 23 generates a waveform which has a prosodic feature equivalent to the prosodic feature information generated by prosodic feature generation unit 21, the processing it performs differs depending on the type and content of the generated prosodic feature information. In the configuration shown in FIG. 1, since it is assumed that the prosodic feature information generated by prosodic feature generation unit 21 is comprised of information related to three components, namely pitch frequency, continuation time length, and power, prosodic feature control unit 23 comprises pitch frequency control unit 30, continuation time length control unit 36, and power control unit 37. Pitch frequency control unit 30 changes the pitch frequency; continuation time length control unit 36 changes the continuation time length; and power control unit 37 changes the power.
A pitch frequency control scheme generally used in the rule-synthesis type speech synthesis device shown in FIG. 1 rearranges pitch waveforms (waveforms having a time length of several pitch lengths) extracted from original speech waveforms at the pitch cycle of the synthesized speech. Here, the pitch cycle is defined as the inverse of the pitch frequency, and it represents the interval between pitch waveforms. Specifically, pitch waveforms are first extracted from an original speech waveform, using windowing processing or the like, at a pitch cycle that has been estimated in advance. Then, the pitch waveforms are connected at pitch cycle intervals generated from the prosodic feature information of the synthesized speech. The pitch cycle of the original speech waveform is often defined on the basis of the pitch frequency estimated from the original speech waveform.
In pitch frequency control unit 30, pitch cycle acquisition unit 32 first acquires a pitch cycle of a phoneme waveform from original speech prosodic feature information, and pitch waveform extraction unit 35 extracts pitch waveforms from the phoneme waveform at intervals of the pitch cycle acquired by pitch cycle acquisition unit 32. Then, pitch waveform connection unit 34 connects the pitch waveforms extracted by pitch waveform extraction unit 35 at intervals of the pitch cycle of the synthesized speech acquired by pitch cycle acquisition unit 31.
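The extraction and connection steps can be illustrated with a deliberately simplified sketch. It assumes a constant original speech pitch cycle and a constant synthesized speech pitch cycle, both given in samples, a Hann window two pitch cycles long, and hypothetical function names; a practical scheme would follow time-varying pitch cycles and would also adjust the continuation time length, which this sketch does not.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def extract_pitch_waveforms(signal, original_cycle):
    """Cut windowed segments two cycles long, one original pitch cycle apart."""
    length = 2 * original_cycle
    window = hann(length)
    segments = []
    start = 0
    while start + length <= len(signal):
        segments.append([signal[start + i] * window[i] for i in range(length)])
        start += original_cycle
    return segments


def connect_pitch_waveforms(segments, synthesized_cycle):
    """Overlap-add the segments, one synthesized pitch cycle apart."""
    if not segments:
        return []
    length = len(segments[0])
    out = [0.0] * (synthesized_cycle * (len(segments) - 1) + length)
    for k, segment in enumerate(segments):
        offset = k * synthesized_cycle
        for i, value in enumerate(segment):
            out[offset + i] += value
    return out


# Toy input: a sine wave whose pitch cycle is exactly 100 samples.
signal = [math.sin(2.0 * math.pi * n / 100) for n in range(1000)]
segments = extract_pitch_waveforms(signal, original_cycle=100)
output = connect_pitch_waveforms(segments, synthesized_cycle=80)
```

In this toy case the interior of `output` repeats every 80 samples instead of every 100, so the pitch frequency is raised. Because the segments are placed closer together without being duplicated, the output is also shorter than the input.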
The pitch waveform extraction processing can be omitted during the speech synthesis if the pitch waveforms have been extracted and stored in original speech waveform information storage unit 25 in advance. In this event, during the speech synthesis, a pitch waveform, rather than a phoneme waveform, is read from original speech waveform information storage unit 25, and connection processing is performed by pitch waveform connection unit 34. In the following description, a pitch cycle of an original speech waveform is referred to as the “original speech pitch cycle,” and a pitch cycle generated from prosodic feature information of synthesized speech is referred to as the “synthesized speech pitch cycle.” A representative pitch frequency control scheme is the PSOLA scheme described in Non-Patent Document 4. In a speech synthesis scheme which utilizes a linear prediction analysis, predicted residual waveforms, instead of pitch waveforms, are subjected to rearrangement.
In a general pitch frequency control scheme, the pitch cycle and pitch frequency found from an original speech waveform fluctuate, and these fluctuations degrade the quality of synthesized speech. The fluctuation in pitch cycle refers to a phenomenon in which adjacent pitch waveforms slightly differ in pitch cycle from one another. For example, in a section in which the pitch cycle is 200, the time string of estimated pitch cycles changes as 201, 198, 200, 199, 202, . . . . Since no fluctuation component exists in the true original speech pitch cycle, the fluctuation component is thought to be an estimation error of the pitch cycle which is produced when the pitch cycle is obtained from a waveform. When the true original speech pitch cycle and the fluctuation component are regarded as distinct signals, the fluctuation component has a smaller amplitude and power than the true original speech pitch cycle, and is mainly comprised of high frequency components. If the pitch frequency is changed without considering this fluctuation, the synthesized speech is degraded in sound quality.
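The size of the fluctuation in the example sequence can be checked with a short calculation. This sketch assumes that the true pitch cycle is the constant value 200 stated in the example, and it treats the difference from that value as the fluctuation component; the variable names are illustrative.

```python
import math

estimated = [201, 198, 200, 199, 202]   # estimated pitch cycles from the example
true_cycle = 200                        # assumed true original speech pitch cycle

# Fluctuation component: estimation error around the true pitch cycle.
fluctuation = [t - true_cycle for t in estimated]

# Root-mean-square size of the fluctuation, and its size relative to the cycle.
rms = math.sqrt(sum(d * d for d in fluctuation) / len(fluctuation))
relative = rms / true_cycle
```

Here `fluctuation` is `[1, -2, 0, -1, 2]`, and its root-mean-square value is below 1% of the pitch cycle, which matches the description of a component with small amplitude and power.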
To solve the foregoing problem in speech synthesis devices, Patent Document 1 discloses a method, targeting a speech synthesis device which employs a linear prediction analysis, of smoothing original speech pitch cycles when the pitch cycle of a predicted residual waveform is changed. The method of Patent Document 1 smooths a time string of original speech pitch cycles (pitch cycle string) through a moving average, and corrects the pitch cycle of the synthesized speech by using the smoothed original speech pitch cycles. Then, a predicted residual waveform string is generated at the corrected pitch cycle of the synthesized speech.
According to the method described in Patent Document 1, when the frame number is i (where i=0, 1, 2, . . . ), the pitch cycle of the original speech before smoothing is $t_i$, and the pitch cycle of the original speech after smoothing is $t_i'$, the smoothed pitch cycle $t_k'$ of frame k to be smoothed is given by the following equation:
$$t_k' = \frac{1}{2w+1} \sum_{i=-w}^{w} t_{k+i} \qquad \text{[Expression 1]}$$

where w is the window width of the moving average. In Patent Document 1, window width w of the moving average is chosen to be “1.”
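A minimal sketch of this moving average follows, applied to the example sequence 201, 198, 200, 199, 202. The function name is hypothetical. The handling of the first and last frames, where some $t_{k+i}$ do not exist, is an assumption: the window is simply truncated to the available frames, since the text above does not state how boundaries are treated.

```python
def smooth_pitch_cycles(cycles, w=1):
    """Centred moving average over 2w+1 frames, as in Expression 1.

    Assumption: at the ends of the sequence the window is truncated to the
    frames that exist, so fewer than 2w+1 values are averaged there.
    """
    n = len(cycles)
    smoothed = []
    for k in range(n):
        lo = max(0, k - w)
        hi = min(n - 1, k + w)
        window = cycles[lo:hi + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


cycles = [201, 198, 200, 199, 202]
smoothed = smooth_pitch_cycles(cycles, w=1)
```

For the interior frames, each smoothed value is the mean of three neighbouring pitch cycles; for example, the middle frame becomes (198 + 200 + 199) / 3 = 199.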