1. Field of the Invention
This invention relates to a rule-based speech synthesis device that synthesizes speech, and more particularly to a rule-based speech synthesis device that synthesizes speech from an arbitrary vocabulary.
2. Description of Related Art
Text-to-speech conversion (the conversion of a text document into audible speech) has hitherto been configured from a text analysis part and a rule-based speech synthesis part (parameter generation part and waveform synthesis part).
Text containing a mixture of kanji and kana characters (a Japanese-language text document) is input to the text analysis part, where this document is subjected to morphological analysis by referring to a word dictionary, the pronunciation, accentuation and intonation of each morpheme are analyzed (if necessary, syntactic and semantic analysis and the like are also performed), and then phonological symbols (intermediate language) with associated prosodic symbols are output for each morpheme.
In the parameter generation part, prosodic parameters such as pitch frequency patterns, phoneme duration times, pauses and amplitudes are set for each morpheme.
In the waveform synthesis part, speech synthesis units in the target phoneme sequence (intermediate language) are selected from previously stored speech data, and waveform synthesis processing is performed by concatenating/modifying the reference data of these speech synthesis units according to the parameters determined in the parameter generation part. The types of speech synthesis units that have been tried include phonemes, syllables (CV), and VCV/CVC units (C = consonant, V = vowel). Although phonemes have the fewest possible representations, using them makes it essential to incorporate rules for coarticulation, which is not easy to do. Consequently, the resulting synthesized speech has had poor quality, and phonemes are now seldom used as speech synthesis units. On the other hand, CV, VCV and CVC units include coarticulation within each unit. For example, since a VCV type comprises a consonant between two vowels, the consonant part is very clear. And since CVC types are concatenated at consonants, which have small amplitude, the concatenation distortion is small. Recently, units consisting of even longer phonetic chains have also been partially used as speech synthesis units.
For the speech data in the speech synthesis units, a method has come to be used whereby original audio waveforms are used unaltered, which yields high-quality synthesized sound with little degradation.
To obtain more natural-sounding synthesized speech with the abovementioned conventional text-to-speech conversion, it is of great importance to control the parameters set in the abovementioned parameter generation part (pitch frequency pattern, phoneme duration time, pauses, amplitude) appropriately so that they approximate natural speech, while taking into account the type of speech synthesis units, the speech segment quality and the synthesis procedure.
Of these parameters, methods for controlling the phoneme duration time in particular have hitherto been described in Reference 1 (Japanese Patent Application Laid-Open No. S63-46498) and Reference 2 (Japanese Patent Application Laid-Open No. H4-134499).
The techniques described in the abovementioned References 1 and 2 are methods which use a statistical model (Hayashi's first method of quantification) to obtain control rules by analyzing a large amount of data. As is well known, Hayashi's first method of quantification is a multivariate analysis technique wherein the target external criterion (phoneme duration time) is calculated based on qualitative factors, and is formulated as shown in Formulae (1) through (3) below.
That is, if j is an item of the ith data element, k is the category to which it belongs, and x(jk) is the category quantity thereof (the coefficient associated with the category), then the estimated values y(i) are given by Formula (1).

$$ y(i) = \sum_{j} \sum_{k} x(jk)\,\delta(jk) \tag{1} $$

where:

$$ \delta(jk) = \begin{cases} 1 & \text{(when data } i \text{ corresponds to category } k \text{ of item } j\text{)} \\ 0 & \text{(otherwise)} \end{cases} \tag{2} $$
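As a minimal illustration of Formulae (1) and (2), the sketch below sums the category quantities selected by a single data element. The item names, category names and millisecond values are invented purely for illustration; they are assumptions and do not come from References 1 or 2.

```python
# Hypothetical category quantities x(jk): item j -> category k -> coefficient.
# All names and values here are illustrative assumptions, not measured data.
CATEGORY_QUANTITIES = {
    "phoneme": {"a": 90.0, "k": 60.0},
    "position_in_phrase": {"initial": 5.0, "medial": 0.0, "final": 20.0},
}


def estimate_duration(sample, quantities):
    """Formula (1): sum x(jk) * delta(jk) over all items j and categories k.

    `sample` maps each item j to the single category k that the data
    element falls into, so delta(jk) is 1 for exactly that pair and 0 for
    every other category of the item (Formula (2)).
    """
    total = 0.0
    for item, categories in quantities.items():
        for category, coefficient in categories.items():
            delta = 1 if sample[item] == category else 0
            total += coefficient * delta
    return total


example = {"phoneme": "a", "position_in_phrase": "final"}
estimated = estimate_duration(example, CATEGORY_QUANTITIES)
print(estimated)  # 90.0 + 20.0 = 110.0
```

Because exactly one category per item is active, the double sum reduces to one table lookup per item; the explicit delta is kept only to mirror the notation of the formulae.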
x(jk) is determined by the method of least squares. That is, it is determined by minimizing the squared error between the estimated values y(i) and the actual measured values Y(i).

$$ \sum_{i} \{\, y(i) - Y(i) \,\}^{2} \rightarrow \text{minimum} \tag{3} $$
Formula (3) is minimized by partially differentiating it with respect to each x(jk) and setting the result to zero. When a computer is used to perform the actual calculations, this becomes a numerical analysis problem of solving simultaneous equations.
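The least-squares step of Formula (3) can be sketched as follows: the delta(jk) values are arranged as a 0/1 matrix, the simultaneous (normal) equations are formed, and they are solved by Gaussian elimination. This is only an illustrative sketch, not the procedure of References 1 or 2. The data, the item and category names, and the choice to fix the first category of every item after the first at x(jk) = 0 (one common way of making the solution unique) are all assumptions.

```python
def build_design_matrix(samples, items):
    """Return the (item, category) columns and the 0/1 rows of delta(jk)."""
    columns = []
    for index, (item, categories) in enumerate(items):
        # Assumption: for every item after the first, the first category is
        # fixed at x(jk) = 0; coding every category of every item would make
        # the simultaneous equations rank deficient.
        kept = categories if index == 0 else categories[1:]
        columns.extend((item, category) for category in kept)
    rows = [
        [1.0 if sample[item] == category else 0.0 for item, category in columns]
        for sample in samples
    ]
    return columns, rows


def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: some category quantity is not determined by the data")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_category_quantities(samples, measured, items):
    """Minimize the sum over i of {y(i) - Y(i)}^2 via the normal equations."""
    columns, rows = build_design_matrix(samples, items)
    n = len(columns)
    # Normal equations (A^T A) x = A^T Y, obtained by setting the partial
    # derivative of Formula (3) with respect to each x(jk) to zero.
    ata = [[sum(row[p] * row[q] for row in rows) for q in range(n)] for p in range(n)]
    aty = [sum(row[p] * y for row, y in zip(rows, measured)) for p in range(n)]
    return dict(zip(columns, solve_linear_system(ata, aty)))


# Hypothetical items, categories and measured durations Y(i) in milliseconds.
ITEMS = [("phoneme", ["a", "k"]), ("position", ["medial", "final"])]
SAMPLES = [
    {"phoneme": "a", "position": "medial"},
    {"phoneme": "a", "position": "final"},
    {"phoneme": "k", "position": "medial"},
    {"phoneme": "k", "position": "final"},
]
MEASURED = [90.0, 110.0, 60.0, 80.0]

fitted = fit_category_quantities(SAMPLES, MEASURED, ITEMS)
for (item, category), value in fitted.items():
    print(item, category, round(value, 3))
```

With this toy data the durations are exactly additive, so the fit recovers 90 for phoneme "a", 60 for phoneme "k" and 20 for the "final" position. With real speech data the residual of Formula (3) would generally be non-zero.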
In the abovementioned conventional phoneme duration time control methods, categorization into the form required by Hayashi's first method of quantification does not always work well, making it impossible to achieve adequate estimation precision. Also, these conventional methods make no mention of methods for setting the closing length in phonemes having a closing interval (such as unvoiced plosive consonants). Accordingly, there have hitherto been no methods for appropriately controlling the closing interval length, which is of great perceptual importance.
The principal object of the present invention is to provide a rule-based speech synthesis device that can estimate phoneme duration times more accurately, with smaller estimation errors and better control functions. In particular, it aims to provide a suitable closing time length control method for phonemes having a closing interval (such as unvoiced plosive consonants). As a result, an object of the present invention is to provide a rule-based speech synthesis device with improved quality.