A speech synthesis method is a technique for automatically generating a synthesized speech signal from input prosodic information. According to the prosodic information, such as phonemic symbols, phonemic time length, pitch pattern, and power, characteristic parameters of small units (synthesis units) such as syllables, phonemes, or single pitch periods are selected from a unit dictionary memory. After the pitch and the continuous time length are controlled, the characteristic parameters are connected to generate a synthesized speech signal. This synthesis-by-rule speech synthesis technique is used in text-to-speech systems to artificially generate a speech signal from an arbitrary text.
In this speech synthesis technique, in order to improve the quality of the synthesized speech signal, the characteristic parameters of a synthesis unit are either a waveform extracted from speech data, or a pair consisting of a speech source signal obtained by analyzing the speech data and coefficients representing the characteristics of the synthesis filter.
In the latter case, in order to further improve the quality of the synthesized speech, a large number of synthesis units, each consisting of a speech source signal and filter coefficients, are stored in the unit dictionary. Suitable synthesis units are selected from the unit dictionary and connected to generate the synthesized speech. In this method, in order to avoid an increase in the memory capacity of the unit dictionary, the unit dictionary is coded in advance. When synthesizing a speech signal, the coded unit dictionary is decoded by referring to the codebooks.
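As a concrete illustration, a vector-quantized unit dictionary stores only small integer indices, and decoding is a table lookup into a codebook of stored code vectors. The following sketch uses toy codebook sizes and random data; none of the names or dimensions come from the apparatus described here, and they are chosen only to show the lookup principle.

```python
import numpy as np

# Toy codebook for a vector-quantized dictionary: 2**3 = 8 code vectors,
# each of dimension 4 (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))

def decode(index, codebook):
    """Decoding a coded parameter is simply a codebook lookup by index."""
    return codebook[index]

# A coded synthesis unit reduces to a short list of indices; decoding it
# recovers the stored parameter vectors without keeping them in the dictionary.
coded_unit = [3, 5, 0]
decoded = [decode(i, codebook) for i in coded_unit]
```

The memory saving comes from storing a few bits per index in the dictionary while the full-precision vectors live once in the shared codebook.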
FIG. 1 is a block diagram of a speech synthesis apparatus using coded unit dictionary information according to the prior art. First, according to the phonemic symbols 100, the phonemic time length 101, the pitch pattern 102, and the power 103, a unit selection section 10 selects a coded representative synthesis unit from the unit dictionary memory 11. FIG. 2 is a schematic diagram of the coded synthesis unit in the unit dictionary memory 11. As shown in FIG. 2, the linear predictive coefficients used as the filter coefficients of the synthesis filter are stored as a code index 113 into a linear predictive coefficient codebook 22 (hereinafter referred to as the linear predictive coefficient index 113). The speech source signal is stored as a code index 111 into a speech source signal codebook 21 (hereinafter referred to as the speech source signal index 111). The gain is stored as a code index 110 into a gain codebook 20 (hereinafter referred to as the gain index 110).
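Conceptually, each coded synthesis unit of FIG. 2 therefore reduces to three small indices, one per codebook. A minimal sketch of such a record (the class and field names are illustrative assumptions, not from the specification):

```python
from dataclasses import dataclass

@dataclass
class CodedSynthesisUnit:
    """One coded unit as suggested by FIG. 2: three codebook indices."""
    gain_index: int    # index 110 into the gain codebook 20
    source_index: int  # index 111 into the speech source signal codebook 21
    lpc_index: int     # index 113 into the linear predictive coefficient codebook 22

unit = CodedSynthesisUnit(gain_index=2, source_index=17, lpc_index=5)
```

Storing a unit this way costs only the sum of the three index widths in bits, regardless of how long the underlying parameter vectors are.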
The coded synthesis unit selected by the unit selection section 10 is input to a synthesis unit decoder 12. In the synthesis unit decoder 12, a linear predictive coefficient requantizer 25 selects the code vector corresponding to the linear predictive coefficient index 113 from the linear predictive coefficient codebook 22 and outputs a requantized (decoded) linear predictive coefficient 122. A speech source signal requantizer 24 selects the code vector corresponding to the speech source signal index 111 from the speech source signal codebook 21 and outputs a requantized (decoded) speech source signal. A gain requantizer 23 selects the code vector corresponding to the gain index 110 from the gain codebook 20 and outputs a requantized (decoded) gain 120. A gain multiplier 27 multiplies the speech source signal decoded by the speech source signal requantizer 24 by the gain 120. The linear predictive coefficient 122 decoded by the linear predictive coefficient requantizer 25 is supplied to the synthesis filter 13 as filter coefficient information. The synthesis filter 13 filters the speech source signal 121 scaled by the gain 120 and generates a speech signal 123. A pitch/time length controller 14 controls the pitch and the time length of the speech signal 123. A unit connection section 15 connects a plurality of speech signals whose pitch and time length have been controlled in this way, and the synthesized speech signal 104 is output.
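The decoding path of FIG. 1 — three codebook lookups, gain scaling, and all-pole synthesis filtering — can be sketched as follows. The codebook contents, vector dimensions, and coefficient values below are toy data chosen only for illustration; they are assumptions, not values from the apparatus.

```python
import numpy as np

rng = np.random.default_rng(1)
gain_codebook = np.array([0.5, 1.0, 2.0])            # gain codebook 20 (toy)
source_codebook = rng.standard_normal((4, 80))       # speech source codebook 21 (toy)
lpc_codebook = np.array([[0.5, -0.2], [0.3, 0.1],
                         [0.7, -0.3], [0.2, 0.05]])  # LPC codebook 22 (toy)

def synthesize(gain_index, source_index, lpc_index):
    gain = gain_codebook[gain_index]            # gain requantizer 23
    excitation = source_codebook[source_index]  # speech source requantizer 24
    a = lpc_codebook[lpc_index]                 # LPC requantizer 25
    x = gain * excitation                       # gain multiplier 27
    # Synthesis filter 13: all-pole filter 1 / (1 - sum_k a[k] z^-(k+1)),
    # i.e. y[n] = x[n] + sum_k a[k] * y[n-1-k]
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + sum(a[k] * y[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0)
    return y

speech = synthesize(gain_index=1, source_index=2, lpc_index=0)
```

In the actual apparatus this per-unit output would then be sent through the pitch/time length controller 14 and the unit connection section 15; those stages are omitted here.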
In this synthesis-by-rule system, the quality of the coded synthesis units in the unit dictionary memory largely determines the quality of the synthesized speech.
In order to raise the quality of the speech, in other words, in order to suppress the degradation of synthesized-speech quality caused by coding, the number of bits used for coding the synthesis units must be increased. However, if the number of coding bits increases, the memory capacity required for the gain codebook 20, the speech source signal codebook 21, and the linear predictive coefficient codebook 22 increases greatly. Especially when vector quantization is applied to the coding, the required memory capacity increases exponentially with the number of bits used for coding the representative synthesis units. Conversely, if the number of coding bits is decreased to reduce the memory capacity requirement, the quality of the synthesized speech degrades.
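The exponential growth can be made concrete: a b-bit vector quantizer must store 2^b code vectors, so every additional coding bit doubles the codebook size. A small hypothetical calculation (the vector dimension of 10 is an arbitrary example):

```python
def codebook_entries(bits, vector_dim):
    """Number of stored values in a b-bit VQ codebook: 2**bits code vectors,
    each of the given dimension."""
    return (2 ** bits) * vector_dim

# Each extra bit doubles the storage, e.g. for 10-dimensional code vectors:
sizes = {bits: codebook_entries(bits, vector_dim=10) for bits in (8, 10, 12)}
# 8 bits -> 2560 values, 10 bits -> 10240, 12 bits -> 40960
```

This is the trade-off the paragraph above describes: each added bit halves quantization distortion only at the price of doubled codebook memory.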