This invention relates to a variable frame length vocoder, and more particularly to improvements in a dynamic characteristic of the synthesis filter and the compression of the data rate.
A vocoder using the so-called LSP (Line Spectrum Pair) as speech spectrum information has the advantage that high quality synthesized speech is obtainable with a low data rate. The principle and examples of the application of the principle are given in detail in the paper by Fumitada Itakura et al. entitled "A HARDWARE IMPLEMENTATION OF A NEW NARROW TO MEDIUM BAND SPEECH CODING", International Conference on Acoustics Speech and Signal Processing (ICASSP), 1982, pp. 1964 to 1967.
Parameters that carry the spectrum information of speech, such as the LSP parameters, usually change gently over time but occasionally change abruptly. For example, while the parameters change abruptly at a transition part of a vowel or consonant, the change within a voiced sound part is extremely gentle. Consequently, by changing the frame length in accordance with the time-change characteristic of the parameters, further information compression can be attained compared with a vocoder having a fixed frame length. A vocoder based on this scheme is called a variable frame length vocoder, and is proposed in the paper by John M. Turner and Bradley W. Dickinson entitled "A VARIABLE FRAME LENGTH LINEAR PREDICTIVE CODER", International Conference on Acoustics Speech and Signal Processing (ICASSP), 1978, pp. 454 to 457, and the report by Katsunobu Fushikida: "A VARIABLE FRAME RATE SPEECH ANALYSIS-SYNTHESIS METHOD USING OPTIMUM SQUARE WAVE APPROXIMATION", Acoustics Institute of Japan, May 1978, pp. 385 to 386.
The variable frame length vocoder proposed in the former report uses a long frame interval for a portion with gentle change and a short frame interval for a portion with abrupt change in the characteristic of a spectrum power envelope. The latter report describes a technique using an optimum rectangular approximation based on dynamic programming (DP) and is based on the vocoder proposed in the former report. In this technique a predetermined number of frames are classified into a plurality of groups so as to minimize the error of an optimum rectangular approximation, and a representative frame is thus obtained for each group. However, in the above systems the parameter changes abruptly between adjacent representative frames, which may cause the following problems.
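The grouping step described above can be illustrated with a minimal sketch. It assumes that the error is the total squared deviation of each frame from the mean of its group and that the group mean serves as the representative frame; the cited report may use a different error measure. The function name `segment_frames` and its arguments are hypothetical.

```python
import numpy as np

def segment_frames(params, num_groups):
    """Split frames into contiguous groups by dynamic programming.

    Assumption (not from the source): the cost of a group is the squared
    error of its frames around the group mean, i.e. a piecewise-constant
    ("rectangular") approximation of the parameter track.
    Returns the group boundaries and the minimum total error.
    """
    params = np.asarray(params, dtype=float)
    if params.ndim == 1:
        params = params[:, None]          # treat scalars as 1-dimensional vectors
    n = len(params)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and the number of frames")

    # Prefix sums let us evaluate the cost of any group in constant time.
    csum = np.vstack([np.zeros(params.shape[1]), np.cumsum(params, axis=0)])
    csum2 = np.concatenate([[0.0], np.cumsum((params ** 2).sum(axis=1))])

    def cost(a, b):
        """Squared error of frames a..b-1 around their mean."""
        s = csum[b] - csum[a]
        return (csum2[b] - csum2[a]) - (s @ s) / (b - a)

    best = np.full((num_groups + 1, n + 1), np.inf)
    back = np.zeros((num_groups + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for g in range(1, num_groups + 1):
        for b in range(g, n + 1):
            for a in range(g - 1, b):
                c = best[g - 1, a] + cost(a, b)
                if c < best[g, b]:
                    best[g, b] = c
                    back[g, b] = a

    # Trace back the group boundaries from the last frame to the first.
    bounds = [n]
    for g in range(num_groups, 0, -1):
        bounds.append(int(back[g, bounds[-1]]))
    bounds.reverse()
    return bounds, float(best[num_groups, n])
```

Group `g` covers frames `bounds[g]` to `bounds[g + 1] - 1`. Long groups correspond to portions with gentle change and short groups to portions with abrupt change.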
In the variable frame length vocoder, a spectrum information parameter obtained through analysis is applied to the synthesis filter as a filter coefficient to change the transfer function of the synthesis filter each frame period. The quality of the speech synthesized by the synthesis filter is not determined only by the instantaneous value of the transfer function of the synthesis filter, or static characteristic, but depends largely on a change in the transfer function, or dynamic characteristic. When the transfer function changes abruptly and thus the change is nearly stepwise, the so-called "echo sound" is generated which degrades the quality of the synthesized speech. To suppress the echo sound, the representative frame section obtained on the analysis side is conventionally subjected to a linear interpolation to smooth a time change of the parameter, thereby improving the dynamic characteristic of the synthesis filter.
According to this method, however, the spectral characteristic of the synthesized speech does not coincide precisely with that of the input speech signal, so that the synthesized speech sounds unnatural.
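The difference between a stepwise parameter track and a linearly interpolated one can be shown with a short sketch. The function names, the frame-index time axis, and the sample values are assumptions made for illustration only.

```python
import numpy as np

def hold_parameters(rep_times, rep_values, num_frames):
    """Stepwise track: each frame takes the most recent representative value."""
    rep_times = np.asarray(rep_times)
    rep_values = np.asarray(rep_values, dtype=float)
    idx = np.searchsorted(rep_times, np.arange(num_frames), side="right") - 1
    idx = np.clip(idx, 0, len(rep_times) - 1)
    return rep_values[idx]

def interpolate_parameters(rep_times, rep_values, num_frames):
    """Smoothed track: linear interpolation between representative frames."""
    rep_values = np.asarray(rep_values, dtype=float)
    t = np.arange(num_frames)
    # np.interp works on one dimension at a time, so loop over the coefficients.
    columns = [np.interp(t, rep_times, rep_values[:, k])
               for k in range(rep_values.shape[1])]
    return np.column_stack(columns)

# Hypothetical one-dimensional parameter with representative frames at 0 and 4.
step_track = hold_parameters([0, 4], [[0.0], [1.0]], 5)
smooth_track = interpolate_parameters([0, 4], [[0.0], [1.0]], 5)
```

In the stepwise track the parameter jumps at frame 4, whereas in the interpolated track it ramps across frames 0 to 4. The ramp avoids the jump, but its intermediate values are not the analyzed values of those frames, which is the mismatch noted above.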
Among LSP vocoders of the above-mentioned kind, an LSP type pattern matching vocoder is available for carrying out further information compression. The concept of such a pattern matching vocoder is disclosed, for example, in the report by Homer Dudley entitled "Phonetic Pattern Recognition Vocoder for Narrow-Band Speech Transmission", The Journal of the Acoustical Society of America, Vol. 30, No. 8, August 1958, pp. 733 to 739, or the report by Raj Reddy and Robert Watkins: "USE OF SEGMENTATION AND LABELING IN ANALYSIS-SYNTHESIS OF SPEECH", International Conference on Acoustics Speech and Signal Processing (ICASSP), 1977, pp. 28 to 32.
The LSP type pattern matching vocoder selects, from among predetermined reference patterns, the reference pattern most similar to an input pattern by collating (matching) the LSP coefficients obtained by an LSP analyzer with those of the reference patterns, and transmits the selected pattern to the synthesis side together with the sound source information. This method has recently become well known as a method capable of further information compression, and can be easily constituted by adding a pattern matching function and a decoding function to an LPC vocoder.
A parameter space distance is employed as the pattern matching measure in the LSP type pattern matching vocoder. The LSP coefficients can be regarded as a space vector, as can LPC and PARCOR coefficients, and the reference pattern closest to the LSP coefficients of an input speech signal is selected by evaluating the distances. The distance between LSP information regarded as space vectors is indicated by the spectral distance E.sub.i,j given by the following expression (1): ##EQU1## where S.sub.i (.omega.) and S.sub.j (.omega.) indicate logarithmic vectors of frames i and j which are functions of frequency.
In order to select the reference pattern closest to the spectral envelope of the input speech signal from a reference pattern group registered beforehand, the calculation of the spectral distance according to the expression (1) must be carried out for all frames. However, the amount of arithmetic required becomes very large. Therefore, the spectral distance E.sub.i,j given by the following expression (2) is generally used as the matching measure. ##EQU2## where P.sub.k.sup.(i) and P.sub.k.sup.(j) indicate LSP coefficient vectors having S dimensions in frames i and j, respectively, and W.sub.k indicates a weighting coefficient proportional to the LSP spectral sensitivity which is determined according to each LSP coefficient P.sub.k.
The degree of the LSP coefficients corresponds to the degree of an all-pole digital filter constituting a vocal carrier filter to be realized by the LSP coefficients. In an all-pole digital filter of degree S, S line spectra .omega..sub.1, .omega..sub.2, .omega..sub.3, . . . .omega..sub.k . . . .omega..sub.s, called LSP frequencies, are used. The LSP spectral sensitivity W.sub.k indicates the degree of spectral change caused by an infinitesimal change of the LSP coefficient of degree S, for which the LSP frequency spectral sensitivity determined in response to the LSP frequency is normally used.
A distance calculation according to the expression (2) is carried out by taking, for the LSP coefficient of each degree, the square of the difference between the LSP coefficient P.sub.k.sup.(i) of frame i, which is a space feature vector of the analyzed input speech signal, and the space feature vector P.sub.k.sup.(j) registered as the reference pattern, multiplying each squared difference by the weighting coefficient W.sub.k predetermined for the LSP frequency corresponding to that degree, and summing the weighted squared differences over all degrees.
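The distance calculation described in words above, together with the selection of the closest reference pattern, can be sketched as follows. This follows the verbal description of the expression (2) only; the function names and the numeric values in the usage lines are hypothetical.

```python
import numpy as np

def weighted_lsp_distance(p_i, p_j, weights):
    """Weighted sum of squared differences between two LSP coefficient vectors."""
    diff = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    return float(np.sum(np.asarray(weights, dtype=float) * diff * diff))

def best_reference(p_in, reference_patterns, weights):
    """Return the index and distance of the reference pattern closest to p_in."""
    refs = np.asarray(reference_patterns, dtype=float)   # shape (patterns, S)
    diffs = refs - np.asarray(p_in, dtype=float)
    dists = np.sum(np.asarray(weights, dtype=float) * diffs ** 2, axis=1)
    j = int(np.argmin(dists))
    return j, float(dists[j])

# Hypothetical input frame and two reference patterns of degree S = 2.
p_in = [0.1, 0.3]
refs = [[0.1, 0.5], [0.2, 0.3]]
index_equal, _ = best_reference(p_in, refs, [1.0, 1.0])     # equal weights
index_weighted, _ = best_reference(p_in, refs, [10.0, 1.0])  # first coefficient weighted heavily
```

With equal weights the second pattern is selected, while the heavier weight on the first coefficient makes the first pattern the closer one, which shows how strongly the choice of W.sub.k affects the matching result.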
As described above, in the conventional distance calculation according to the expression (2), the LSP frequency spectral sensitivity determined by the LSP frequency is utilized as the weighting coefficient W.sub.k. However, it has been confirmed that the LSP frequency spectral sensitivity also depends on the LSP frequency interval. Therefore, a spectral distance calculation carried out simply according to the expression (2) is not satisfactory as a matching measure and deteriorates the quality of the synthesized voice.