This invention relates to a speech processor having a speech analyzer and synthesizer, which is useful, among others, in speech communication.
Band-compressed encoding of voice or speech sound signals has been increasingly demanded as a result of recent progress in multiplex communication of speech sound signals and in composite multiplex communication of speech sound and facsimile and/or telex signals through a telephone network. For this purpose, speech analyzers and synthesizers are useful.
As described in an article contributed by B. S. Atal and Suzanne L. Hanauer to "The Journal of the Acoustical Society of America," Vol. 50, No. 2 (Part 2), 1971, pages 637-655, under the title of "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," and as disclosed in the U.S. Pat. No. 3,624,302 issued to B. S. Atal, it is possible to regard speech sound as a radiation output of a vocal tract that is excited by a sound source, such as the vocal cords set into vibration. The speech sound is represented in terms of two groups of characteristic parameters, one for information related to the exciting sound source and the other for the transfer function of the vocal tract. The transfer function, in turn, is expressed as spectral distribution information of the speech sound.
By the use of a speech analyzer, the sound source information and the spectral distribution information are extracted from an input speech sound signal and then encoded either into an encoded or a quantized signal for transmission. A speech synthesizer comprises a digital filter having adjustable coefficients. After the encoded or quantized signal is received and decoded, the resulting spectral distribution information is used to adjust the digital filter coefficients. The resulting sound source information is used to excite the coefficient-adjusted digital filter, which now produces an output signal representative of the speech sound.
As the spectral distribution information, it is usually possible to use spectral envelope information that represents a macroscopic distribution of the spectrum of the speech sound waveform and thus reflects the resonance characteristics of the vocal tract. It is also possible to use, as the sound source information, parameters that indicate classification into or distinction between a voiced sound produced by the vibration of the vocal cords and a voiceless or unvoiced sound resulting from a stream of air flowing through the vocal tract (a fricative or an explosive), an average power or intensity of the speech sound during a short interval of time, such as an interval of the order of 20 to 30 milliseconds, and a pitch period for the voiced sound. The sound source information is band-compressed by replacing a voiced and an unvoiced sound with an impulse response of a waveform and a pitch period analogous to those of the voiced sound and with white noise, respectively.
On analyzing speech sound, it is possible to deem the parameters to be stationary during the short interval mentioned above. This is because variations in the spectral distribution or envelope information and the sound source information are the results of motion of the articulating organs, such as the tongue and the lips, and are generally slow. It is therefore sufficient in general that the parameters be extracted from the speech sound signal in each frame period of the above-exemplified short interval. Such parameters are well suited to synthesis or reproduction of the speech sound.
Usually, parameters .alpha. (predictive coefficients) specifying the frequencies and bandwidths of a speech signal and parameters K or the so-called PARCOR coefficients representing the variation in the cross sectional area of the vocal tract with respect to the distance from the larynx are used as the spectral distribution or envelope information. Parameters .alpha. can be obtained by using the well-known LPC technique, that is, by minimizing the mean-squared error between the actual values of the speech samples and their predicted values based on the past predetermined samples. These two parameters can be obtained by recursively processing the autocorrelation coefficients as by the so-called Durbin method discussed in "Linear Prediction of Speech," by J. D. Markel and A. H. Gray, Jr., Springer Verag, Berlin, Heldelberg, New York, 1976, and particularly referring to FIG. 3.1 at page 51 thereof. These parameters .alpha. and K adjust the coefficients of the digital filter, i.e., a recursive filter and a lattice filter, on the synthesis side.
Each of the foregoing parameters obtained on the analysis side (i.e., the transmitter side) is quantized in a preset quantizing step and a constant bit allocation, converted into digital signals and multiplexed.
In general, the occurrence rate distribution of values of some of the aforementioned parameters greatly differs depending on whether the original speech signal is voiced sound or unvoiced sound. A K parameter K.sub.1 of the first order, a short-time mean power, and a predictive residual power, for instance, has an extremely different distribution for voiced sound or unvoiced sound (Reference is made to B. S. Atal and Lawrence R. Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Application to Speech Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-24, No. 3, June, 1976, particularly to p. 203, FIG. 3, FIG. 4 and FIG. 6 of the paper).
However, notwithstanding the fact that the value of K.sub.1 is predominantly in the range of +0.6 to +1.0 for voiced sound (See the paper by B. S. Atal et al. above), encoding bits have been allocated for values in the other range (-1.0 to +0.6) in the conventional apparatus. This is contrary to the explicit objective of reducing the amount of transmission information. Consequently, it is difficult to achieve sufficient reduction of the amount of information to be transmitted, and also to restore the sufficient amount of required information.
It is to be pointed out in connection with the above that the parameters such as the sound source information are very important for the speech sound analysis and synthesis. This is because the results of analysis for deriving such information have a material effect on the quality of the synthesized speech sound. For example, an error in the measurement of the pitch period seriously affects the tone of the synthesized sound. An error in the distinction between voiced and unvoiced sounds renders the synthesized sound husky and crunching or thundering. Any one of such errors thus harms not only the naturalness but also the clarity of the synthesized sound.
As described in the cited reference by B. S. Atal and Lawrence R. Rabiner, it is possible to use various discrimination or decision parameters for the classification or distinction that have different values depending on whether the speech sounds are voiced or unvoiced. Typical discrimination parameters are the average power (short-time mean power), the rate of zero crossings, the maximum autocorrelation coefficient .rho..sub.MAX indicative of the delay corresponding to the pitch period, and the value of K.sub.1.
However, none of the above discrimination parameters are sufficient as voiced-unvoiced decision information individually.
Accordingly, a new technique has been proposed in the Japanese Patent Disclosure Number 51-149705 titled "Analyzing Method for Driven Sound Source Signals", by Tohkura et al.
In this technique, the determination of optimal coefficients and threshold value for a discrimination function is difficult for the following reasons. In general, the coefficients and threshold value are decided by a statistical technique using multivariate analysis discussed in detail in a book entitled "Multivariate Statistical Methods for Business and Economics" by Ben W. Bolch and Cliff J. Huang, Prentice Hall, Inc., Englewood Cliffs, N.J., USA, 1974 especially in Chapter 7 thereof. In this technique, the coefficients and threshold value with the highest discrimination accuracy are determined when the occurrence rate distribution characteristics of the discrimination parameter values for both voiced and unvoiced sounds are a normal distribution with an equal variance. However, inasmuch as the variance of occurrence rate distribution characteristics of K.sub.1 and .rho..sub.MAX selected as the discrimination parameters for voiced and unvoiced sounds differ extremely as stated, no optimal coefficients and threshold value are determined.