The present invention relates to a speech analysis and synthesis apparatus and, more particularly, to an apparatus of this type having a digital filter of improved stability for speech synthesis and having minimized deterioration of speech quality and minimized reduction in transmission information arising from transmission error and quantizing error.
Further reduction in the frequency band used in the encoding of voice signals has been increasingly demanded as a result of the gradually increasing practice of the composite transmission of the speech-facsimile signal combination or the speech-telex signal combination or the use of multiplexed speech signals for the purpose of more effective use of telephone circuits.
In the band reduction encoding, the speech sound is expressed in terms of two characteristic parameters, one for speech sound source information and the other for the transfer function of the vocal tract. In the speech analysis and synthesis technique, the speech waves voiced by a human are assumed to be radiation output signals radiated through the vocal tract which is excited by the vocal cords to function as a speech sound source, and the spectral distribution information equivalent to the speech sound source information and the transfer function information of the vocal tract is sampled and encoded on the speech analyzer side for transfer to the synthesizer side. Upon receipt of the coded information, the synthesizer side uses the spectral distribution information to determine the coefficient of a digital filter for speech synthesis and applies the speech source information to the digital filter to reproduce the original speech signal.
Generally, the spectral distribution information is expressed by the spectral envelope, which represents the spectral distribution and the resonance characteristic of the vocal tract. As is well known, the speech sound source information is the residual signal resulting from the subtraction of the spectral envelope component from the speech sound spectrum. The residual signal has a spectral distribution over the entire frequency range of the speech sound and has a complex waveform, so that representing the residual signal directly in terms of digitized information is not consistent with the aim of band reduction encoding. In general, however, a voiced sound produced by vibration of the vocal cords is represented by a train of impulses which has an envelope shape analogous to the waveform of the voiced sound and the same pitch as that of the voiced sound, while an unvoiced sound produced by air passing turbulently through constrictions in the vocal tract is expressed by white noise. Therefore, the band reduction of the speech sound source information is usually carried out by using the impulse train and the white noise to represent the voiced and unvoiced sounds, respectively.
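The two-state excitation described above can be sketched as follows. This is a minimal illustration in Python, not part of the described apparatus; the function name `make_excitation`, the pitch period of 80 samples and the unit gain are assumptions chosen only for the example.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """Return a simple excitation: impulse train (voiced) or white noise (unvoiced).

    All parameter names and default values are illustrative assumptions.
    """
    if voiced:
        # One impulse every pitch_period samples, zeros elsewhere.
        return [gain if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    # Zero-mean Gaussian white noise stands in for the turbulent source.
    return [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]


voiced_excitation = make_excitation(160, voiced=True, pitch_period=80)
unvoiced_excitation = make_excitation(160, voiced=False)
```

Only the voiced/unvoiced decision, the pitch period and the gain need to be transmitted for such a source, which is what makes the representation compact.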
As described above, the spectral envelope is used to express the spectral distribution information and to distinguish between the voiced and unvoiced sounds, while pitch period and sound intensity are employed for the speech sound source information. The spectral variation of the speech wave is relatively slow because the speech signal is produced through motions of the sound adjusting organs such as the tongue and lips. Accordingly, the spectrum can be regarded as constant over a period of 20 to 30 msec. For analysis and synthesis purposes, therefore, every 20 msec portion of the speech signal is handled as an analysis segment or frame, which serves as a unit for the extraction of the parameters to be transferred to the synthesis side. On the synthesis side, the parameters transferred from the analysis side are used, on a frame-by-frame basis, to control the coefficients of a synthesizing filter and to provide the exciting input for the reproduction of the original speech.
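The division into analysis frames can be sketched as follows. The sampling rate of 8000 Hz and the function name `split_into_frames` are assumptions for illustration; the text above specifies only the 20 msec frame length.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sample sequence into consecutive non-overlapping analysis frames.

    The 8000 Hz sampling rate is an assumed value; a trailing partial frame
    is dropped in this sketch.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


frames = split_into_frames(list(range(500)))
```

Each frame would then be analyzed separately to yield one set of parameters for transfer to the synthesis side.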
To extract the above-mentioned parameters, the so-called linear prediction method is generally used (for details, reference is made to an article titled "Linear Prediction: A Tutorial Review" by JOHN MAKHOUL, PROCEEDINGS OF THE IEEE, VOL. 63, No. 4, APRIL 1975). The linear prediction method is based on the fact that a speech waveform is predictable from linear combinations of immediately preceding waveforms. Therefore, when applied to the speech sound analysis, the sampled speech wave data is generally given as ##EQU1## where $S(n)$ is the sample value of the speech sound at a given time point; $S(n-i)$, the sample value at the time point i samples prior thereto; P, the order of the linear predictor; $\hat{S}_n$, the predicted value of the sample at the given time point; $U_n$, the predictive residual; and $\alpha_i$, the predictor coefficient. The linear predictor coefficient $\alpha_i$ has a predetermined relation with the correlation coefficients taken from the samples. It is therefore obtained by extracting the correlation coefficients and then solving for the predictor coefficients recursively by the so-called Durbin method (reference is made to the above-cited article by JOHN MAKHOUL). The linear predictor coefficient $\alpha_i$ thus obtained indicates the spectral envelope information and is used as the coefficient for the digital filter on the synthesis side.
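For orientation, the linear prediction relation is commonly written in the form below. This is a reconstruction from the symbol definitions given above, not a copy of the original equation; in particular the positive sign on the weighted sum is an assumed convention, and some references (including the cited tutorial) write the predictor with the opposite sign.

```latex
\hat{S}_n = \sum_{i=1}^{P} \alpha_i \, S(n-i),
\qquad
S(n) = \hat{S}_n + U_n
```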
As the parameter representing the spectral envelope of the speech sound, the variation in the cross sectional area of the vocal tract with respect to the distance from the larynx is often employed. This variation corresponds to the reflection coefficient of the vocal tract and is hereinafter called the partial autocorrelation coefficient, PARCOR coefficient or K parameter. The K parameter determines the coefficient of a filter for synthesizing the speech sound. When $|K| > 1$, the filter is unstable, as is known, so that the stability of the filter can be checked by using the K parameter. Thus, the K parameter is of importance. Additionally, the K parameter coincides with the K parameter appearing as an interim parameter in the course of the computation by the above-mentioned recursive method and is expressed as a function of a normalized predictive residual power (see the above-mentioned article by J. MAKHOUL). The normalized predictive residual power is defined as the value obtained by dividing the power of the predictive residual $U_n$ in equation (1) by the power of the speech sound in the analysis frame.
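The recursive computation referred to above, in which the K parameters appear as interim values, can be sketched as follows. This is a minimal floating-point illustration of the Levinson-Durbin recursion, not the arithmetic of any particular apparatus; the function name `durbin` and the sign convention (predicted value as a positive weighted sum of past samples) are assumptions.

```python
def durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].

    Returns (alpha, k, v): predictor coefficients alpha_1..alpha_P,
    K parameters K_1..K_P, and the normalized predictive residual power.
    Assumed convention: s_hat[n] = sum_i alpha_i * s[n - i].
    """
    alpha = [0.0] * (order + 1)   # alpha[0] is unused
    k = [0.0] * (order + 1)       # k[0] is unused
    err = r[0]                    # residual power of the order-0 predictor
    for i in range(1, order + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err
        new_alpha = alpha[:]
        new_alpha[i] = k[i]
        for j in range(1, i):
            new_alpha[j] = alpha[j] - k[i] * alpha[i - j]
        alpha = new_alpha
        err *= 1.0 - k[i] ** 2    # residual power shrinks by (1 - K_i^2)
    return alpha[1:], k[1:], err / r[0]


alpha, k, v = durbin([1.0, 0.5, 0.25], order=2)
```

Checking that every returned K satisfies $|K| < 1$ is the stability test mentioned above; the returned `v` is the normalized predictive residual power for the chosen order.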
Speech analysis and synthesis are discussed in more detail in an article "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave" by B. S. ATAL and SUZANNE L. HANAUER, The Journal of the Acoustical Society of America, Vol. 50, Number 2 (Part 2), 1971, pp. 637 to 655.
The conventional speech analysis and synthesis apparatus of this kind has a very limited computational speed due to the limitation on the scale of the apparatus allowed therefor. An arithmetic unit of limited accuracy, such as one based on a limited word length with fixed-point representation, is usually employed for such apparatus. The normalized predictive residual power is relatively small in a voiced sound with high periodicity but relatively large in an unvoiced sound with low periodicity, and its value becomes lower as the analysis order becomes higher (see the article by ATAL et al., FIG. 5 on page 642, for example).
The conventional speech analysis and synthesis apparatus has a synthesis filter with a fixed number of stages corresponding to the order of the linear predictor. Therefore, when a waveform of extremely high periodicity, i.e., of clear spectral structure, such as the stationary part of a voiced sound, is processed, the normalized predictive residual power tends to be smaller than the smallest significant value that can be handled by the above-mentioned limited accuracy arithmetic. More specifically, this means that the K parameters, which are given as a function of the normalized predictive residual power, tend to satisfy $|K| > 1$, adversely affecting the stability of the synthesis filter. The window processing applied to successive predetermined lengths of sound waveform may help increase the normalized predictive residual power, because the window length rarely equals an integral multiple of the pitch period of the sound even if it is of high periodicity and, consequently, the spectral structure of the sound waveform within a single window length is less clear. Such increased normalized predictive residual power may help avoid the above-mentioned instability of the synthesis filter. However, the use of the window processing does not necessarily produce an increase in the predictive residual power sufficient to contribute to the stability of the synthesis filter, because a high-pitched voice sound, such as a female voice, has sufficient periodicity within a very short window length to lower the predictive residual power.
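The loss of the residual power in limited accuracy arithmetic can be illustrated with the following sketch. It assumes a fixed-point format with 15 fractional bits, truncation after every multiplication, and the standard relation that each analysis stage scales the normalized residual power by $(1 - K_i^2)$; the function names and the example K values are hypothetical.

```python
def to_fixed(x, frac_bits=15):
    """Truncate x onto a fixed-point grid with frac_bits fractional bits (assumed format)."""
    scale = 1 << frac_bits
    return int(x * scale) / scale


def residual_power_fixed(ks, frac_bits=15):
    """Accumulate the normalized residual power stage by stage in fixed point."""
    v = 1.0
    for k in ks:
        v = to_fixed(v * (1.0 - k * k), frac_bits)
    return v


def residual_power_float(ks):
    """The same accumulation without truncation, for comparison."""
    v = 1.0
    for k in ks:
        v *= 1.0 - k * k
    return v


strongly_periodic = [0.999, -0.995]   # hypothetical K values for a highly periodic frame
```

With these assumed values the floating-point result is small but positive, whereas the truncated fixed-point result reaches zero after the second stage, so that any later stage computed from it is no longer meaningful.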
When the linear predictor coefficient for the analysis is made to be of high order while the number of stages of the synthesizing digital filter is reduced to overcome such difficulty, the approximation of the spectral envelope of a less stationary speech sound, or of a voiced sound having a relatively large predictive residual power compared with the arithmetic accuracy, is considerably degraded, deteriorating the quality of the synthesized speech sound.
The calculation of the linear predictor coefficient under a high ambient noise involves errors since the signal wave to be analyzed is the superposition of the ambient noise on the speech wave. The spectral envelope calculated from the linear predictor coefficient affected by the ambient noise is different from the spectral envelope of the original speech wave. Under the influence of the ambient noise, the linear predictor coefficient must be analyzed so as to remove the influence of the ambient noise. Such analysis is usually carried out by using an autocorrelation coefficient as follows. The autocorrelation coefficient $\rho_{(SN)(SN)\tau}$ of a noise-affected speech sound at a delay $\tau$ is given as ##EQU2## where $S_0, S_1, S_2, \ldots$ are a series of samples of a speech sound wave; $N_0, N_1, N_2, \ldots$, a series of samples of a noise wave; $S_0 + N_0, S_1 + N_1, S_2 + N_2, \ldots$, a series of samples of a noise-affected speech sound; N, the number of samples of a waveform to be analyzed; and i, the index of each sample. The right side of the above equation is rewritten in terms of correlation coefficients:
$$\rho_{(SN)(SN)\tau} = \rho_{(S)(S)\tau} + \rho_{(SN)(N)\tau} - \rho_{(N)(N)\tau} + \rho_{(N)(SN)\tau}$$
where ##STR1## Generalizing the delay $\tau$, $\rho_{(SN)(SN)\tau}$ is defined as the first autocorrelation coefficient and $(\rho_{(SN)(N)\tau} - \rho_{(N)(N)\tau} + \rho_{(N)(SN)\tau})$ is defined as the second autocorrelation coefficient. Under this definition, the autocorrelation of a speech sound is expressed as the difference between the first and second autocorrelation coefficients.
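The relation between the first and second autocorrelation coefficients can be checked numerically with the following sketch. It assumes unnormalized correlation sums over the overlapping samples and, importantly, a noise reference that is identical to the noise contained in the noise-affected speech; the function names and the random test signals are hypothetical.

```python
import random


def xcorr(x, y, tau):
    """Unnormalized correlation: sum of x[i] * y[i + tau] over the overlapping range."""
    return sum(x[i] * y[i + tau] for i in range(len(x) - tau))


def speech_autocorr_estimate(noisy, noise, tau):
    """Speech autocorrelation as (first coefficient) minus (second coefficient)."""
    first = xcorr(noisy, noisy, tau)
    second = (xcorr(noisy, noise, tau)
              - xcorr(noise, noise, tau)
              + xcorr(noise, noisy, tau))
    return first - second


rng = random.Random(1)
speech = [rng.uniform(-1.0, 1.0) for _ in range(64)]
noise = [rng.uniform(-0.3, 0.3) for _ in range(64)]
noisy = [s + n for s, n in zip(speech, noise)]
```

Under this idealized assumption the estimate reproduces the autocorrelation of the speech alone for every delay; the relation holds exactly only when the separately detected noise matches the noise superposed on the speech.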
As described above, to obtain the parameter that correctly expresses only the feature of the speech sound under high ambient noise, the autocorrelation of the speech sound is expressed in terms of the difference between the first and second autocorrelation coefficients. More specifically, a conventional method employs an acoustic-to-electrical signal converting unit for noise detection as well as an acoustic-to-electrical signal converting unit for speech signal detection. With these units, the acoustic signal from a noise source and the acoustic signal from a speaker are detected as a composite acoustic signal while at the same time only the acoustic signal derived from the noise source is detected. Then, the autocorrelation coefficient of the noise-affected speech sound and the autocorrelation coefficient of the noise are measured. Following this, the correlation coefficient between the noise-affected speech signal and the noise is measured from the above two kinds of signals. Similarly, the correlation coefficient between the noise and the noise-affected speech signal is measured. Then, the autocorrelation coefficient of the speech sound signal is calculated on the basis of the first and second autocorrelation coefficients, and the linear predictor coefficient is calculated on the basis of the autocorrelation coefficient of the speech signal. In the conventional method, however, when the spatial distances from the noise source to the acoustic-to-electrical signal converters for signal detection and noise detection are different from each other, no linear relation or similarity exists between the input speech signals to the two converting units.
Therefore, the relation assumed among the autocorrelation coefficient of the speech signal, the autocorrelation coefficient of the noise-affected speech signal, the autocorrelation coefficient of the noise, the correlation coefficient between the noise-affected speech signal and the noise, and the correlation coefficient between the noise and the noise-affected speech signal may be inaccurate.
As a result, there is a possibility that the autocorrelation coefficient measured for the speech sound at delay $\tau$ becomes larger than that of the sound per se. Specifically, when the autocorrelation value at zero delay is normalized to "1", the autocorrelation value of the speech sound measured at delay $\tau$ may be closer to "1" than that of the speech sound per se and, as the case may be, may exceed "1". When the autocorrelation value exceeds "1", the synthesizing filter whose coefficient is the linear predictor coefficient calculated from the autocorrelation coefficient becomes unstable. This is seen, for example, from the fact that when the linear predictor is of first order, the K parameter, which is the interim parameter in the calculation of the linear predictor coefficient by the Durbin method, exceeds "1".
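The first-order case mentioned above can be illustrated with the following sketch. For a first-order predictor the K parameter equals the autocorrelation at delay 1 divided by that at delay 0, and the synthesis filter is assumed here to be the single-pole recursion y[n] = x[n] + K y[n-1]; the function names and the example autocorrelation values are hypothetical.

```python
def first_order_k(rho0, rho1):
    """K parameter of a first-order predictor: the normalized lag-1 autocorrelation."""
    return rho1 / rho0


def impulse_response(k, length):
    """Impulse response of the assumed synthesis filter y[n] = x[n] + k * y[n-1]."""
    out = []
    prev = 0.0
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        prev = x + k * prev
        out.append(prev)
    return out


# A measured lag-1 value below the lag-0 value gives a decaying response;
# an erroneous value above it gives a growing one.
stable = impulse_response(first_order_k(1.0, 0.9), 50)
unstable = impulse_response(first_order_k(1.0, 1.05), 50)
```

The growing response in the second case is the instability described above: a normalized autocorrelation value exceeding "1" yields a K parameter exceeding "1".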
The above-mentioned conventional method to obtain the linear predictor coefficient for the purpose of expressing correctly only the feature of the speech sound under the condition of high ambient noise, has a disadvantage that the speech synthesis filter with the obtained linear predictor coefficient as its coefficient becomes unstable because of the influence of noise. As described above, the conventional method first measures the autocorrelation coefficient of the speech sound on the basis of the autocorrelation coefficient of the noise-affected speech sound, the autocorrelation of noise, the correlation coefficient between the noise-affected speech sound and noise, and the correlation coefficient between noise and the noise-affected speech sound, and then obtains the linear predictor coefficient depending on the autocorrelation coefficient measured of the speech sound.
Evidently, the conventional method suffers from the same disadvantage when the noise source has a spatially large volume, or when the transfer function in the acoustic area ranging from the noise source to the converter for speech sound detection is different from that in the acoustic area from the noise source to the converter for noise detection.
In the characteristic parameters of the speech sound obtained on the analysis side, the speech sound source information, particularly the normalized predictive residual power representative of the amplitude information, or the combined parameter of a short-time average power and a normalized predictive residual power, has a much larger rate of time variation than that of the linear predictor coefficient $\alpha$ or the K parameter. This arises from the fact that, while the K parameter representative of the reflection coefficient of the vocal tract depends on the cross sectional area of the vocal tract changing with muscular motion of a human and therefore varies slowly with time, the normalized predictive residual power $U_p$, as expressed by
$$U_p = \prod_{i=1}^{p} \left(1 - K_i^2\right) \qquad (2)$$
where $K_i$ is the K parameter of i-th order and p is the analysis order, is affected multiplicatively by the changes of all the respective $K_i$'s, and therefore its variation is complicated and steep.
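The steepness of this variation can be illustrated with the following sketch, which assumes the product form of equation (2) over all orders. The function name and the example K values, including the 2 percent shift applied to each of them, are hypothetical.

```python
def normalized_residual_power(ks):
    """Normalized predictive residual power as the product of (1 - K_i^2) over all orders."""
    u = 1.0
    for k in ks:
        u *= 1.0 - k * k
    return u


base = [0.9, 0.8, 0.7]                  # hypothetical K parameters of one frame
shifted = [k * 1.02 for k in base]      # every K parameter changed by 2 percent

u_base = normalized_residual_power(base)
u_shifted = normalized_residual_power(shifted)
relative_change = (u_shifted - u_base) / u_base
```

In this example a 2 percent change in each K parameter changes the normalized predictive residual power by roughly one quarter, because the small changes in the individual factors multiply together.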
For this reason, in the analysis of the parameters including the normalized predictive residual power, the analysis frame length must be set shorter than that required for analyzing the other parameters such as the linear predictor coefficient, resulting in an increase in the required transmission capacity.
Since the time variation of the parameters including the normalized predictive residual power is significant, the parameters are easily influenced by transmission errors due to external and internal causes in the course of the transmission. Further, when the parameters are quantized, they involve quantization error. When the normalized predictive residual power influenced by such errors is applied as the amplitude information of the original speech sound to the synthesizing filter, the reproducibility of the amplitude is, of course, poor. Specifically, in the conventional apparatus, the linear predictor coefficient representative of the spectral envelope of the speech sound is exactly consistent with the normalized predictive residual power on the analysis side, while, on the synthesis side, the normalized predictive residual power is largely influenced by the above errors but the linear predictor coefficient is little affected by them. Therefore, the speech sound synthesized by using both factors is poor in amplitude reproducibility.