This invention relates in general to systems that reduce or remove perceptual distortion in distorted speech signals and, more specifically, to speech signals that have been reconstructed from a coded bit stream and that contain distortion resulting from the encoding-decoding process.
A large number of methods for removing or reducing audible distortion in speech signals currently exist. Methods designed for speech with acoustic background noise (such as car noise or so-called babble noise) are generally based on the assumption that the corrupting signal and the speech signal are statistically independent. In the case of speech-correlated noise, however, the corrupting signal and the speech signal are not statistically independent. As a result, methods aimed at removing or reducing acoustic background noise (a typical example being described in the paper by Y. Ephraim and H. L. van Trees, “A signal subspace approach for speech enhancement”, IEEE Transactions on Speech and Audio Processing, Vol. 3, pp. 251–266, 1995) generally do not perform well on speech-correlated noise.
Existing enhancement systems for speech-correlated noise can be motivated using conventional source coding theory for stationary Gaussian processes (signals) with a mean-squared-error distortion criterion, which is well known to persons skilled in the art. (Although speech signals do not have Gaussian distributions, it is generally held that this theory provides a good approximation for many types of signals.) For example, consider the decoded signal obtained by encoding a stationary Gaussian signal at a finite rate, R. The reconstructed signal corresponding to the minimum mean-squared-error distortion between encoder and decoder can then be shown to have a power spectrum that is not identical to that of the original signal. Specifically, the power spectrum of the reconstructed signal equals the power spectrum of the original signal minus the mean squared error at each frequency. In general, the reconstructed signal therefore has lower energy than the original signal. The decrease in the power spectrum is proportionally strongest in regions of low energy. In other words, the energy of the spectral valleys decreases proportionally more than that of the spectral peaks, thus emphasizing the spectral shape.
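The relationship between the original and reconstructed power spectra can be illustrated with a minimal numerical sketch. It assumes the standard "reverse water-filling" result of rate-distortion theory for Gaussian sources, in which the error spectrum equals a constant level theta wherever the signal spectrum exceeds theta, and equals the signal spectrum elsewhere. The function name `reverse_water_filling`, the discretization of the power spectrum into equal-width frequency bins, and the bisection search are illustrative assumptions and are not taken from the cited works.

```python
import math


def reverse_water_filling(psd, rate_bits):
    """Return (theta, recon_psd) for a Gaussian source.

    psd       -- strictly positive power-spectrum samples on equal-width bins
    rate_bits -- target rate in bits per sample (average over the bins)
    theta     -- the "water level", i.e. the error power in each coded bin
    recon_psd -- power spectrum of the reconstruction, max(psd - theta, 0)
    """
    def rate_for(theta):
        # Bins below the water level receive no bits at all.
        return sum(max(0.0, 0.5 * math.log2(s / theta)) for s in psd) / len(psd)

    # The rate decreases monotonically with theta, so bisect on a log scale.
    lo, hi = 1e-12 * max(psd), max(psd)
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if rate_for(mid) > rate_bits:
            lo = mid  # too many bits spent: raise the water level
        else:
            hi = mid
    theta = math.sqrt(lo * hi)
    recon_psd = [max(s - theta, 0.0) for s in psd]
    return theta, recon_psd


# A spectral "peak" of power 4 and a spectral "valley" of power 1.
psd = [4.0, 1.0]
theta, recon = reverse_water_filling(psd, rate_bits=1.0)
ratios = [r / s for r, s in zip(recon, psd)]
print(theta, recon, ratios)
```

In this two-bin example the valley retains a smaller fraction of its original power than the peak does, which is the proportionally stronger attenuation of spectral valleys described above.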
In speech-coding algorithms, the analysis and synthesis models are generally identical. Thus, the results of source coding theory for Gaussian signals motivate an emphasis of the spectrum of the reconstructed signal by means of a post-filter. In a speech coder, the spectral structure of the signal is generally described by a set of signal-model parameters, and by filtering the output signal of the coder with an appropriate post-filter derived from the parameters, the spectral structure of the reconstructed signal can be emphasized. In general, this emphasis can be performed separately for the spectral fine structure and for the spectral envelope. For good performance, the emphasis of the output speech signal spectrum must be combined with an appropriate adjustment of the encoding. That is, the perceptual weighting that is generally present in the encoder part of state-of-the-art speech coders must be adjusted to account for the post-filter. The combination of a modified encoder and a decoder with added post-filter approximates a coding structure that is optimal for Gaussian signals. State-of-the-art coded-speech enhancement systems can generally be traced back to the work of Ramamoorthy and Jayant (V. Ramamoorthy and N. S. Jayant, “Enhancement of ADPCM Speech by Adaptive Postfiltering”, AT&T Bell Labs. Tech. J., 1465–1475, 1984), who introduced an adaptive post-filter structure for the enhancement of coded speech.
The basic method of adaptive post-filtering was improved upon by Chen and Gersho (J.-H. Chen and A. Gersho, “Real-Time Vector APC Speech Coding at 4800 bps with Adaptive Postfiltering”, Proc. Int. Conf. Acoust. Speech Sign. Processing, Dallas, 2185–2188, 1987). They introduced the adaptive post-filter structure containing both poles and zeros that is commonly in use today. Typically, this structure is used for the well-known class of linear-prediction-based analysis-by-synthesis coders. A good overview of the various flavors of adaptive post-filtering for coded speech enhancement on linear-prediction-based (or auto-regressive, AR, model-based) speech coders was given in a paper by Chen and Gersho in 1995 (J.-H. Chen and A. Gersho, “Adaptive Postfiltering for Quality Enhancement of Coded Speech”, IEEE Trans. Speech Audio Process., 3, 1, 59–71, 1995). In the 1995 Chen and Gersho paper, it is shown that, generally, separate post-filters are used to enhance the spectral fine structure and the spectral envelope. In all these methods, the adaptive post-filter parameter settings are based on the linear predictor of the speech coder. Feedback is used only to ensure that the short-term signal power of the enhanced signal approximates that of the distorted signal.
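A pole-zero post-filter for the spectral envelope of the kind discussed above can be sketched as follows. The sketch assumes the commonly used form H(z) = A(z/beta) / A(z/alpha), where A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p is the linear-prediction analysis filter of the coder and 0 < beta < alpha < 1. The function names, the illustrative values alpha = 0.8 and beta = 0.5, and the simple per-block energy matching used in place of a sample-by-sample gain control are assumptions made for illustration. Spectral-tilt compensation and the fine-structure (pitch) post-filter are omitted.

```python
import math


def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma), given A(z) with a[0] == 1."""
    return [c * gamma ** k for k, c in enumerate(a)]


def pole_zero_postfilter(x, a, alpha=0.8, beta=0.5):
    """Filter block x with H(z) = A(z/beta) / A(z/alpha), then rescale.

    x     -- decoded (distorted) speech samples for one block
    a     -- linear-predictor coefficients [1, a1, ..., ap] from the coder
    alpha -- controls the poles (assumed illustrative value)
    beta  -- controls the zeros (assumed illustrative value)
    """
    num = bandwidth_expand(a, beta)   # zeros of the post-filter
    den = bandwidth_expand(a, alpha)  # poles of the post-filter
    y = []
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    # Simplified gain control: match the block energy of the output to
    # that of the input so the short-term power is roughly preserved.
    energy_in = sum(v * v for v in x)
    energy_out = sum(v * v for v in y)
    gain = math.sqrt(energy_in / energy_out) if energy_out > 0.0 else 1.0
    return [gain * v for v in y]


# First-order predictor and a unit impulse as a toy input block.
a = [1.0, -0.9]
x = [1.0] + [0.0] * 31
y = pole_zero_postfilter(x, a)
print(sum(v * v for v in x), sum(v * v for v in y))
```

Because the poles and zeros are both derived from the same predictor, setting alpha equal to beta makes the filter an identity, and increasing the gap between them strengthens the emphasis of the spectral envelope.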
Particular care must be taken with the post-filter associated with the spectral fine structure. To prevent discontinuities in the short-term correlations whenever the spectral-fine-structure post-filter is adapted, this fine-structure post-filter is generally located prior to the autoregressive (AR) filter used to reconstruct the speech spectral envelope. Since the post-filter associated with the spectral fine structure has an implicit delay, the location of this post-filter results in a mismatch between the time location of the spectral envelope and the spectral fine structure. This problem can be mitigated with a solution described in publications by Kleijn (W. B. Kleijn, “Improved Pitch-period Prediction”, Proc. IEEE Workshop on Speech Coding for Telecomm., Sainte-Adele, Quebec, 19–20, 1993 and also in W. B. Kleijn, “Method and Apparatus for Smoothing Pitch-Cycle Waveforms”, U.S. Pat. No. 5,267,317, Nov. 30, 1993).
Post-filters have also been used in association with the well-known sinusoidal coders and waveform-interpolation coders. In these coders, the post-filtering is generally associated only with the spectral envelope. This is natural, since the particular structure of these coders generally means that noise located in the local spectral valleys contributes little to the perceived distortion. Instead, most of the perceived distortion results from noise located in the global spectral valleys. Descriptions of these post-filtering methods can be found in R. J. McAulay and T. F. Quatieri, “Sinusoidal Coding”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Elsevier, Amsterdam, 175–208, 1995, and W. B. Kleijn and J. Haagen, “Waveform interpolation for speech coding and synthesis”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Elsevier, Amsterdam, 175–208, 1995, respectively.
In the appended figures, similar components and/or features may have the same reference label.