There is a computer program listing of a software appendix which has been submitted in two (2) identical copies to the U.S. Patent and Trademark Office on CD-ROM, the contents of which are hereby incorporated by reference. These CD-ROM copies, created in October, 2002, contain the following files (in alphabetical order):
This invention relates to enhancement processing for speech coding (i.e., speech compression) systems, including low bit-rate speech coding systems such as MELP.
Low bit-rate speech coders, such as parametric speech coders, have improved significantly in recent years. However, low-bit rate coders still suffer from a lack of robustness in harsh acoustic environments. For example, artifacts introduced by low bit-rate parametric coders in medium and low signal-to-noise ratio (SNR) conditions can affect intelligibility of coded speech.
Tests show that significant improvements in coded speech can be made when a low bit-rate speech coder is combined with a speech enhancement preprocessor. Such enhancement preprocessors typically have three main components: a spectral analysis/synthesis system (usually realized by a windowed fast Fourier transform/inverse fast Fourier transform (FFT/IFFT), a noise estimation process, and a spectral gain computation. The noise estimation process typically involves some type of voice activity detection or spectral minimum tracking technique. The computed spectral gain is applied only to the Fourier magnitudes of each data frame (i.e., segment) of a speech signal. An example of a speech enhancement preprocessor is provided in Y. Ephraim et al., xe2x80x9cSpeech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator,xe2x80x9d IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 33, pp. 443-445, April 1985, which is hereby incorporated by reference in its entirety. As is conventional, the spectral gain comprises individual gain values to be applied to the individual subbands output by the FFT process.
A speech signal may be viewed as representing periods of articulated speech (that is, periods of xe2x80x9cspeech activityxe2x80x9d) and speech pauses. A pause in articulated speech results in the speech signal representing background noise only, while a period of speech activity results in the speech signal representing both articulated speech and background noise. Enhancement preprocessors function to apply a relatively low gain during periods of speech pauses (since it is desirable to attenuate noise) and a higher gain during periods of speech (to lessen the attenuation of what has been articulated). However, switching from a low to a high gain value to reflect, for example, the onset of speech activity after a pause, and vice-versa, can result in structured xe2x80x9cmusicalxe2x80x9d (or xe2x80x9ctonalxe2x80x9d) noise artifacts which are displeasing to the listener. In addition, enhancement preprocessors themselves can introduce degradations in speech intelligibility as can speech coders used with such preprocessors.
To address the problem of structured musical noise, some enhancement preprocessors uniformly limit the gain values applied to all data frames of the speech signal. Typically, this is done by limiting an xe2x80x9ca priorixe2x80x9d signal to noise ratio (SNR) which is a functional input to the computation of the gain. This limitation on gain prevents the gain applied in certain data frames (such as data frames corresponding to speech pauses) from dropping too low and contributing to significant changes in gain between data frames (and thus, structured musical noise). However, this limitation on gain does not adequately ameliorate the intelligibility problem introduced by the enhancement preprocessor or the speech coder.
The present invention overcomes the problems of the prior art to both limit structured musical noise and increase speech intelligibility. In the context of an enhancement preprocessor, an illustrative embodiment of the invention makes a determination of whether the speech signal to be processed represents articulated speech or a speech pause and forms a unique gain to be applied to the speech signal. The gain is unique in this context because the lowest value the gain may assume (i.e., its lower limit) is determined based on whether the speech signal is known to represent articulated speech or not. In accordance with this embodiment, the lower limit of the gain during periods of speech pause is constrained to be higher than the lower limit of the gain during periods of speech activity.
In the context of this embodiment, the gain that is applied to a data frame of the speech signal is adaptively limited based on limited a priori SNR values. These a priori SNR values are limited based on (a) whether articulated speech is detected in the frame and (b) a long term SNR for frames representing speech. A voice activity detector can be used to distinguish between frames containing articulated speech and frames that contain speech pauses. Thus, the lower limit of a priori SNR values may be computed to be a first value for a frame representing articulated speech and a different second value, greater than the first value, for a frame representing a speech pause. Smoothing of the lower limit of the a priori SNR values is performed using a first order recursive system to provide smooth transitions between active speech and speech pause segments of the signal.
An embodiment of the invention may also provide for reduced delay of coded speech data that can be caused by the enhancement preprocessor in combination with a speech coder. Delay of the enhancement preprocessor and coder can be reduced by having the coder operate, at least partially, on incomplete data samples to extract at least some coder parameters. The total delay imposed by the preprocessor and coder is usually equal to the sum of the delay of the coder and the length of overlapping portions of frames in the enhancement preprocessor. However, the invention takes advantage of the fact that some coders store xe2x80x9clook-aheadxe2x80x9d data samples in an input buffer and use these samples to extract coder parameters. The look-ahead samples typically have less influence on the quality of coded speech than other samples in the input buffer. Thus, in some cases, the coder does not need to wait for a fully processed, i.e., complete, data frame from the preprocessor, but instead can extract coder parameters from incomplete data samples in the input buffer. By operating on incomplete data samples, delay of the enhancement preprocessor and coder can be reduced without significantly affecting the quality of the coded data.
For example, delay in a speech preprocessor and speech coder combination can be reduced by multiplying an input frame by an analysis window and enhancing the frame in the enhancement preprocessor. After the frame is enhanced, the left half of the frame is multiplied by a synthesis window and the right half is multiplied by an inverse analysis window. The synthesis window can be different from the analysis window, but preferably is the same as the analysis window. The frame is then added to the speech coder input buffer, and coder parameters are extracted using the frame. After coder parameters are extracted, the right half of the frame in the speech coder input buffer is multiplied by the analysis and the synthesis window, and the frame is shifted in the input buffer before the next frame is input. The analysis windows, and synthesis window used to process the frame in the coder input buffer can be the same as the analysis and synthesis windows used in the enhancement preprocessor, or can be slightly different, e.g., the square root of the analysis window used in the preprocessor. Thus, the delay imposed by the preprocessor can be reduced to a very small level, e.g., 1-2 milliseconds.
These and other aspects of the invention will be appreciated and/or obvious in view of the following description of the invention.