1. Field of the Invention
The present invention relates to signal processing for a signal such as a speech signal.
2. Description of the Related Art
In many digital signal processing (DSP) systems, an input signal is processed by fast Fourier transform (FFT), or a similar operation, to yield a frequency-domain representation of the signal. In the case of the FFT, this representation is a vector of complex values; squaring and adding the real and imaginary parts of each value yields a vector of real values known as the periodogram. The periodogram is sometimes referred to as the PSD (Power Spectral Density), and the term PSD is used here for brevity. The PSD is a useful representation because, if the signal is assumed to be the sum of two independent signals, the PSD of the signal is approximately the sum of the PSDs of the two independent signals.
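As an illustration only, the periodogram calculation described above can be sketched as follows. The function names `dft` and `periodogram` and the four-sample test frame are assumptions made for this example; a naive discrete Fourier transform stands in for the FFT (it gives the same values, only more slowly), and any normalization of the result is omitted.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of a list of real samples."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def periodogram(frame):
    """Square and add the real and imaginary parts of each frequency bin."""
    return [c.real ** 2 + c.imag ** 2 for c in dft(frame)]


# Hypothetical frame: a cosine sampled at four points.
frame = [1.0, 0.0, -1.0, 0.0]
psd = periodogram(frame)
```

For this frame the energy falls in bins 1 and 3, and the other bins are zero to within rounding error.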
In audio DSP, the input signal often consists of two signals: a speech signal being a representation of the sound of a person speaking, and a noise signal being circuit noise generated by an electronic circuit, or background noise from machinery, vehicles or the like. Two distinct applications depend on the ability to remove the noise signal from the total signal to give a clean speech signal:
Automatic Speech Recognition (ASR)—the goal of ASR is to recognize the sounds spoken by a user and perform some action based on those sounds. The action may be to transcribe the speech or to operate a machine based on commands spoken. ASR systems are usually only receptive to clean speech. If noise-corrupted speech is applied to an ASR system, the performance decreases drastically.
Speech Enhancement—the goal of speech enhancement is to produce a clean, audible, speech signal given a noisy speech signal. For instance, if one user speaking into a telephone is standing near a noisy machine, a second user listening on the other telephone hears both the first user and the machine. The second user would prefer to hear just the first user without the machine; this can be achieved by the speech enhancement.
In the above example applications, a procedure known as Spectral Subtraction (SS) is often used to remove noise from a signal. The basic premise is that, as the speech and noise PSDs are additive, the speech can be recovered by simply subtracting an estimate of the noise.
A typical SS procedure is as follows, and is also illustrated in FIG. 1. Note that FIG. 1 is a block diagram that shows the construction of a pre-processing part of speech recognition processing including SS.
A Hartley transformation unit 16 inputs a signal divided into overlapping frames, and transforms the input signal into information in a frequency domain. A periodogram calculator 17 calculates a PSD of the input signal.
A noise estimation unit 32 calculates an average noise PSD over several frames during a period of silence, when the person is not speaking and only the noise is present.
A spectral subtraction (SS) unit 33 subtracts the average noise PSD from the calculated PSD for each frame to obtain a de-noised or clean speech PSD.
In the case of ASR, the clean speech PSD is then filtered using a mel-scaled filter 18 to produce a PSD vector that is shorter than the original PSD. The logarithm of the mel-scaled PSD is then calculated by a logarithm calculator 19 before being further processed for use as a feature for a pattern recognition algorithm such as a Hidden Markov Model (HMM).
In the case of enhancement, the de-noised speech PSD is combined with the noise PSD to form, for example, a Wiener filter. The Wiener filter is then used to weight the complex FFT result, which is then inverted using the IFFT (Inverse FFT). Finally, an overlap-and-add process is applied to give a reconstructed audio signal.
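The noise estimation and subtraction steps of the procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of units 32 and 33: the function names and the three-bin PSD values are hypothetical, and the mel filtering, logarithm and Wiener filtering stages are left out.

```python
def average_noise_psd(noise_frames):
    """Average the PSD bin by bin over frames taken during silence."""
    count = len(noise_frames)
    return [sum(bins) / count for bins in zip(*noise_frames)]


def spectral_subtract(psd, noise_psd):
    """Subtract the average noise PSD from the PSD of one frame."""
    return [p - n for p, n in zip(psd, noise_psd)]


# Hypothetical PSDs of two silent frames (noise only).
noise_frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
noise_psd = average_noise_psd(noise_frames)

# Hypothetical PSD of one frame containing speech plus noise.
noisy_psd = [10.0, 2.5, 1.5]
clean_psd = spectral_subtract(noisy_psd, noise_psd)
```

With these example values the last bin of `clean_psd` is negative, because the noise in that frame happens to lie below its estimated average.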
The main problem with the above process is that the noise estimation unit 32 and the SS unit 33 are imperfect. In the case of noise estimation, the estimate is calculated from a finite number of PSD frames. If only a small number of frames is available for noise calculation, the estimate is unlikely to be accurate. This in turn adds to the second, otherwise independent, problem:
As the PSD has random variation, the SS process can sometimes give a clean speech PSD result that is zero or negative. As all PSD values must be positive (by definition), some correction is required. Simply flooring negative PSD values to zero is known not to work well. In the ASR case, a subsequent operation is a logarithm that causes near-zero values to approach minus infinity—well out of the normal range for such features. In enhancement, the small values lead to the phenomenon of musical noise—tones resembling music introduced into the signal.
Two distinct solutions to the zero PSD problem are commonly used:
Flooring—in ASR, the result of SS is not allowed to fall below a flooring value, normally a scaled version of the PSD before SS.
Temporal Filtering—in enhancement, the SS value is floored at zero, but is then filtered temporally such that the final value is a linear combination of the raw SS and the result from the previous frame. The applicant has found such filtering not to be beneficial for ASR.
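The two corrections above can be sketched as follows. This is an illustrative sketch only: the function names, the scale factor `beta`, the filter weight `alpha` and all numeric values are assumptions chosen for the example and are not taken from any cited system.

```python
import math


def floor_to_scaled_input(ss_psd, noisy_psd, beta=0.5):
    """Flooring: keep each SS result at or above beta times the PSD before SS."""
    return [max(s, beta * p) for s, p in zip(ss_psd, noisy_psd)]


def temporal_filter(ss_psd, previous, alpha=0.5):
    """Temporal filtering: floor at zero, then mix with the previous frame's result."""
    return [
        alpha * max(s, 0.0) + (1.0 - alpha) * prev
        for s, prev in zip(ss_psd, previous)
    ]


# Hypothetical values: PSD before SS, raw SS result, previous filtered frame.
noisy_psd = [10.0, 2.5, 1.5]
ss_psd = [8.0, 0.5, -0.5]
previous = [6.0, 1.5, 1.0]

floored = floor_to_scaled_input(ss_psd, noisy_psd)   # ASR-style correction
smoothed = temporal_filter(ss_psd, previous)         # enhancement-style correction

# After flooring, every bin is strictly positive, so the logarithm is finite.
log_features = [math.log(v) for v in floored]
```

In this example the negative bin is replaced by 0.75 under flooring and by 0.5 under temporal filtering, so neither path hands a zero or negative value to later stages.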
The concepts of speech enhancement, Wiener filtering and spectral subtraction are well known in the art and are described in the book “Discrete Time Speech Signal Processing” by Quatieri, ISBN 0-13-242942-X.
The concepts of ASR and mel filtering are well known in the art and are described in the book “Fundamentals of Speech Recognition” by Rabiner and Juang, ISBN 0-13-015157-2.
Kalman filtering is well known in the art and is described in the book “Statistical Signal Processing—Detection, Estimation and Time Series Analysis” by Scharf, ISBN 0-201-19038-9.
Temporal smoothing of spectral bins is well known in the art and is described in the paper “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator” by Ephraim and Malah in IEEE Transactions on Acoustics Speech and Signal Processing, volume 32, no. 6, pages 1109 to 1121.
Brumitt (U.S. Pat. No. 6,931,292) describes an enhancement technique that uses both temporal and transversal (frequency) smoothing. The transversal smoothing is an FIR filter rather than a recursive filter, and the coefficients are fixed rather than dependent on the position in the PSD.
Fingscheidt (WO 02095732 and ICASSP 2005 volume I page 1081) also describes a spectral filter that depends upon adjacent spectral bins. However, the coefficients do not depend on the position in the PSD. The spectral filter in this case is also temporal, whereas the invention strives to avoid temporal filtering of the PSD.
Cheng and Agarwal (US Application 20030018471) describe a state-of-the-art noise removal system for ASR. The system uses techniques similar to those in the invention, as well as additional ones such as Wiener filtering. It does not, however, incorporate a Kalman-like recursive filter, and is substantially more computationally complex.