The spectral subtraction method (hereinafter referred to as the “SS method”), the Wiener filtering method, the minimum mean-squared error (MMSE) method and the like have heretofore been known as techniques for suppressing noise components in an observed signal, that is, a speech signal on which noise is superimposed.
The SS method presupposes stationary noise. The SS method is designed to learn the average power of the noise components for each frequency in a noise section, which is a non-speech section, and to subtract that average noise power from the power of the observed signal in a speech section for each frequency (see Non-patent Literature 1, for example). When the subtraction is done, the average power of the noise components is normally multiplied by an over-subtraction weight in a range of 1.0 to 4.0. In addition, processing called “flooring” is performed together with the subtraction: when the result of the subtraction drops below the power of the original signal (the observed signal before subtraction) multiplied by a “flooring” coefficient, which is normally in a range of 0.01 to 0.5, the result of the subtraction is replaced with that product.
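The subtraction and flooring described above can be illustrated by the following minimal sketch. It is not taken from Non-patent Literature 1; the function names and the parameter names alpha (subtraction weight) and beta (flooring coefficient) are hypothetical, and each frame is assumed to be given as a list of per-frequency power values.

```python
from typing import List


def estimate_noise_power(noise_frames: List[List[float]]) -> List[float]:
    """Average the power of each frequency bin over frames of a non-speech section."""
    n_frames = len(noise_frames)
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n_frames
            for k in range(n_bins)]


def spectral_subtract(observed_power: List[float],
                      noise_power: List[float],
                      alpha: float = 2.0,   # subtraction weight (assumed 1.0 to 4.0)
                      beta: float = 0.1     # flooring coefficient (assumed 0.01 to 0.5)
                      ) -> List[float]:
    """Subtract weighted noise power per bin, then apply flooring."""
    output = []
    for x, n in zip(observed_power, noise_power):
        diff = x - alpha * n      # weighted subtraction
        floor = beta * x          # floor derived from the observed power
        output.append(diff if diff >= floor else floor)
    return output


noise = estimate_noise_power([[1.0, 2.0], [3.0, 2.0]])
cleaned = spectral_subtract([10.0, 3.0], noise, alpha=2.0, beta=0.1)
```

In this example the first bin survives the subtraction, whereas the second bin would become negative and is therefore replaced with the floored value.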
If a larger subtraction weight is introduced, “musical” noise is reduced. However, loss of information and speech distortion in a speech section become conspicuous. For this reason, a larger flooring coefficient is needed to compensate for the loss of information and the speech distortion. Nevertheless, if a larger flooring coefficient is introduced, the power of the noise signal is not reduced sufficiently. Therefore, if there were a measure that inhibited musical noise from being produced even when a small subtraction weight in a range of 1.0 to 1.5 is used, the loss of speech and the speech distortion caused by the subtraction could be suppressed to a minimum, and at the same time a smaller flooring coefficient in a range of 0.01 to 0.1 could be introduced. Accordingly, the power of the noise signal could be reduced sufficiently.
The following literature is considered:
[Non-patent Literature 1] S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. on ASSP, Vol. ASSP-27, pp. 113-120, April 1979
[Non-patent Literature 2] P. Lockwood, J. Boudy, “Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and Projection, For Robust Recognition in Car,” Speech Commun., Vol. 11, pp. 215-228, June 1992
[Non-patent Literature 3] J. A. Nolazco Flores, S. J. Young, “Continuous Speech Recognition in Noise Using Spectral Subtraction and HMM Adaptation,” Proc. of ICASSP, 1994, Vol. I, pp. 409-412
[Non-patent Literature 4] Gary Whipple, “Low Residual Noise Speech Enhancement Utilizing Time-Frequency Filtering,” ICASSP-94
[Non-patent Literature 5] Y. Ephraim, D. Malah, “Speech Enhancement Using a Minimum Mean-Squared Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. on ASSP, Vol. ASSP-32, pp. 1109-1121
The SS method has a number of derivative methods. Among them are the non-linear spectral subtraction (NSS) method, which is designed to adjust only the subtraction weight for each frequency in response to the signal-to-noise ratio (SNR) (see Non-patent Literature 2, for example), and the continuous spectral subtraction (CSS) method, which is designed to subtract a local average power in real time without discriminating between a noise section and a speech section (see Non-patent Literature 3, for example). These methods, however, still produce musical noise, although at a lower level.
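The two derivative ideas can be sketched as follows. This is only an illustration under stated assumptions and does not reproduce the formulations of Non-patent Literature 2 or 3: the linear mapping from SNR to subtraction weight, the use of the observed-to-noise power ratio as the SNR, the exponential running average, and all names and default values are hypothetical.

```python
import math
from typing import List


def snr_dependent_weight(observed: float, noise: float,
                         alpha_min: float = 1.0, alpha_max: float = 4.0,
                         snr_low_db: float = -5.0,
                         snr_high_db: float = 20.0) -> float:
    """NSS-style idea: larger subtraction weight at low SNR, smaller at high SNR."""
    eps = 1e-12  # guards against log of zero and division by zero
    snr_db = 10.0 * math.log10(max(observed, eps) / max(noise, eps))
    t = (snr_db - snr_low_db) / (snr_high_db - snr_low_db)
    t = min(1.0, max(0.0, t))  # clamp to [0, 1]
    return alpha_max - t * (alpha_max - alpha_min)


def continuous_subtract(frames: List[List[float]], smoothing: float = 0.9,
                        alpha: float = 1.0, beta: float = 0.1) -> List[List[float]]:
    """CSS-style idea: subtract a running local average power from every frame,
    with no distinction between noise sections and speech sections."""
    if not frames:
        return []
    running = list(frames[0])  # initial local average (assumption)
    output = []
    for frame in frames:
        running = [smoothing * r + (1.0 - smoothing) * x
                   for r, x in zip(running, frame)]
        output.append([max(x - alpha * r, beta * x)
                       for x, r in zip(frame, running)])
    return output


weight_high_snr = snr_dependent_weight(100.0, 1.0)
weight_low_snr = snr_dependent_weight(1.0, 1.0)
css_out = continuous_subtract([[1.0], [1.0], [10.0]], smoothing=0.5)
```

Both functions still end in a subtraction followed by flooring, which is consistent with the observation above that musical noise is lowered but not eliminated.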
A post-processing method has been proposed in which the output obtained after processing by the SS method is observed, and musical noise and the like are reduced if found. Specifically, the power of the spectrum is observed in a coordinate system constituted of a time axis and a frequency axis, and a portion which looks like an isolated island is erased (see Non-patent Literature 4, for example) or reduced by median filtering. In addition, there is a spectral smoothing method for smoothing powers over several neighboring frames. However, these methods have their own limits, and their performance in reducing musical noise is insufficient.
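Median filtering and smoothing over neighboring frames can be sketched as below. The sketch assumes that the spectrogram is a list of frames, each a list of per-frequency power values, and that filtering is applied along the time axis only; the function names and the window width of three frames are hypothetical and are not taken from Non-patent Literature 4.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def _filter_time(spectrogram: List[List[float]], width: int,
                 reducer: Callable[[Sequence[float]], float]) -> List[List[float]]:
    """Apply reducer to a sliding window of frames, separately for each bin."""
    half = width // 2
    n_frames = len(spectrogram)
    output = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)  # clip at the edges
        n_bins = len(spectrogram[t])
        output.append([reducer([spectrogram[u][k] for u in range(lo, hi)])
                       for k in range(n_bins)])
    return output


def median_filter_time(spectrogram: List[List[float]], width: int = 3) -> List[List[float]]:
    """Replace each value with the median of its temporal neighborhood."""
    return _filter_time(spectrogram, width, median)


def smooth_time(spectrogram: List[List[float]], width: int = 3) -> List[List[float]]:
    """Replace each value with the mean of its temporal neighborhood."""
    return _filter_time(spectrogram, width, mean)


# One frequency bin with a single isolated peak in the third frame.
spec = [[1.0], [1.0], [9.0], [1.0], [1.0]]
med = median_filter_time(spec)
avg = smooth_time(spec)
```

In this example the median filter removes the isolated peak entirely, while the averaging only spreads it over the neighboring frames.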
To begin with, musical noise results from “subtraction” processing. It is assumed that musical noise is not produced if the speech signal obtained after reducing the noise components is produced by “multiplication” instead of “subtraction.”
The Wiener filtering method is designed to estimate a clean speech by some means, and to define the transfer function of the Wiener filter so that its output agrees with the estimated clean speech. Since the clean speech is inherently unknown, an estimated value of the speech is used. The properties of the resulting Wiener filter therefore vary to a large extent depending upon the means used to obtain the estimate. Generally speaking, even when this method is employed, it is difficult to reconcile reduction of residual noise with minimization of speech distortion.
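A multiplicative gain of the Wiener type can be sketched as follows. The gain formula S/(S+N) is the standard frequency-domain Wiener gain; the clean-speech estimate used here (the observed power minus the noise power, clipped at zero) is only one assumed choice among many, and the function names are hypothetical.

```python
from typing import List


def wiener_gain(clean_power_est: List[float], noise_power: List[float]) -> List[float]:
    """Per-bin gain S / (S + N), where S is an estimated clean-speech power."""
    return [s / (s + n) if (s + n) > 0.0 else 0.0
            for s, n in zip(clean_power_est, noise_power)]


def wiener_filter_frame(spectrum: List[complex], noise_power: List[float]) -> List[complex]:
    """Multiply each complex spectral component by its Wiener-type gain."""
    observed_power = [abs(x) ** 2 for x in spectrum]
    # Assumed estimator of the clean-speech power: simple clipped difference.
    clean_est = [max(p - n, 0.0) for p, n in zip(observed_power, noise_power)]
    gains = wiener_gain(clean_est, noise_power)
    return [g * x for g, x in zip(gains, spectrum)]


filtered = wiener_filter_frame([3 + 4j, 1 + 0j], [5.0, 2.0])
```

Replacing the assumed estimator of the clean-speech power changes the gains directly, which illustrates why the behavior of the filter depends so strongly on how the estimate is obtained.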
The MMSE method is designed to adjust a multiplication coefficient for each frequency by a minimum mean-squared error criterion on the presumption that the noise and the speech each follow independent power distributions (see Non-patent Literature 5, for example). Since multiplication is done, musical noise is not produced. However, speech processed by the MMSE method has a large amount of speech distortion. This speech distortion is conspicuous particularly when it is measured in the widely used mel-cepstral representation. For this reason, the MMSE method is not well suited to speech recognition.
It is desirable to achieve clear speech in a severe noise environment, such as an emergency telephone call made on a highway. In addition, a speech enhancement technique offering higher articulation has been awaited in the field of hearing aids for people with hearing impairment.
An SS method, which is designed to subtract an average spectrum of noise components from an observed signal, is effective for reducing noise components in an observed signal in which stationary noise is superimposed on speech. However, a conventional SS method cannot avoid producing offensive musical noise as a side effect.
In other words, in the present framework of the SS method, clarity of speech and performance in speech recognition cannot be made compatible with each other. For the purpose of suppressing speech distortion to a minimum level, it is desirable to introduce a smaller subtraction weight. When the subtraction weight is set smaller, however, a large amount of noise remains unsubtracted, thus deteriorating performance in speech recognition in a noisy environment. For the purpose of lowering the overall noise power, including the noise power in non-speech sections, it is desirable to introduce a smaller flooring coefficient. When the flooring coefficient is set smaller, however, musical noise becomes conspicuous, thus causing recognition errors, particularly for short words. Consequently, if priority is given to enhancing performance in speech recognition, the clarity of the speech as perceived by a listener may be sacrificed in some cases.
For the same reason, in a conventional SS method, performance in speech recognition based on the observed signal obtained after noise reduction is susceptible to the two parameters, namely the subtraction weight and the flooring coefficient. The optimal parameter values vary depending upon the quantity (S/N) and quality of the noise, and further upon the speech recognition task. For this reason, the optimal parameter values are difficult to obtain in an actual environment. To achieve more robust speech recognition, a noise reduction method which is not sensitive to variation of these parameters has been awaited.