Sensorineural hearing loss is caused by degeneration of the sensory hair cells of the inner ear or the auditory nerve. Persons with such loss experience severe difficulty in speech perception in noisy environments. Suppression of wide-band non-stationary background noise as part of the signal processing in hearing aids and other speech communication devices can serve as a practical solution for improving speech quality and intelligibility for persons with sensorineural or mixed hearing loss. Many signal processing techniques developed for improving speech perception require a noise-free speech signal as the input, and these techniques can benefit from noise suppression as a pre-processing stage. Noise suppression can also be used for improving the performance of speech codecs, speech recognition systems, and speaker recognition systems under noisy conditions.
For implementing the noise suppression on a low-power processor in a hearing aid or a communication device, the technique should have low algorithmic delay and low computational complexity. Spectral subtraction (M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” Proc. IEEE ICASSP 1979, pp. 208-211; S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, 1979) can be used as a single-input speech enhancement technique for this application. A large number of variations of the basic technique have been developed for use in audio codecs and speech recognition (P. C. Loizou, “Speech Enhancement: Theory and Practice,” CRC Press, 2007). The processing steps are segmentation and spectral analysis, estimation of the noise spectrum, calculation of the enhanced magnitude spectrum, and re-synthesis of the speech signal. Due to the non-stationary nature of the interfering noise, its spectrum needs to be dynamically estimated. Under-estimation of the noise results in residual noise, and over-estimation results in distortion, leading to degraded quality and reduced intelligibility. Noise can be estimated during the silence intervals identified by a voice activity detector, but the detection may not be satisfactory under low SNR conditions and the method may not correctly track the noise spectrum during long speech segments.
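The four processing steps listed above can be illustrated with a minimal sketch. It is a generic magnitude-domain variant and not the exact formulation of any of the cited works: the function name, the frame length, the over-subtraction factor `alpha`, and the spectral floor `beta` are all illustrative assumptions, and the noise magnitude spectrum is taken as a given input rather than estimated dynamically.

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, frame_len=256, alpha=2.0, beta=0.01):
    """Magnitude spectral subtraction with 50%-overlap analysis-synthesis.

    noisy     : 1-D array of noisy speech samples
    noise_mag : estimated noise magnitude spectrum, length frame_len // 2 + 1
    alpha     : over-subtraction factor (hypothetical value)
    beta      : spectral floor factor (hypothetical value)
    """
    hop = frame_len // 2
    # Periodic Hann window: overlapping copies sum to 1 at 50% overlap.
    window = np.hanning(frame_len + 1)[:-1]
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        # 1. Segmentation and spectral analysis
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # 2. The noise spectrum is supplied by the caller as noise_mag.
        # 3. Enhanced magnitude: subtract scaled noise, keep a spectral floor
        clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
        # 4. Re-synthesis with the noisy phase, followed by overlap-add
        enhanced = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len)
        out[start:start + frame_len] += enhanced
    return out
```

With too small a `noise_mag` the subtraction leaves residual noise, and with too large a value most bins fall to the floor `beta * mag`, which corresponds to the under-estimation and over-estimation trade-off described above.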
Several techniques based on minimum statistics for estimating the noise spectrum, without voice activity detection, have been reported (R. Martin, “Spectral subtraction based on minimum statistics,” Proc. EUSIPCO 1994, pp. 1182-1185; I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466-475, 2003; G. Doblinger, “Computationally efficient speech enhancement by spectral minima tracking in subbands,” Proc. EUROSPEECH 1995, pp. 1513-1516). These techniques involve tracking the noise as minima of the magnitude spectra of the past frames and are suitable for real-time operation. However, they often underestimate the noise and need estimation of an SNR-dependent subtraction factor. In the absence of significant silence segments, processing may remove some parts of the speech signal during the weaker speech segments. Stahl et al. (V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” Proc. IEEE ICASSP 2000, pp. 1875-1878) reported that a quantile-based estimation of the noise spectrum from the spectrum of the noisy speech can be used for spectral subtraction based noise suppression. It is based on the observation that the signal energy in a particular frequency bin is low in most of the frames and high only in 10-20% of the frames, corresponding to voiced speech segments. For improving word accuracy in a speech recognition task, a time-frequency quantile based noise estimation was reported by Evans and Mason (N. W. Evans and J. S. Mason, “Time-frequency quantile-based noise estimation,” Proc. EUSIPCO 2002, pp. 539-542). These quantile-based noise estimation techniques use quantiles obtained by ordering the spectral samples or from dynamically generated histograms.
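As a rough illustration of quantile estimation by ordering the spectral samples, the following sketch sorts a stored block of magnitude spectra along the time axis and picks the requested quantile in each frequency bin. The function name, the array layout, and the simple index rule are assumptions made for this example and are not taken from the cited papers.

```python
import numpy as np

def quantile_noise_estimate(mag_frames, q=0.5):
    """Per-bin q-quantile of stored magnitude spectra.

    mag_frames : array of shape (num_frames, num_bins), one row per past frame
    q          : quantile in [0, 1]; 0.5 gives the median
    """
    mag_frames = np.asarray(mag_frames, dtype=float)
    # Sort each frequency bin independently along the time axis.
    sorted_frames = np.sort(mag_frames, axis=0)
    # Pick the sample at the requested rank (simple floor rule, an assumption).
    idx = int(q * (mag_frames.shape[0] - 1))
    return sorted_frames[idx]
```

The sketch needs the whole `num_frames * num_bins` buffer in memory and a sort per frequency bin for every update, which makes the storage and computation cost of this approach easy to see.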
Due to the large memory space required for storing the spectral samples and the high computational complexity, these quantile-based techniques are not suited for use in hearing aids and communication devices. Use of the median, i.e. the 0.5-quantile, considerably reduces the computation requirement, but still does not permit real-time implementation. Waddi et al. (S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” Proc. NCC 2013, paper no. 1569696063) used a cascaded-median as an approximation to the median for real-time implementation of speech enhancement. The improvements in speech quality were found to be different for different types of noises, indicating the need for using frequency-bin dependent quantiles for suppression of non-white and non-stationary noises.
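One plausible reading of a cascaded median is a median of medians taken over small groups of frames. The sketch below follows that reading only; the group size, the number of stages, and the batch-style interface are hypothetical choices and should not be taken as the configuration used by Waddi et al.

```python
import numpy as np

def cascaded_median(mag_frames, group=3, stages=2):
    """Approximate the per-bin median over group**stages frames.

    mag_frames : array of shape (num_frames, num_bins), most recent frame last
    group      : number of values combined by each median (hypothetical value)
    stages     : number of cascaded median stages (hypothetical value)
    """
    needed = group ** stages
    data = np.asarray(mag_frames, dtype=float)[-needed:]
    if data.shape[0] != needed:
        raise ValueError("need at least group**stages frames")
    for _ in range(stages):
        # Median of each consecutive group of `group` rows, per frequency bin.
        data = np.median(data.reshape(-1, group, data.shape[-1]), axis=1)
    return data[0]
```

For clarity the sketch operates on a stored block. A streaming version would only need to hold `group` values per stage and per frequency bin, which is where the saving in memory and sorting effort over a full median would come from.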
Kazama et al. (M. Kazama, M. Tohyama, and T. Hirai, “Current noise spectrum estimation method and apparatus with correlation between previous noise and current noise signal,” U.S. Pat. No. 7,596,495 B2, 2009) have disclosed a method for updating the noise spectrum based on the correlation between the envelope of previously estimated noise spectrum and the envelope of the current spectrum of the input. It has high computational complexity due to the need for calculating the spectral envelopes and the correlation. As all the spectral samples of the noise are updated using a single mixing ratio, the method may not be effective in suppressing non-stationary non-white noises.
In a noise suppression method disclosed by Schmidt et al. (G. U. Schmidt, T. Wolff, and M. Buck, “System for speech signal enhancement in a noisy environment through corrective adjustment of spectral noise power density estimations,” U.S. Pat. No. 8,364,479 B2, 2013), the noise spectrum is estimated using a moving average and minimum statistics, and a frequency-dependent correction factor is obtained using the variance of the relative spectral noise power density estimation error, the estimated noise spectrum, and the input spectrum. The relative spectral noise power density estimation error is calculated during non-speech frames, whose identification requires a voice activity detector, and the minimum statistics based noise estimation requires an SNR-dependent subtraction factor, leading to increased computational complexity.
In a method for estimating noise spectrum using quantile-based noise estimation, disclosed by Jabloun (F. Jabloun, “Quantile based noise estimation,” UK patent No. GB 2426167 A, 2006), spectra of a fixed number of past input frames are stored in a buffer and sorted using a fast sorting algorithm for obtaining the specified quantile value for each spectral sample. A recursive smoothing is applied on the quantile-estimated noise spectrum, using a smoothing parameter calculated from the estimated frequency-dependent SNR. Although the method does not need a voice activity detector, it requires a large memory for buffering the spectra. For reducing the high computational complexity due to sorting operations, the quantile computations are restricted to a small number of frequency samples and the noise spectrum is obtained using interpolation, restricting the effectiveness of the method in case of non-stationary non-white noise.
Nakajima et al. (H. Nakajima, K. Nakadai, and Y. Hasegawa, “Noise power estimation system, noise power estimating method, speech recognition system and speech recognizing method,” U.S. Pat. No. 8,666,737 B2, 2014) have described a method for estimating the noise spectrum using a cumulative histogram for each spectral sample which is updated at each analysis window using a time decay parameter. Although the method does not require large memory for buffering the spectra, it has high computational complexity and the estimated quantile values can have large errors in case of non-stationary noise.
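The general idea of a per-bin histogram with a time decay can be sketched as follows. This is a generic illustration and not the patented method: the class name, the fixed level boundaries `edges`, the decay value, and the choice of returning level centers are all assumptions made for the example.

```python
import numpy as np

class DecayingHistogramQuantile:
    """Per-frequency-bin quantile tracking with an exponentially decaying histogram."""

    def __init__(self, num_bins, edges, decay=0.99, q=0.5):
        self.edges = np.asarray(edges, dtype=float)   # ascending level boundaries
        self.centers = 0.5 * (self.edges[:-1] + self.edges[1:])
        self.hist = np.zeros((num_bins, len(self.centers)))
        self.rows = np.arange(num_bins)
        self.decay = decay                            # hypothetical forgetting factor
        self.q = q

    def update(self, mag):
        """Add one magnitude spectrum (length num_bins); return the quantile estimate."""
        # Map each spectral sample to a histogram level, clipping out-of-range values.
        level = np.searchsorted(self.edges, mag, side="right") - 1
        level = np.clip(level, 0, self.hist.shape[1] - 1)
        # Forget old frames gradually, then count the current frame.
        self.hist *= self.decay
        self.hist[self.rows, level] += 1.0
        # Read the quantile off the cumulative histogram of each frequency bin.
        cum = np.cumsum(self.hist, axis=1)
        target = self.q * cum[:, -1]
        idx = np.argmax(cum >= target[:, None], axis=1)
        return self.centers[idx]
```

No spectra are buffered, but every frame costs a decay, an increment, and a cumulative sum over all levels for every frequency bin, and the resolution of the estimate is limited by the level spacing, which gives a feel for the computation and accuracy trade-off noted above.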
Thus, for noise signal suppression in speech signals in hearing aids and speech communication devices, there is a need to mitigate the disadvantages associated with the methods and systems described above. Particularly, there is a need for noise signal suppression that does not involve voice activity detection and does not need large memory or high computational complexity.