Substantially real-time enhancement of speech in hearing aids is a challenging task due to, e.g., a large diversity and variability in interfering noise, a highly dynamic operating environment, real-time requirements, and severely restricted memory, power and MIPS in the hearing instrument. In particular, the performance of traditional single-channel noise suppression techniques under non-stationary noise conditions is unsatisfactory. One issue is the noise estimation problem, which is known to be particularly difficult for non-stationary noises.
Traditional noise estimation techniques are based on recursive averaging of past noisy spectra, using the blocks that are likely to be noise only. The update of the noise estimate is commonly controlled using a voice-activity detector (VAD), see for example TIA/EIA/IS-127, “Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems”, July 1996.
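The VAD-controlled recursive averaging described above can be illustrated with a minimal sketch. The function name `update_noise_estimate`, the smoothing factor `alpha` and the example values are assumptions made for illustration only and are not taken from the cited standard.

```python
def update_noise_estimate(noise_psd, noisy_psd, speech_active, alpha=0.9):
    """One block of VAD-gated recursive averaging, per frequency bin.

    noise_psd     -- current noise power estimate (list of floats)
    noisy_psd     -- power spectrum of the current noisy block
    speech_active -- VAD decision for the current block
    alpha         -- smoothing factor (hypothetical value)
    """
    if speech_active:
        # Block is likely to contain speech: keep the previous estimate.
        return list(noise_psd)
    # Block is likely noise only: first-order recursive average.
    return [alpha * n + (1.0 - alpha) * y for n, y in zip(noise_psd, noisy_psd)]


# Usage: the estimate moves toward the noisy spectrum only in noise-only blocks.
estimate = [1.0, 1.0]
held = update_noise_estimate(estimate, [3.0, 5.0], speech_active=True, alpha=0.5)
updated = update_noise_estimate(estimate, [3.0, 5.0], speech_active=False, alpha=0.5)
```

Because the estimate is frozen whenever speech is detected, a change in noise level during speech activity is not tracked until the next noise-only block.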
In the article by I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging”, IEEE Trans. Speech and Audio Processing, vol. 11, no. 5, pp. 466-475, September 2003, the update of the noise estimate is conducted on the basis of a speech presence probability estimate.
Other authors have addressed the issue of updating the noise estimate with the help of order statistics, e.g. R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics”, IEEE Trans. Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001, and V. Stahl et al., “Quantile based noise estimation for spectral subtraction and Wiener filtering”, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 3, pp. 1875-1878, June 2000, both of which are hereby incorporated by reference in their entirety.
The methods disclosed in the above-mentioned documents are all based on recursive averaging of past noisy spectra, under the assumption of stationary or weakly non-stationary noise. This averaging inherently limits their noise estimation performance in environments with non-stationary noise. For instance, the method of R. Martin referred to above has an inherent delay of 1.5 seconds before the algorithm reacts to a rapid increase of noise energy. This type of delay occurs, to varying degrees, in all of the above-mentioned methods.
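The origin of such a delay can be illustrated with a deliberately simplified minimum tracker. This sketch is not the method of R. Martin, which additionally uses optimal smoothing and bias compensation; the function name `sliding_minimum`, the window length and the signal values are assumptions chosen for illustration.

```python
from collections import deque


def sliding_minimum(power, window):
    """Running minimum of `power` over the most recent `window` blocks."""
    recent = deque(maxlen=window)  # oldest value is dropped automatically
    track = []
    for p in power:
        recent.append(p)
        track.append(min(recent))
    return track


# Hypothetical noise power that steps from 1.0 to 4.0 at block index 10.
step_index = 10
power = [1.0] * step_index + [4.0] * 20
track = sliding_minimum(power, window=8)

# First block at which the tracked minimum rises above the old noise level.
first_reaction = next(i for i, v in enumerate(track) if v > 1.0)
delay_blocks = first_reaction - step_index
```

In this sketch the tracked minimum cannot rise until every block in the window postdates the step, so the reaction is delayed by `window - 1` blocks. A longer window gives a more reliable minimum during speech activity but a correspondingly slower reaction to a noise increase.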
In recent speech enhancement systems this problem is addressed by using prior knowledge of speech (e.g. Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models”, IEEE Trans. Signal Processing, vol. 40, no. 4, pp. 725-735, April 1992, which is hereby incorporated by reference in its entirety, and Y. Zhao, “Frequency domain maximum likelihood estimation for automatic speech recognition in additive and convolutive noises”, IEEE Trans. Speech and Audio Processing, vol. 8, no. 3, pp. 255-266, May 2000, which is hereby incorporated by reference in its entirety). While the method of Y. Ephraim does not directly improve the noise estimation performance, the use of prior knowledge of speech was shown to improve the speech enhancement performance for the same noise estimation method. The extension in the method by Y. Zhao referred to above allows for estimation of the noise model using prior knowledge of speech. However, the noise considered in the Y. Zhao method was based on a stationary noise model.
In other recent speech enhancement systems this problem is addressed by using prior knowledge of both speech and noise to improve the performance of speech enhancement systems. See for example H. Sameti et al., “HMM-based strategies for enhancement of speech signals embedded in nonstationary noise”, IEEE Trans. Speech and Audio Processing, vol. 6, no. 5, pp. 445-455, September 1998, which is hereby incorporated by reference in its entirety.
In the method of H. Sameti et al. noise gain adaptation is performed in speech pauses longer than 100 ms. As the adaptation is only performed in longer speech pauses, the method is not capable of reacting to fast changes in the noise energy during speech activity. A block diagram of a noise adaptation method is disclosed (in FIG. 5 of the reference), said block diagram comprising a number of hidden Markov models (HMMs). The number of HMMs is fixed, and each of them is trained off-line, i.e. trained in an initial training phase, for different noise types. The method can, thus, only successfully cope with noise level variations as well as different noise types as long as the corrupting noise has been modeled during the training process.
A further drawback of this method is that the gain in this document is defined as energy mismatch compensation between the model and the realizations. Therefore, no separation is made between the acoustical properties of the noise (e.g., spectral shape) and the noise energy (e.g., loudness of the sound). Since the noise energy is part of the model, and is fixed for each HMM state, relatively large numbers of states are required to improve the modeling of the energy variations. Further, this method cannot successfully cope with noise types which have not been modeled during the training process.
In yet another document by Sriram Srinivasan et al., “Codebook-based Bayesian speech enhancement”, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, March 2005, pp. 1077-1080, which is hereby incorporated by reference in its entirety, codebooks are used.
In the codebook-based method, the spectral shapes of speech and noise, represented by linear prediction (LP) coefficients, are modeled in the prior speech and noise models. The noise variance and the speech variance are estimated instantaneously for each signal block, under the assumption of small modeling errors. The speech variance and the noise variance are both estimated for each combination of a speech codebook entry and a noise codebook entry. Since a large speech codebook (1024 entries in the paper) is required, this calculation is computationally demanding and requires more processing power than is available in, for example, a state of the art hearing aid. For good performance in known noise environments, the codebook-based method requires off-line optimized noise codebooks. For unknown environments, the method relies on a fall-back noise estimation algorithm such as the R. Martin method referred to above. The limitations of the fall-back method would, thus, also apply to the codebook-based method in unknown noise environments.
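The computational burden of evaluating every codebook combination can be illustrated with a small sketch. It is not the estimator of the cited paper: a plain least-squares fit of two gains to a power spectrum is swapped in for illustration, and the function names, the toy codebooks and the observed spectrum are all assumptions.

```python
def fit_gains(observed, speech_shape, noise_shape):
    """Least-squares gains (g_s, g_n) so that g_s*speech + g_n*noise ~ observed."""
    a = sum(s * s for s in speech_shape)
    b = sum(s * n for s, n in zip(speech_shape, noise_shape))
    c = sum(n * n for n in noise_shape)
    d = sum(s * y for s, y in zip(speech_shape, observed))
    e = sum(n * y for n, y in zip(noise_shape, observed))
    det = a * c - b * b
    if abs(det) < 1e-12:
        return None  # shapes are (nearly) proportional: gains not identifiable
    # Closed-form solution of the 2x2 normal equations.
    return (c * d - b * e) / det, (a * e - b * d) / det


def exhaustive_search(observed, speech_codebook, noise_codebook):
    """Try every (speech, noise) pair; return the best fit and the pair count."""
    best = None
    evaluations = 0
    for i, s_shape in enumerate(speech_codebook):
        for j, n_shape in enumerate(noise_codebook):
            evaluations += 1
            gains = fit_gains(observed, s_shape, n_shape)
            if gains is None or min(gains) < 0.0:
                continue  # variances cannot be negative
            g_s, g_n = gains
            err = sum((y - g_s * s - g_n * n) ** 2
                      for y, s, n in zip(observed, s_shape, n_shape))
            if best is None or err < best[0]:
                best = (err, i, j, g_s, g_n)
    return best, evaluations


# Toy example: two speech shapes, one noise shape, three frequency bins.
speech_codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
noise_codebook = [[1.0, 1.0, 1.0]]
observed = [0.5, 2.5, 0.5]  # equals 2.0 * speech[1] + 0.5 * noise[0]
best, evaluations = exhaustive_search(observed, speech_codebook, noise_codebook)
```

The number of gain fits per signal block equals the product of the two codebook sizes. With the 1024-entry speech codebook mentioned above and a noise codebook of N entries (N being a hypothetical size), this amounts to 1024 × N fits for every block.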
It is known that the overall characteristics of general speech may to a certain extent be learned reasonably well from a (sufficiently rich) database of speech. However, noise can be very non-stationary and may vary to a large extent in real-world situations, since it can represent anything except for the speech that the listener is interested in. It will be very hard to capture all of this variation in an initial learning stage. Thus, while the two last-mentioned methods of speech enhancement perform better under non-stationary noise conditions than the more traditional, initially mentioned methods, they are based on models trained using recorded signals, and their overall performance naturally depends strongly on the accuracy of the models obtained during the training process. These two last-mentioned methods are, thus, apart from being computationally cumbersome, unable to perform a dynamic adaptation to changing noise characteristics, which is necessary for accurate real-world speech enhancement performance.