Removing additive noise from acoustic signals, such as speech signals, has a number of applications in telephony, audio voice recording, and electronic voice communication. Noise is pervasive in urban environments, factories, airplanes, vehicles, and the like.
It is particularly difficult to remove time-varying noise, which more accurately reflects real noise in the environment. Typically, non-stationary noise cannot be cancelled by suppression techniques that use a static noise model. Conventional approaches, such as spectral subtraction and Wiener filtering, typically use static or slowly-varying noise estimates, and are therefore restricted to stationary or quasi-stationary noise.
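The limitation of a static noise estimate can be illustrated with a minimal sketch of magnitude spectral subtraction. The function names, the toy spectrograms, and the simplification that speech and noise magnitudes add directly are assumptions made for illustration only; they are not taken from any particular conventional system.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Subtract one fixed noise magnitude spectrum from every frame.

    noisy_mag: (n_freq, n_frames) magnitude spectrogram of the noisy signal
    noise_mag: (n_freq,) static noise magnitude estimate
    """
    cleaned = noisy_mag - noise_mag[:, None]
    # Clamp negative magnitudes, which have no physical meaning.
    return np.maximum(cleaned, floor)

def mean_abs_error(speech_mag, noise_frames):
    """Enhance with a static (time-averaged) noise estimate and score it."""
    # Toy assumption: magnitudes of speech and noise add directly.
    noisy = speech_mag + noise_frames
    static_estimate = noise_frames.mean(axis=1)
    enhanced = spectral_subtract(noisy, static_estimate)
    return float(np.mean(np.abs(enhanced - speech_mag)))

# Hypothetical 2-bin, 4-frame magnitude spectrograms.
speech = np.array([[1.0, 1.0, 1.0, 1.0],
                   [2.0, 2.0, 2.0, 2.0]])
stationary = np.full((2, 4), 0.5)            # same noise in every frame
varying = np.array([[0.0, 1.0, 0.0, 1.0],    # same average level,
                    [0.0, 1.0, 0.0, 1.0]])   # but changing frame to frame

err_stationary = mean_abs_error(speech, stationary)
err_varying = mean_abs_error(speech, varying)
```

In this toy setting both noise signals have the same average spectrum, so the static estimate is identical. The stationary case is recovered exactly, whereas the time-varying case leaves a residual error of 0.5 in every bin, because the single estimate over-subtracts in some frames and under-subtracts in others.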
Speech includes harmonic and non-harmonic sounds. The harmonic sounds can have different fundamental frequencies over time. Speech can have energy across a wide range of frequencies. The spectra of non-stationary noise can be similar to those of speech. Therefore, in a speech denoising application, where one “source” is speech and the other “source” is additive noise, the overlap between the speech and noise models degrades the denoising performance.
Model-based speech enhancement methods, which rely on separately modeling the speech and the noise, have been shown to be powerful in many different problem settings. When the structure of the noise can be arbitrary, which is often the case in practice, model-based methods have to focus on developing good speech models, whose quality is a key to their performance.
In terms of modeling strategy, two broad approaches exist. One approach is based on discrete-state modeling, such as Gaussian mixture models. Another approach uses continuously-weighted combinations of basis functions, such as non-negative matrix factorizations and their extensions. The general trade-off is that discrete-state approaches can be more precise, especially in their temporal dynamics, whereas continuous approaches can be more flexible with respect to gain and subspace variability.
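As an illustration of the continuous approach, the following is a minimal sketch of non-negative matrix factorization using the standard multiplicative updates for the Euclidean cost. The function name `nmf`, its parameters, and the random toy data are hypothetical and chosen only for this example. Each column of `H` holds the continuous, non-negative weights with which the basis spectra in `W` are combined for one frame, in contrast to a discrete-state model that would select a single state per frame.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (n_freq x n_frames) as W @ H.

    W: (n_freq, rank) non-negative basis spectra
    H: (rank, n_frames) non-negative, continuously varying weights
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy "spectrogram" built from two non-negative components.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))

W, H = nmf(V, rank=2)
reconstruction_error = np.linalg.norm(V - W @ H)
```

Because the weights in `H` are free to scale each basis spectrum independently in every frame, this kind of model accommodates gain variation naturally, but the sketch as written imposes no temporal structure on the weights.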
For example, U.S. Pat. No. 8,015,003 describes denoising a mixed signal, e.g., speech and noise signals, using a model that includes training basis matrices of a training acoustic signal and a training noise signal, and statistics of weights of the training basis matrices. In general, however, conventional methods that focus on slow-changing noise are inadequate for fast-changing non-stationary noise, such as that experienced when using a microphone in a noisy environment. In addition, compensating for fast-changing additive noise requires high computational power, to the degree that methods that can compensate for the full multitude of possible noise and speech variations may quickly become computationally prohibitive.
Therefore, it is desired to provide a dynamic and adaptive speech enhancement method.