Impulsive interference is a process characterized by bursts of one or more short pulses whose amplitudes, durations and times of occurrence are random. Systems that process human speech signals, such as automatic speech recognition (ASR) systems, and that are used in noisy environments, such as automobiles, may be subject to impulsive interference, for example from road bumps or from wind buffets through open windows. Mobile communication devices and other microphone-based systems used in windy environments or combat zones are further examples of systems that are subjected to impulsive interference.
Conventional single-channel noise suppression algorithms are typically able to suppress stationary, i.e., continuous, noises, such as car engine noise, because these stationary noises can be relatively easily distinguished from speech signals. However, a large class of impulsive interferences exhibits highly non-stationary characteristics, much like speech signals, and cannot, therefore, be suppressed using standard single-channel noise reduction algorithms. In fact, applying standard single-channel noise reduction algorithms when impulsive interferences are present often reduces speech recognition performance and ease of use.
Wind noise can be particularly problematic. For example, wind noise can occur even in quiet surroundings, because it can be generated directly within the capsule of a microphone. Thus, a user of the microphone may not even be aware of the problem and may not, therefore, compensate for the noise, such as by speaking louder. Multiple-microphone systems can, in some cases, suppress wind noise generated within one of the microphones. However, many important applications require only a single microphone and are not, therefore, amenable to multi-microphone solutions.
Some time-domain approaches for non-stationary noise reduction exist. So-called templates or prototypes have been proposed (e.g., [2], [3]) for restoring old recordings by removing transients. Vaseghi [2] proposes a detection method that includes a matched filter for a respective template, followed by removal with an interpolator. Restoring old recordings does not, however, have to be performed in real time. Therefore, non-causal filtering can be employed in these contexts, unlike in the applications contemplated above. Godsill uses a statistical approach and models signal and interference as two autoregressive (AR) processes excited by two independent and identically distributed (i.i.d.) Gaussian processes [3]. Removal is performed by tracing the trajectory of the desired-signal component of a Kalman filter using the aforementioned models.
A more recent publication on this topic, dedicated to the removal of wind noise in particular, is [4] by King and Atlas. The proposed concept relies completely on a computationally expensive least-squares-harmonic (LSH) pitch estimate, as proposed in [5]. (“Pitch” or “pitch frequency” here means a fundamental or other single frequency component of a signal. For example, a speech signal of an uttered vowel sound contains a pitch frequency and typically several other frequencies that are harmonically related to the pitch frequency. The pitch frequency can vary between the beginning and the end of the utterance.) The mismatch of the LSH speech model, together with an energy constraint, provides the evidence used for interference detection. In the absence of voiced speech, a simple high-pass filter at about 4 kHz is applied to cut off all wind noise. In the presence of voiced speech, the wind noise is removed by low-order comb filters applied to sub-band signals that have been demodulated to base band. Afterwards, segments of voiced speech are re-synthesized. If a sufficiently good estimate of the fundamental frequency (pitch) is available, comb filtering can effectively reduce any type of broadband noise in the gaps of the harmonic speech spectrum, including wind noise. Pitch-adaptive filtering for speech enhancement is, however, a well-known technique [1], and obtaining an accurate and robust pitch estimate from noisy speech signals is a difficult task in practice.
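The effect of comb filtering between pitch harmonics can be illustrated with a minimal sketch. This is not the sub-band scheme of [4]; it is a plain time-domain feed-forward comb, and the sample rate, pitch value and function names are assumptions chosen for illustration only. With a known pitch period, a component on a harmonic passes unchanged, while a component midway between two harmonics is cancelled.

```python
import math

def comb_filter(x, period):
    """Feed-forward comb: y[n] = 0.5 * (x[n] + x[n - period])."""
    return [0.5 * (x[n] + (x[n - period] if n >= period else 0.0))
            for n in range(len(x))]

def mean_power(x):
    return sum(v * v for v in x) / len(x)

def steady_state_gain(freq_hz, fs=8000, pitch_hz=250, length=1600):
    """Output/input power ratio of the comb for a sinusoid at freq_hz.

    fs, pitch_hz and length are hypothetical values; the pitch is
    assumed to be known exactly, which is the hard part in practice.
    """
    period = fs // pitch_hz  # pitch period in samples (32 here)
    x = [math.sin(2.0 * math.pi * freq_hz * n / fs) for n in range(length)]
    y = comb_filter(x, period)
    # Skip the first `period` samples, where the delay line is still empty.
    return mean_power(y[period:]) / mean_power(x[period:])

gain_on_harmonic = steady_state_gain(250.0)        # on the first harmonic
gain_between_harmonics = steady_state_gain(375.0)  # midway between harmonics
```

The same structure shows why the approach depends on the pitch estimate: if `period` is wrong, the pass bands no longer line up with the speech harmonics.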
In 2009, Nemer and Leblanc (Broadcom Corp.) proposed detecting wind noise based on linear prediction [7]. They observed that wind may be modeled well using a low-order predictor, since wind has no harmonic structure. For speech, however, a higher predictor order is necessary. This difference can be used to distinguish speech from wind noise, so that a suppression filter can be designed. See, for example, Pat. Publ. No. US 2010/0223054.
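The observation about predictor order can be made concrete with a small sketch. It is not the detector of [7]; the decision rule, the predictor orders, the threshold and the synthetic test signals below are all assumptions made for illustration. The sketch uses the autocorrelation method with the Levinson-Durbin recursion and compares how much the prediction error drops when going from a low to a high predictor order.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n)) / n
            for lag in range(max_lag + 1)]

def prediction_errors(r, order):
    """Levinson-Durbin recursion; returns error energies for orders 0..order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    errors = [err]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
        errors.append(err)
    return errors

def order_gain_ratio(x, low_order=2, high_order=32):
    """Prediction gain at high_order divided by the gain at low_order."""
    errors = prediction_errors(autocorrelation(x, high_order), high_order)
    return errors[low_order] / errors[high_order]

def is_wind_like(x, threshold=2.0):
    # Hypothetical rule: little benefit from a high order suggests wind.
    return order_gain_ratio(x) < threshold

random.seed(0)
fs, n = 8000, 4000
# Wind stand-in (assumption): first-order low-pass filtered white noise.
wind, state = [], 0.0
for _ in range(n):
    state = 0.95 * state + random.gauss(0.0, 1.0)
    wind.append(state)
# Voiced-speech stand-in (assumption): 15 harmonics of 250 Hz plus weak noise.
voiced = [sum(math.sin(2.0 * math.pi * 250.0 * m * t / fs) for m in range(1, 16))
          + random.gauss(0.0, 0.05) for t in range(n)]

wind_ratio = order_gain_ratio(wind)
voiced_ratio = order_gain_ratio(voiced)
```

For the noise-like signal the ratio stays close to one, because two coefficients already capture its spectral envelope, whereas the harmonic signal only becomes predictable once the order spans a full pitch period.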
Kotta Manohar, et al., discuss a post-processing scheme to be applied to short-time spectral attenuation (STSA) speech enhancement algorithms in “Speech enhancement in nonstationary noise environments using noise properties,” published by Elsevier in Speech Communication 48 (2006) 96-109.
T. A. Mahmoud, et al., describe an edge-guided morphological filter to sharpen digital images in “Edge-Detected Guided Morphological Filter for Image Sharpening,” published by Hindawi Publishing Corporation in EURASIP Journal on Image and Video Processing, Volume 2008, Article ID 970353.
Petros Maragos discusses morphological filtering for image enhancement and feature detection in chapter 3.3 of a book titled “The Image and Video Processing Handbook,” 2d edition, edited by A. C. Bovik, published by Elsevier Academic Press, 2005, pp. 135-156.
Hetherington, et al., propose another approach for wind buffet suppression, which is available from the Wavemakers division of QNX Software Systems GmbH & Co. KG, a subsidiary of Research In Motion Ltd. See, for example, U.S. Pat. No. 7,895,036, U.S. Pat. No. 7,885,420, Pat. Publ. No. US 2011/0026734 and Pat. Publ. No. EP 1 450 354 B1. The core idea of their approach is a rather simple spectral model for wind. In particular, the wind model constitutes a straight line in a log-spectrum with a negative slope at low frequencies, up to the point where the spectral energy is dominated by background noise. Various similarity measures between the model and a signal frame are used to classify the input frame as wind, wind and speech or wind only. Furthermore, the model's spectral shape can be used for noise suppression. The generation of a long-term estimate by averaging the model's instantaneous estimates from unvoiced frames is also proposed.
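A straight-line spectral model of this kind can be sketched as an ordinary least-squares fit to the low-frequency part of a log-magnitude spectrum. The following is only an illustration under stated assumptions: the function names, the residual-based similarity measure and the 3 dB threshold are hypothetical and are not taken from the cited patents.

```python
import math

def fit_line(xs, ys):
    """Least-squares line fit; returns (slope, intercept, rms_residual)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rms = math.sqrt(sum((y - (slope * x + intercept)) ** 2
                        for x, y in zip(xs, ys)) / n)
    return slope, intercept, rms

def looks_like_wind(freqs_hz, log_mag_db, max_rms_db=3.0):
    """Hypothetical similarity test: negative slope and a close line fit."""
    slope, _, rms = fit_line(freqs_hz, log_mag_db)
    return slope < 0.0 and rms <= max_rms_db

# Synthetic frames (assumptions): 20 bins covering 0 to 475 Hz.
freqs = [25.0 * i for i in range(20)]
wind_frame = [40.0 - 0.1 * f for f in freqs]                # falling line in dB
rippled_frame = [20.0 + 15.0 * (-1) ** i for i in range(20)]  # strong ripples

wind_slope, wind_intercept, wind_rms = fit_line(freqs, wind_frame)
```

The fitted line could then serve as the instantaneous noise estimate for the bins it covers; how far up in frequency the fit should extend is left open here, as it is in the description above.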
In addition to the linear model, the approach first detects the pitch-frequency-dependent ripples in the signal spectrum and then protects them from being suppressed by the interference reduction. A practical implementation of this mechanism detects peaks in the amplitude spectrum and measures each peak's width. Spectrally narrow and temporally slowly changing peaks indicate voiced speech, whereas spectrally broad and quickly changing ones indicate wind.
Furthermore, the harmonic relationship between the peaks along the frequency axis is measured using a discrete cosine transform (DCT) [6]. If the DCT is applied to the logarithmic spectrum, this directly translates into a cepstrum-based pitch estimation. Such pitch tracking methods were proposed in the late 1960s.
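The link between a DCT of the log-spectrum and the pitch can be shown with a short sketch. It assumes a DCT-II over N bins that cover 0 Hz to half the sample rate, in which case regular ripples spaced f0 apart produce a peak at DCT index fs/f0. The function names, the search range and the synthetic spectrum are assumptions for illustration, not details of the cited method.

```python
import math

def dct2(x):
    """Unnormalized DCT-II: C[n] = sum_k x[k] * cos(pi * n * (k + 0.5) / N)."""
    size = len(x)
    return [sum(x[k] * math.cos(math.pi * n * (k + 0.5) / size)
                for k in range(size))
            for n in range(size)]

def estimate_pitch_hz(log_spectrum, fs, f_min=60.0, f_max=500.0):
    """Pick the strongest DCT coefficient inside a plausible pitch range.

    log_spectrum is assumed to hold N bins spanning 0 Hz to fs / 2.
    """
    size = len(log_spectrum)
    coeffs = dct2(log_spectrum)
    n_lo = max(1, math.ceil(fs / f_max))
    n_hi = min(size - 1, int(fs / f_min))
    best = max(range(n_lo, n_hi + 1), key=lambda n: coeffs[n])
    return fs / best

# Synthetic log-spectrum (assumption): mild downward tilt plus harmonic
# ripples with peaks at multiples of 250 Hz.
fs, size, true_pitch = 8000, 256, 250.0
bin_hz = fs / (2.0 * size)
log_spectrum = [-0.01 * k + math.cos(2.0 * math.pi * k * bin_hz / true_pitch)
                for k in range(size)]

pitch_estimate = estimate_pitch_hz(log_spectrum, fs)
```

On clean, strictly periodic ripples the peak is unambiguous; with real noisy frames the ripples are weaker and irregular, which is why such estimates become unreliable in the presence of interference.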
This method is thus built on the assumed knowledge of the pitch frequency, together with a simple spectral model. Signal components that have not been found to belong to the desired signal are suppressed. The suppression is implemented by means of spectral weighting in the short-time Fourier transform domain. The wind noise suppression may, therefore, be used in conjunction with regular noise reduction.
Unfortunately, these prior art methods for reducing impulsive interferences suffer from one or more disadvantages. For example, the methods described by Hetherington require considering pitch of the speech signal in some way.