This invention relates to a system and method for detecting speech in a signal containing both speech and noise and for removing noise from the signal.
In communication systems it is often desirable to reduce the amount of background noise in a speech signal. For example, one situation that may require background noise removal is a telephone signal from a mobile telephone. Background noise reduction makes the voice signal more pleasant for a listener and improves the outcome of coding or compressing the speech.
Various methods for reducing noise have been invented but the most effective methods are those which operate on the signal spectrum. Early attempts to reduce background noise included applying automatic gain to signal subbands such as disclosed by U.S. Pat. No. 3,803,357 to Sacks. This patent presented an efficient way of reducing stationary background noise in a signal via spectral subtraction. See also, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Transactions On Acoustics, Speech and Signal Processing, pp. 1391-1394, 1996.
Spectral subtraction involves estimating the power or magnitude spectrum of the background noise and subtracting it from the power or magnitude spectrum of the contaminated signal. The background noise is usually estimated during noise-only sections of the signal. This approach is fairly effective at removing background noise, but the remaining speech tends to have annoying artifacts, which are often referred to as “musical noise.” Musical noise consists of brief tones occurring at random frequencies and is the result of isolated noise spectral components that are not completely removed after subtraction. One method of reducing musical noise is to subtract some multiple of the noise spectral magnitude (this is referred to as spectral oversubtraction). Spectral oversubtraction reduces the residual noise components but also removes excessive amounts of the speech spectral components, resulting in speech that sounds hollow or muted.
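By way of illustration only, the subtraction and oversubtraction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any cited reference: the function name `spectral_subtract`, the oversubtraction factor `alpha`, the `floor` parameter, and the per-bin power values are all hypothetical.

```python
def spectral_subtract(noisy_power, noise_power, alpha=1.0, floor=0.0):
    """Subtract alpha times the noise estimate from each spectral bin.

    alpha = 1.0 is plain spectral subtraction; alpha > 1.0 is oversubtraction.
    Each result is clamped so it never falls below floor * noisy power.
    All names and defaults here are illustrative assumptions.
    """
    return [max(p - alpha * n, floor * p)
            for p, n in zip(noisy_power, noise_power)]


# Hypothetical power spectrum of one frame and a flat noise estimate.
noisy = [4.0, 1.2, 9.0, 0.8]
noise = [1.0, 1.0, 1.0, 1.0]

plain = spectral_subtract(noisy, noise)            # leaves a small residue in bin 1
over = spectral_subtract(noisy, noise, alpha=2.0)  # removes it, but cuts bins 0 and 2 further
```

In this example the small residue left in the second bin by plain subtraction is the kind of isolated component that is heard as musical noise, while oversubtraction removes it at the cost of also reducing the stronger bins that would carry speech.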
A related method for background noise reduction is to estimate the optimal gain to be applied to each spectral component based on a Wiener or Kalman filter approach. The Wiener and Kalman filters attempt to minimize the expected error in the time signal. The Kalman filter requires knowledge of the type of noise to be removed and, therefore, it is not very appropriate for use where the noise characteristics are unknown and may vary.
The Wiener filter is calculated from an estimate of the speech spectrum as well as the noise spectrum. A common method of estimating the speech spectrum is via spectral subtraction. However, this causes the Wiener filter to produce some of the same artifacts evidenced in spectral subtraction-based noise reduction.
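As a rough sketch of this relationship, a per-bin Wiener gain can be written as the estimated speech power divided by the sum of the estimated speech and noise powers, with the speech power obtained by spectral subtraction. The function name `wiener_gain` and the example values are assumptions made for illustration.

```python
def wiener_gain(noisy_power, noise_power):
    """Per-bin Wiener gain using a spectral-subtraction speech estimate.

    speech estimate = max(noisy - noise, 0); gain = speech / (speech + noise).
    Illustrative only; names and handling of the zero case are assumptions.
    """
    gains = []
    for p, n in zip(noisy_power, noise_power):
        speech = max(p - n, 0.0)      # spectral-subtraction estimate of speech power
        denom = speech + n
        gains.append(speech / denom if denom > 0.0 else 0.0)
    return gains


gains = wiener_gain([4.0, 1.0, 0.5], [1.0, 1.0, 1.0])
```

Because the speech estimate inherits the errors of the subtraction step, bins in which the noise momentarily exceeds its estimate receive a nonzero gain, which is one way the same artifacts can reappear.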
The musical or flutter noise problem was addressed by McAulay and Malpass (1980) by smoothing the gain of the filter over time. See, “Speech Enhancement Using a Soft-Decision Noise Suppression Filter”, IEEE Transactions on Acoustics, Speech, and Signal Processing 28(2): 137-145. However, if the gain is smoothed enough to eliminate most of the musical noise, the voice signal is also adversely affected.
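The trade-off can be illustrated with a first-order recursive average of the gain from frame to frame. This smoothing rule is an assumption chosen for simplicity and is not asserted to be the method of the cited reference; the name `smooth_gains` and the smoothing constant `beta` are hypothetical.

```python
def smooth_gains(gain_frames, beta=0.5):
    """Smooth per-bin gains over time with a first-order recursive average.

    smoothed[t] = beta * smoothed[t-1] + (1 - beta) * gain[t]
    The first frame is passed through unchanged. Illustrative assumption only.
    """
    smoothed = []
    prev = None
    for frame in gain_frames:
        if prev is None:
            cur = list(frame)
        else:
            cur = [beta * g_prev + (1.0 - beta) * g
                   for g_prev, g in zip(prev, frame)]
        smoothed.append(cur)
        prev = cur
    return smoothed


# One bin over four frames: a single isolated gain spike in the second frame.
result = smooth_gains([[0.0], [1.0], [0.0], [0.0]], beta=0.5)
```

The isolated spike is halved, which softens a musical tone, but the same averaging would equally delay and smear the gain at a genuine speech onset.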
Other methods of calculating an “optimal gain” include minimizing expected error in the spectral components. For example, Ephraim and Malah (1985) achieve good results which are free from musical noise artifacts by minimizing the mean-square error in the short-time spectral components. See, “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator”, IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-33 (2): 443-445. However, their approach is much more computationally intensive than the Wiener filter or spectral subtraction methods. Derivative methods have also been developed which use look-up tables or approximation functions to perform similar noise reduction but with reduced complexity. These methods are disclosed in U.S. Pat. Nos. 5,012,519 and 5,768,473.
Also known is an auditory masking-based technique for reducing background signal noise, described by Virag (1995) and Tsoukalas, Mourjopoulos and Kokkinakis (1997). See, “Speech Enhancement Based On Masking Properties Of The Auditory System,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 796-799; and “Speech Enhancement Based On Audible Noise Suppression”, IEEE Transactions on Speech and Audio Processing 5(6): 497-514. These techniques require excessive computation capacity and do not produce the desired amount of noise reduction.
Other methods for noise reduction include estimating the spectral magnitude of speech components probabilistically as used in U.S. Pat. Nos. 5,668,927 and 5,577,161. These methods also require computations that are not performed very efficiently on low-cost digital signal processors.
Another aspect of the background noise reduction problem is determining when the signal contains only background noise and when speech is present. Speech detectors, often called voice activity detectors (VADs), are needed to aid in the estimation of the noise characteristics. VADs typically use many different measures to determine the likelihood of the presence of speech. These measures include signal amplitude, short-term signal energy, zero crossing count, signal-to-noise ratio (SNR), and SNR in spectral subbands. These measures may be smoothed and weighted in the speech detection process. The VAD decision may also be smoothed and modified to, for example, hang on for a short time after the cessation of speech.
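Two of the measures named above, and a simple hang-on rule, might be computed as in the following sketch. The function names, the energy threshold, and the hang-on length `hang_frames` are hypothetical choices made for illustration and are not taken from any cited patent.

```python
def frame_energy(frame):
    """Short-term energy: mean of the squared samples in one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def vad_with_hangover(energies, threshold, hang_frames=2):
    """Threshold each frame energy, then hold the decision after speech ends.

    A frame is marked active if its energy exceeds the threshold, or if it
    falls within hang_frames frames after the last frame that did.
    """
    decisions = []
    hang = 0
    for e in energies:
        if e > threshold:
            hang = hang_frames
            decisions.append(True)
        elif hang > 0:
            hang -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


frame = [1.0, -1.0, 1.0, -1.0]
energy = frame_energy(frame)
crossings = zero_crossings(frame)
decisions = vad_with_hangover([0.1, 2.0, 0.1, 0.1, 0.1, 0.1], threshold=1.0)
```

Here a single high-energy frame keeps the decision active for two further frames, which is one simple way of hanging on after the cessation of speech.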
U.S. Pat. No. 4,672,669 discloses the use of signal energy that is compared to various thresholds to determine the presence of voice. U.S. Pat. No. 5,459,814 discloses a voice detector in which multiple thresholds and multiple measures are used to provide a more accurate VAD decision. However, since speech levels and characteristics and background noise levels and characteristics change, a system with some intelligent control over the levels and VAD decision process is needed. One approach that tailors the VAD smoothing to known speech characteristics is disclosed in U.S. Pat. No. 4,357,491. However, this system is based on processing a signal's time samples; therefore, it does not make use of the unique frequency characteristics which distinguish speech from noise.
In summary, there are methods for reducing noise in speech which are efficient and simple but which produce excessive artifacts. There are also methods which do not produce the musical artifacts but which are computationally intensive. What is needed is an efficient, low-delay method of detecting when speech or voice is present in a signal.
The present invention is directed to a speech or voice activity detector (VAD) for detecting whether speech signals are present in individual time frames of an input signal. The VAD comprises a speech detector that receives as input the input signal, examines the input signal in order to generate a plurality of statistics that represent characteristics indicative of the presence or absence of speech in a time frame of the input signal, and generates an output based on the plurality of statistics representing a likelihood of speech presence in a current time frame. The VAD comprises a state machine coupled to the speech detector that has a plurality of states. The state machine receives as input the output of the speech detector and transitions between the plurality of states based on a state at a previous time frame and the output of the speech detector for the current time frame. The state machine generates as output a speech activity status signal based on the state of the state machine, which provides a measure of the likelihood of speech being present during the current time frame. The VAD is useful in a noise reduction system to remove or reduce noise from a signal containing speech (or a related information carrying signal) and noise.
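The arrangement of a speech detector feeding a state machine can be pictured with the following minimal sketch. It assumes, purely for illustration, four states (NOISE, ONSET, SPEECH, HANGOVER), a single likelihood value per frame, a fixed threshold, and a numeric status assigned to each state; none of these specifics is stated in the summary above, and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state set chosen for illustration.
    NOISE = 0
    ONSET = 1
    SPEECH = 2
    HANGOVER = 3


def next_state(state, likelihood, threshold=0.5):
    """Pick the next state from the previous state and the detector output."""
    speech_like = likelihood > threshold
    if state is State.NOISE:
        return State.ONSET if speech_like else State.NOISE
    if state is State.ONSET:
        return State.SPEECH if speech_like else State.NOISE
    if state is State.SPEECH:
        return State.SPEECH if speech_like else State.HANGOVER
    # state is HANGOVER: resume speech or fall back to noise
    return State.SPEECH if speech_like else State.NOISE


# Assumed mapping from state to a speech activity status value.
STATUS = {State.NOISE: 0.0, State.ONSET: 0.5,
          State.SPEECH: 1.0, State.HANGOVER: 0.5}


def run_vad(likelihoods, threshold=0.5):
    """Run the state machine over per-frame likelihoods, starting in NOISE."""
    state = State.NOISE
    statuses = []
    for likelihood in likelihoods:
        state = next_state(state, likelihood, threshold)
        statuses.append(STATUS[state])
    return statuses


statuses = run_vad([0.1, 0.9, 0.8, 0.2, 0.1])
```

In this sketch the output depends on the state rather than directly on the detector output, so a single speech-like frame yields only an intermediate status, and the status steps down through an intermediate value when speech ends.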
The above and other objects and advantages of the present invention will become more readily apparent when reference is made to the following description taken in conjunction with the accompanying drawings.