Speech recognizers receive a speech signal input and generate a recognition result indicative of the speech contained in the speech signal. Speech synthesizers receive data indicative of a speech signal, and synthesize speech based on the data. Both of these speech-related systems can encounter difficulty when the speech signal is corrupted by noise. Some work has therefore been done to remove noise from speech signals.
In order to remove additive noise from a speech signal, many speech enhancement algorithms first make an estimate of the spectral properties of the noise in the signal. One current method by which this is done is to first segment the noisy speech signal into non-overlapping segments that are either speech segments, which contain voiced speech, or non-speech segments, which do not contain voiced speech. Then, only the non-speech segments are used to estimate the spectral properties of the noise present in the signal.
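The segment-based noise estimate described above can be illustrated with a minimal sketch. It assumes the speech/non-speech label for each segment is already available (the `is_speech` list is hypothetical and stands in for the output of a speech detection algorithm), and it uses a naive discrete Fourier transform for clarity rather than speed. The function names and the choice of averaging power spectra are illustrative assumptions, not a description of any particular system.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of one segment, for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def estimate_noise_spectrum(signal, frame_len, is_speech):
    """Average the power spectra of the segments labelled non-speech.

    signal    : list of samples
    frame_len : length of each non-overlapping segment, in samples
    is_speech : one boolean per segment (hypothetical detector output)
    """
    segments = [signal[i:i + frame_len]
                for i in range(0, len(signal) - frame_len + 1, frame_len)]
    noise_segments = [seg for seg, speech in zip(segments, is_speech)
                      if not speech]
    if not noise_segments:
        raise ValueError("no non-speech segments available")
    spectra = [power_spectrum(seg) for seg in noise_segments]
    # Average bin by bin across all non-speech segments.
    return [sum(column) / len(spectra) for column in zip(*spectra)]
```

Because only the non-speech segments contribute, any change in the noise that occurs during a speech segment leaves this estimate unchanged.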
This type of system, however, has several drawbacks. One drawback is that a speech detection algorithm must be used to identify those segments which contain speech and distinguish them from those segments which do not contain speech. This speech detection algorithm usually requires a model of additive noise, which makes the noise estimation problem somewhat circular. That is, in order to distinguish speech segments from non-speech segments, a noise model is required. However, in order to derive a noise model, the signal must be divided into speech segments and non-speech segments. Another drawback is that if the character of the noise changes during the speech segments, that change will be entirely missed in the model. Therefore, this type of noise modeling technique is generally only applicable to stationary noise, whose spectral properties do not change over time.
Another current approach to developing a noise model is to also develop a model that reflects how speech and noise change over time, and then to estimate speech and noise simultaneously. This can work fairly well when the spectral character of the noise differs from that of speech, and when the noise changes slowly over time. However, this type of system is very computationally expensive to implement and requires a model for the evolution of noise over time. When the noise does not correspond closely to the model, or when the model is inaccurately estimated, this type of speech enhancement fails.
Other current models that are used in speech tasks perform pitch tracking. These types of models track the pitch in a speech signal and use the pitch to enhance speech. These current pitch-based enhancement algorithms use discrete Fourier transforms. The speech signal is broken into contiguous overlapping speech segments of approximately 25 millisecond duration. Frequency analysis is then performed on these overlapping segments to obtain a pitch value corresponding to each segment (or frame). More specifically, these types of algorithms locate peaks in the frequency analysis of each 25 millisecond frame. The speech signal will generally have peaks at the primary (fundamental) frequency and at its harmonics. These types of pitch-based speech enhancement algorithms then select the portions of the noisy speech signal that correspond to those peaks and use those portions as the speech signal.
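The frame-based frequency analysis described above can be sketched as follows. This is a simplified illustration only: the sample rate, the 10 millisecond hop, the Hann window, the 60 to 400 Hz search range, and the rule of taking the strongest spectral peak in that range as the pitch are all assumptions introduced here, and the function names are hypothetical. The sketch covers only the per-frame pitch estimate, not the subsequent selection of signal portions.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Break the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_at(frame, freq_hz, sample_rate):
    """Magnitude of the Hann-windowed Fourier transform at one frequency."""
    n = len(frame)
    acc = 0j
    for t, x in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1))  # Hann window
        acc += w * x * cmath.exp(-2j * math.pi * freq_hz * t / sample_rate)
    return abs(acc)


def pitch_track(signal, sample_rate=8000, frame_ms=25, hop_ms=10,
                fmin=60, fmax=400, step_hz=2):
    """Return one pitch estimate (Hz) per 25 ms frame.

    The pitch of a frame is taken to be the candidate frequency with the
    largest spectral magnitude, which assumes the fundamental is the
    strongest component in the search range.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    candidates = range(fmin, fmax + 1, step_hz)
    track = []
    for frame in split_frames(signal, frame_len, hop):
        best = max(candidates,
                   key=lambda f: magnitude_at(frame, f, sample_rate))
        track.append(best)
    return track
```

For example, a 50 millisecond signal made of a 150 Hz fundamental and two weaker harmonics yields three frames, each with an estimate close to 150 Hz.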
However, these types of algorithms suffer from disadvantages as well. For instance, there can be added noise at the peaks, which will not be removed from the speech signal. Therefore, the speech signal will still be noisy. In addition, the pitch of the speech is not constant, even over the 25 millisecond analysis frame. In fact, the pitch of the speech signal can vary by several percent in that time. Because the speech signal does not contain a constant pitch over the analysis frame, the peaks are not sharp, but instead are relatively broad. This leads to a reduction in the resolution achieved by the pitch tracker.
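The broadening of the peaks can be illustrated numerically. The sketch below compares one 25 millisecond frame of a single harmonic at constant pitch with the same harmonic when the pitch rises linearly by 5 percent over the frame. The 8 kHz sample rate, the 200 Hz pitch, the choice of the 15th harmonic, the Hann window, and the -3 dB width measure are all assumptions made for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed)
FRAME_SEC = 0.025    # 25 ms analysis frame, as in the text


def harmonic_frame(f0, harmonic, rel_change):
    """One frame of a single harmonic whose pitch rises linearly.

    rel_change is the fractional pitch increase over the frame
    (0.05 means the pitch ends 5 percent higher than it starts).
    """
    n = round(SAMPLE_RATE * FRAME_SEC)
    slope = f0 * rel_change / FRAME_SEC  # pitch slope in Hz per second
    frame = []
    for i in range(n):
        t = i / SAMPLE_RATE
        phase = 2 * math.pi * harmonic * (f0 * t + 0.5 * slope * t * t)
        frame.append(math.sin(phase))
    return frame


def windowed_magnitude(frame, freq_hz):
    """Magnitude of the Hann-windowed Fourier transform at one frequency."""
    n = len(frame)
    acc = 0j
    for i, x in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
        acc += w * x * cmath.exp(-2j * math.pi * freq_hz * i / SAMPLE_RATE)
    return abs(acc)


def peak_height_and_width(frame, f_lo, f_hi):
    """Return (peak magnitude, -3 dB width in Hz) on a 1 Hz grid."""
    mags = [windowed_magnitude(frame, f) for f in range(f_lo, f_hi + 1)]
    peak = max(mags)
    width = sum(1 for m in mags if m >= peak / math.sqrt(2))
    return peak, width


# 15th harmonic of a 200 Hz pitch: steady versus rising by 5 percent.
steady = peak_height_and_width(harmonic_frame(200, 15, 0.00), 2900, 3250)
gliding = peak_height_and_width(harmonic_frame(200, 15, 0.05), 2900, 3250)
print("steady  (peak, width Hz):", steady)
print("gliding (peak, width Hz):", gliding)
```

Under these assumptions the gliding harmonic produces a lower and wider peak than the steady one. The effect grows with harmonic number, because a given percentage change in pitch moves the higher harmonics by more hertz.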
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.