This invention relates generally to user interfaces and, more specifically, to speech detection.
In speech detection systems, the energy contour of an input signal is a major factor in detecting the beginning and ending of speech sequences. This is because the level of the input speech data is often greater than the level of the background noise. An energy contour-based speech detection algorithm (SDA) comprises noise evaluation, beginning-of-speech detection, and end-of-speech detection.
During the initial second after the system starts, the input signal to an SDA is assumed to consist only of noise. At this point, the input noise level is set equal to the level of the input signal. If the energy of the current signal rises above the energy of the input noise level, speech is assumed to be present in the current signal. If the energy of the current signal drops a threshold amount below the initial noise level, speech is assumed to be absent from the current signal.
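The conventional energy-based scheme described above can be illustrated with a minimal sketch. It is not taken from any particular SDA implementation: the frame-based processing, the mean-square energy measure, and the names frame_energy, detect_speech_segments, noise_frames, and end_margin are assumptions introduced here for illustration only.

```python
from typing import List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed energy measure)."""
    return sum(x * x for x in frame) / len(frame)


def detect_speech_segments(
    frames: Sequence[Sequence[float]],
    noise_frames: int = 1,    # hypothetical: leading frames assumed to be noise only
    end_margin: float = 0.0,  # hypothetical: the "threshold amount" for end of speech
) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs, with end exclusive."""
    # Noise evaluation: the noise level is taken from the initial frames.
    noise_energy = sum(frame_energy(f) for f in frames[:noise_frames]) / noise_frames

    segments: List[Tuple[int, int]] = []
    start = None
    for i, frame in enumerate(frames):
        e = frame_energy(frame)
        if start is None and e > noise_energy:
            # Beginning of speech: energy rises above the noise level.
            start = i
        elif start is not None and e < noise_energy - end_margin:
            # End of speech: energy drops a margin below the initial noise level.
            segments.append((start, i))
            start = None
    if start is not None:
        # Speech was still in progress when the input ended.
        segments.append((start, len(frames)))
    return segments


frames = [[0.1] * 4, [0.1] * 4, [1.0] * 4, [1.0] * 4, [0.0] * 4, [0.1] * 4]
segments = detect_speech_segments(frames, noise_frames=1, end_margin=0.005)
```

Because the noise level is fixed at start-up in this sketch, any later noise louder than the initial estimate would be classified as speech.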
The above process works well when the noise stays at a consistent level (e.g., white noise). However, in many environments the noise level does not remain consistent. For example, if the environment is a vehicle, extraneous noises such as car horns, sirens, passing truck noise, etc. can be included in the input signal to be evaluated by a Speech Recognition Engine (SRE). Absent an appropriate mechanism to adjust for the extraneous noises, the SRE will process the noise as if it were speech, resulting in suboptimal speech recognition. Therefore, there exists a need for better speech detection in a noisy environment.
The present invention comprises a system, method, and computer program product for performing speech detection. The method first receives a sound signal and determines whether the energy level of the received sound signal is above a threshold energy value. If the energy level of the received signal is above the threshold energy value, the method determines a predictive signal of the received signal, subtracts the predictive signal from the received signal, and determines whether the result of the subtraction indicates the presence of speech. If it is determined that no speech is present, the threshold energy value is set to the energy level of the present received signal. If it is determined that the result of the subtraction indicates the presence of speech, the received signal is sent to a speech recognition engine.
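The decision flow summarized above can be sketched as follows. This is an illustrative sketch only, under stated assumptions: the summary does not specify how the predictive signal is computed or how the result of the subtraction is judged, so the previous-sample predictor, the residual-energy test, and the names process_frame, previous_sample_predictor, and residual_threshold are hypothetical stand-ins rather than the claimed implementation.

```python
from typing import Callable, List, Sequence, Tuple


def energy(signal: Sequence[float]) -> float:
    """Mean squared amplitude (assumed energy measure)."""
    return sum(x * x for x in signal) / len(signal)


def previous_sample_predictor(signal: Sequence[float]) -> List[float]:
    """Hypothetical stand-in predictor: each sample is predicted by the one before it."""
    return [signal[0]] + list(signal[:-1])


def process_frame(
    frame: Sequence[float],
    threshold: float,
    residual_threshold: float,  # hypothetical decision parameter
    predict: Callable[[Sequence[float]], List[float]] = previous_sample_predictor,
) -> Tuple[bool, float]:
    """Return (speech_detected, updated_threshold) for one received frame."""
    frame_energy = energy(frame)
    if frame_energy <= threshold:
        # Energy is not above the threshold: no further analysis is performed.
        return False, threshold

    # Determine the predictive signal and subtract it from the received signal.
    predicted = predict(frame)
    residual = [x - p for x, p in zip(frame, predicted)]

    # Assumption: a large residual energy is taken to indicate speech.
    if energy(residual) > residual_threshold:
        # The frame would be sent to the speech recognition engine here.
        return True, threshold

    # No speech: the threshold is set to the energy level of the present signal.
    return False, frame_energy


quiet = process_frame([0.01] * 8, threshold=0.1, residual_threshold=0.05)
steady = process_frame([1.0] * 8, threshold=0.1, residual_threshold=0.05)
varying = process_frame([1.0, -1.0] * 4, threshold=0.1, residual_threshold=0.05)
```

In this sketch a loud but steady frame leaves almost no residual after the subtraction, so it raises the threshold instead of being forwarded, while a loud, rapidly varying frame is treated as speech. A practical system would presumably use a more capable predictor; the one-sample predictor is chosen here only to keep the example short.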
In accordance with further aspects of the invention, the speech recognition engine generates control system commands for controlling one or more system components. The system components are vehicle system components.
As will be readily appreciated from the foregoing summary, the invention provides an improved method for performing preprocessing of sound signals for more efficient use in subsequent speech processing.