The present invention relates generally to speech detection systems and, more particularly, to a method for detecting speech using stochastic confidence measures on frequency spectrums extracted from a speech signal.
Speech recognition technology is now in wide use. Typically, speech recognition systems receive a time-varying speech signal representative of spoken words and phrases. These systems attempt to determine the words and phrases within the speech signal by analyzing components of the speech signal. As a first step, most speech recognition systems must separate those portions of the signal which convey spoken words from the non-speech portions of the signal. To this end, speech detection systems attempt to determine the beginning and ending boundaries of a word or group of words within the speech signal. Accurate and reliable determination of the beginning and ending boundaries of words or sentences poses a challenging problem, particularly when the speech signal includes background noise.
Speech detection systems generally rely on different kinds of information encapsulated in the speech signal to determine the location of an isolated word or group of words within the signal. A first group of speech detection techniques has been developed for analyzing the speech signal using time domain information of the signal. Typically, the intensity or amplitude of the speech signal is measured. Portions of the speech signal having an intensity greater than a minimum threshold are designated as being speech, whereas those portions of the speech signal having an intensity below the threshold are designated as being non-speech. Other similar techniques have been based on the detection of zero-crossing rate fluctuations or the peaks and valleys inside the signal.
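By way of illustration only, the time domain approach described above can be sketched as follows. The function names, the use of mean squared amplitude as the intensity measure, and the threshold value are assumptions made for this sketch; they are not taken from any particular prior system.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (assumed intensity measure)."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def classify_by_energy(frames, threshold):
    """Label each frame True (speech) when its energy exceeds the threshold."""
    return [frame_energy(f) > threshold for f in frames]


# Hypothetical example: one quiet frame followed by one loud frame.
frames = [[0.0, 0.01, -0.01, 0.0], [0.5, -0.6, 0.7, -0.5]]
labels = classify_by_energy(frames, threshold=0.01)
```

A fixed threshold of this kind is what makes such techniques sensitive to changes in the background noise level.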
A second group of speech detection algorithms relies on signal information extracted from the frequency domain. In these algorithms, the variation of the frequency spectrum is estimated and the detection is based on the frequency of this variation computed over successive frames. Alternatively, the variance of the energy in each frequency band is estimated and the detection of noise is based on when these variances go below a given threshold.
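The variance-based alternative can be sketched in the same spirit. This sketch assumes that the per-frame band energies are already available as lists of numbers, and that a window of frames is treated as noise only when every band variance falls below the threshold; the text above does not specify how the bands are combined, so that rule is an assumption.

```python
from statistics import pvariance


def band_variances(spectra):
    """Variance of the energy in each frequency band across successive frames.

    `spectra` is a list of frames, each a list of per-band energies.
    """
    return [pvariance(band) for band in zip(*spectra)]


def looks_like_noise(spectra, threshold):
    """Assumed rule: noise when every band variance is below the threshold."""
    return all(v < threshold for v in band_variances(spectra))


# Hypothetical band energies for three successive frames, two bands each.
steady = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
varying = [[1.0, 2.0], [5.0, 2.0], [1.0, 9.0]]
```

Here `looks_like_noise(steady, 0.1)` holds because the band energies do not change between frames, while `varying` fails the same check.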
Unfortunately, these speech detection techniques have been unreliable, particularly where a variable noise component is present in the speech signal. Indeed, it has been estimated that many of the errors occurring in a typical speech recognition system are the result of an inaccurate determination of the location of the words within the speech signal. To minimize such errors, the technique for locating words within the speech signal must be capable of reliably and accurately locating the boundaries of the words. Further, the technique must be sufficiently simple and quick to allow for real-time processing of the speech signal. The technique must also be capable of adapting to a variety of noise environments without any prior knowledge of the noise.
The present invention provides an accurate and reliable method for detecting speech from an input speech signal. A probabilistic approach is used to classify each frame of the speech signal as speech or non-speech. This speech detection method is based on a frequency spectrum extracted from each frame, such that the value for each frequency band is considered to be a random variable and each frame is considered to be an occurrence of these random variables. Using the frequency spectrums from a non-speech part of the speech signal, a known set of random variables is constructed. In this way, the known set of random variables is representative of the noise component of the speech signal.
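As a minimal sketch of how a known set of random variables might be built from the non-speech part of the signal, the following summarizes each frequency band by its mean and standard deviation over the noise frames. The use of a magnitude spectrum computed by a direct discrete Fourier transform, the choice of mean and standard deviation as the summary, and all function names are assumptions made for illustration; the summary above does not prescribe them.

```python
import cmath
import math
from statistics import mean, pstdev


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 of one frame (direct O(n^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def build_noise_model(noise_frames):
    """Per-band mean and standard deviation over frames known to be non-speech.

    Each band value is treated as a random variable and each frame as one
    occurrence of it.
    """
    spectra = [magnitude_spectrum(f) for f in noise_frames]
    bands = list(zip(*spectra))
    means = [mean(b) for b in bands]
    stds = [pstdev(b) for b in bands]
    return means, stds


# Hypothetical noise frames: two constant frames of four samples each.
means, stds = build_noise_model([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]])
```

In practice the noise frames would typically be taken from a stretch of the signal assumed to contain no speech, and a fast Fourier transform would replace the direct transform used here for brevity.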
Next, each unknown frame is evaluated as to whether or not it belongs to this known set of random variables. To do so, a unique random variable is formed from the set of random variables associated with the unknown frame. The unique variable is normalized with respect to the known set of random variables and then classified as either speech or non-speech using the "Test of Hypothesis". Thus, each frame that belongs to the known set of random variables is classified as non-speech and each frame that does not belong to the known set of random variables is classified as speech. This method does not rely on any delayed signal.
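One way such a classification could be carried out is sketched below. It is an illustrative assumption rather than the specific construction of the invention: the band values of the unknown frame are standardized against per-band noise means and standard deviations, combined into a single chi-square style statistic, normalized with a normal approximation, and compared against a one-sided critical value. The Gaussian assumption on the noise bands, the function names, and the critical value of 2.33 (roughly a 1% significance level) are all choices made for this sketch.

```python
import math


def frame_statistic(spectrum, noise_means, noise_stds):
    """Single normalized variable summarizing one frame.

    Sums the squared standardized band values (chi-square with n degrees of
    freedom if the noise bands are independent Gaussians), then applies the
    normal approximation (X - n) / sqrt(2n). The approximation is rough when
    the number of bands n is small.
    """
    n = len(spectrum)
    chi2 = sum(
        ((x - m) / s) ** 2
        for x, m, s in zip(spectrum, noise_means, noise_stds)
    )
    return (chi2 - n) / math.sqrt(2 * n)


def is_speech(spectrum, noise_means, noise_stds, critical_value=2.33):
    """Reject the hypothesis 'frame is noise' when the statistic is too large."""
    return frame_statistic(spectrum, noise_means, noise_stds) > critical_value


# Hypothetical noise statistics for four frequency bands.
noise_means = [1.0, 1.0, 1.0, 1.0]
noise_stds = [0.5, 0.5, 0.5, 0.5]

quiet_frame = [1.0, 1.5, 0.5, 1.0]   # close to the noise statistics
loud_frame = [4.0, 5.0, 1.0, 3.0]    # far from the noise statistics
```

With these values, `is_speech` rejects the noise hypothesis for `loud_frame` and accepts it for `quiet_frame`. Because the decision uses only the current frame and the stored noise statistics, no delayed copy of the signal is required.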
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.