In a speech recognition system, a device commonly known as an “endpoint detector” separates the speech segment(s) of an utterance represented in an input signal from the non-speech segments, i.e., it identifies the “endpoints” of speech. An “endpoint” of speech can be either the beginning of speech after a period of non-speech or the ending of speech before a period of non-speech. An endpoint detector may be implemented in hardware, in software, or in a combination of both. Because endpoint detection generally occurs early in the speech recognition process, the accuracy of the endpoint detector is crucial to the performance of the overall speech recognition system. Accurate endpoint detection facilitates accurate recognition results, while poor endpoint detection often causes poor recognition results.
Some conventional endpoint detectors operate using log energy and/or spectral information as knowledge sources. For example, an endpoint can be identified by comparing the log energy of the input speech signal against a threshold energy level. An end-of-utterance can be identified, for example, if the log energy drops below the threshold level after having exceeded the threshold level for some specified length of time. However, this approach does not take into consideration many of the characteristics of human speech. As a result, it yields only a rough approximation, and purely energy-based endpoint detectors are not as accurate as desired.
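The threshold rule described above can be illustrated with a minimal sketch. It is not taken from any particular detector; the function names, the non-overlapping framing, the decibel scale, and the parameters `frame_len`, `threshold_db`, and `min_speech_frames` are all assumptions chosen for illustration.

```python
import math


def frame_log_energy(samples, frame_len):
    """Return the log energy (in dB) of each non-overlapping frame.

    Assumption: samples are floats and frames do not overlap.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        # A small floor keeps log10 defined for all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def find_end_of_utterance(log_energies, threshold_db, min_speech_frames):
    """Return the index of the first frame whose log energy falls to or
    below the threshold after the energy has exceeded the threshold for
    at least min_speech_frames consecutive frames, or None if no such
    frame exists.
    """
    run = 0  # consecutive frames above the threshold so far
    for i, energy in enumerate(log_energies):
        if energy > threshold_db:
            run += 1
        else:
            if run >= min_speech_frames:
                return i
            run = 0  # too short to count as speech; start over
    return None


# Hypothetical signal: 3 silent frames, 5 loud frames, 3 silent frames.
samples = [0.0] * 12 + [0.5] * 20 + [0.0] * 12
energies = frame_log_energy(samples, frame_len=4)
end_frame = find_end_of_utterance(energies, threshold_db=-30.0,
                                  min_speech_frames=3)
```

Because the decision depends only on frame energy, any sufficiently loud noise burst is treated as speech and any quiet speech sound is treated as silence, which is the limitation noted above.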
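One problem associated with endpoint detection is distinguishing between a mid-utterance pause and the end of an utterance. In making this determination, there is generally an inherent trade-off between achieving short latency and detecting the entire utterance. Waiting longer after speech stops before declaring the end of the utterance makes it less likely that the utterance is cut off at a pause, but it also delays the system's response.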
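One way to see this trade-off is with a simple silence timeout, sketched below. This is an illustrative assumption, not a method described in the text: the function name, the per-frame speech flags, and the `max_pause_frames` parameter are hypothetical.

```python
def end_of_utterance_with_timeout(is_speech, max_pause_frames):
    """Return the frame index at which the end of the utterance is
    declared, or None if it is never declared.

    is_speech: sequence of per-frame booleans (True = speech).
    max_pause_frames: number of consecutive non-speech frames, after
        speech has been seen, required to declare the end.
    """
    seen_speech = False
    silence_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            silence_run = 0
        elif seen_speech:
            silence_run += 1
            if silence_run >= max_pause_frames:
                return i
    return None


# Hypothetical utterance: speech, a 3-frame pause, more speech, silence.
flags = [True] * 5 + [False] * 3 + [True] * 4 + [False] * 10

# A short timeout responds quickly but fires inside the pause,
# so the second stretch of speech is lost.
short_timeout_end = end_of_utterance_with_timeout(flags, max_pause_frames=2)

# A longer timeout rides through the pause and captures the whole
# utterance, but declares the end several frames after speech stops.
long_timeout_end = end_of_utterance_with_timeout(flags, max_pause_frames=5)
```

In this sketch the timeout value directly sets the latency, so no single setting avoids both truncating utterances at pauses and responding slowly at true endings.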