The accuracy of existing speech recognition systems is often adversely impacted by an inability to obtain a complete speech signal for processing. For example, imperfect synchronization between a user's actual speech signal and the times at which the user commands the speech recognition system to listen for the speech signal can cause an incomplete speech signal to be provided for processing. For instance, a user may begin speaking before he provides the command to process his speech (e.g., by pressing a button), or he may terminate the processing command before he is finished uttering the speech signal to be processed (e.g., by releasing or pressing a button). If the speech recognition system does not “hear” the user's entire utterance, the results that the speech recognition system subsequently produces will not be as accurate as otherwise possible. In open-microphone applications, audio gaps between two utterances (e.g., due to latency or other factors) can also produce incomplete results if an utterance is started during the audio gap.
Poor endpointing (i.e., inaccurate determination of the start and the end of speech in an audio signal) can also cause incomplete or inaccurate results to be produced. Good endpointing increases the accuracy of speech recognition results and reduces speech recognition system response time by eliminating background noise, silence, and other non-speech sounds (e.g., breathing, coughing, and the like) from the audio signal prior to processing. By contrast, poor endpointing may produce less accurate speech recognition results or may require the consumption of additional computational resources in order to process a speech signal containing extraneous information. Efficient and reliable endpointing is therefore extremely important in speech recognition applications.
Conventional endpointing methods typically use short-time energy or spectral energy features (possibly augmented with other features such as zero-crossing rate, pitch, or duration information) in order to determine the start and the end of speech in a given audio signal. However, such features become less reliable under conditions of actual use (e.g., noisy real-world situations), and some users elect to disable endpointing capabilities in such situations because they contribute more to recognition error than to recognition accuracy.
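By way of illustration only, the following minimal sketch shows one possible form of a short-time energy endpointer of the conventional kind described above, and how a fixed energy threshold can fail when background noise is present. The function names, the 20 ms frame length at an assumed 8 kHz sampling rate, the threshold value, and the synthetic test signals are all assumptions made for this example and are not taken from any particular system.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def find_endpoints(samples, frame_len=160, energy_threshold=0.01):
    """Return (start_sample, end_sample) spanning the first through last
    frame whose energy exceeds the threshold, or None if no frame does.
    frame_len and energy_threshold are assumed, illustrative values."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > energy_threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Synthetic "utterance": 0.1 s silence, 0.2 s of a 200 Hz tone, 0.1 s silence,
# at an assumed sampling rate of 8 kHz.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(1600)]
clean = [0.0] * 800 + tone + [0.0] * 800

# The same signal with a steady, deterministic stand-in for background noise.
noisy = [x + (0.2 if n % 2 == 0 else -0.2) for n, x in enumerate(clean)]

clean_endpoints = find_endpoints(clean)   # (800, 2400)
noisy_endpoints = find_endpoints(noisy)   # (0, 3200)
```

In the clean case the sketch isolates the tone. In the noisy case the noise floor alone exceeds the fixed threshold, so the entire signal is marked as speech and no non-speech audio is removed, which is the kind of unreliability noted above.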
Thus, there is a need in the art for a method and apparatus for obtaining complete speech signals for speech recognition applications.