The area of speech recognition has witnessed tremendous growth recently. It is now pervasive in everyday devices like cellular phones, tablets, and laptops, as well as in newly introduced gadgets including smartwatches, smartglasses, internet-ready appliances, and wearable computers. Although speech technology has improved considerably, it is still plagued with several problems, especially problems related to background noise. It is known that a speech recognition engine's accuracy increases significantly when the recognizer is presented only with the speech segment that corresponds to the desired spoken text and the neighboring noise segments are removed via a process called utterance detection. Apart from improving accuracy, robust utterance detection also reduces speech processing time, which is advantageous for devices and networks with resource constraints.
Several attempts to reliably detect utterances have been proposed. Most of them rely on using signal energy along with statistical thresholds. Other, less commonly used approaches incorporate alternative features including zero-crossing rate, periodicity and jitter, pitch stability, spatial signal correlation, spectral entropy, cepstral features, LPC residual coefficients, modulation features, alternative energy measures, temporal power envelope, spectral divergence, and other time-frequency parameters. Some references include (a) J.-C. Junqua, B. Mak, and B. Reaves, “A robust algorithm for word boundary detection in the presence of noise,” IEEE Trans. on Speech and Audio Processing, 2(3):406-412, July 1994; (b) L. Lamel, L. Rabiner, A. Rosenberg, and J. Wilpon, “An improved endpoint detector for isolated word recognition,” IEEE ASSP Mag., 29:777-785, 1981; (c) G. Evangelopoulos and P. Maragos, “Speech event detection using multiband modulation energy,” in Ninth European Conference on Speech Communication and Technology, 2005; (d) J. Wu and X. Zhang, “Maximum margin clustering based statistical VAD with multiple observation compound feature,” Signal Processing Letters, IEEE, no. 99, pp. 1-1, 2011; and (e) K. Mehta, C.-K. Pham, and E.-S. Chng, “Linear dynamic models for voice activity detection,” in Conference of the International Speech Communication Association, 2011.
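As a rough illustration of the energy-plus-statistical-threshold family of approaches mentioned above, the following minimal Python sketch labels each frame as speech or non-speech. It is not taken from any of the cited references. The function names, the assumption that the first few frames contain only background noise, and the parameters noise_frames, k, and min_margin_db are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def frame_energies(samples, frame_len):
    """Return the log energy (in dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small constant avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def speech_flags(energies, noise_frames=5, k=3.0, min_margin_db=3.0):
    """Flag frames whose energy exceeds a noise-derived threshold.

    Assumption (hypothetical): the first `noise_frames` frames contain
    only background noise. The threshold is the noise mean plus the
    larger of k standard deviations and a fixed minimum margin.
    """
    noise = energies[:noise_frames]
    threshold = mean(noise) + max(k * pstdev(noise), min_margin_db)
    return [e > threshold for e in energies]
```

For example, speech_flags(frame_energies(samples, 160)) would yield one Boolean per 160-sample frame. A fixed threshold of this kind tends to fail when the noise level changes over time, which is one reason the alternative features listed above have been explored.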
Several other systems have attempted to address the utterance detection problem. In U.S. Pat. No. 6,980,950, “Automatic utterance detector with high noise immunity,” the inventors propose an utterance detector for speech recognition consisting of two components. The first component makes a speech/non-speech decision for each incoming speech frame using power spectral analysis. The second component makes utterance detection decisions using a state machine that describes the detection process in terms of the speech/non-speech decisions made by the first component. In U.S. Pat. No. 6,496,799, “End-of-utterance determination for voice processing,” the inventors propose a method that uses semantic and/or prosodic properties of the spoken input to determine whether or not the user has effectively completed speaking. In U.S. Pat. No. 4,032,710, “Word boundary detector for speech recognition equipment,” an apparatus is described that receives acoustic input including words spoken in isolation and finds the word boundary instants at which a word begins and ends.
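The general idea of driving a state machine from per-frame speech/non-speech decisions can be sketched as follows. This is a generic illustration only and does not reproduce the method of any of the patents cited above. The state names and the parameters min_speech (frames needed to confirm an onset) and max_gap (pause length tolerated inside an utterance) are assumptions made for this example.

```python
def detect_utterances(flags, min_speech=3, max_gap=2):
    """Return (start, end) frame index pairs, with end exclusive.

    `flags` holds one speech/non-speech Boolean per frame. The state
    names and both parameters are hypothetical choices for this sketch.
    """
    utterances = []
    state = "silence"
    start = run = gap = 0
    for i, is_speech in enumerate(flags):
        if state == "silence" and is_speech:
            state, start, run = "onset", i, 0
        if state == "onset":
            if is_speech:
                run += 1
                if run >= min_speech:
                    state = "speech"   # onset confirmed
            else:
                state = "silence"      # too short: treat as a noise burst
        elif state == "speech":
            if not is_speech:
                state, gap = "hangover", 1
        elif state == "hangover":
            if is_speech:
                state = "speech"       # short pause inside the utterance
            else:
                gap += 1
        if state == "hangover" and gap > max_gap:
            # The utterance ended at the first frame of the trailing gap.
            utterances.append((start, i - gap + 1))
            state = "silence"
    if state in ("speech", "hangover"):
        # Input ended while an utterance was still open.
        end = len(flags) - (gap if state == "hangover" else 0)
        utterances.append((start, end))
    return utterances
```

In this sketch an isolated speech frame is discarded as a noise burst, and a one-frame pause does not split an utterance. Choosing min_speech and max_gap trades false triggers against clipped word beginnings and endings.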
Unfortunately, existing approaches to utterance detection continue to fall short of the desired reliability, robustness, and accuracy.
To overcome some of the problems associated with automatic utterance detection, systems have been proposed that implement intelligent ways to design user interfaces. For example, mobile interfaces that have a “push button to speak” followed by a “push button to stop” have recently been implemented. However, these solutions experience problems when a user fails to synchronize their speaking with the push buttons and speaks before pushing the start button or continues to speak after pressing the stop button. In any event, there are several applications wherein this type of user interface needs to be relaxed. For instance, some applications have a button to start speaking but no button to stop. Other applications implement an always-listening mode that lacks manual buttons to start and stop the audio.
Having robust utterance detection for applications is important. Until now, a solution that provides a robust utterance detection method for any type of application has eluded those skilled in the art.