Most prior art automatic speech recognition (ASR) systems have little difficulty in generating recognition hypotheses for long segments of a continuously recorded audio signal containing speech. When the signal is recorded in a controlled, quiet environment, the hypotheses generated by decoding long segments of the audio signal are almost as good as those generated by selectively decoding only those segments that contain speech. This is mainly because, when the audio signal is acoustically clean, silence is easily recognized as such and is clearly distinguishable from speech. However, when the signal is noisy, known ASR systems have difficulty discerning whether a given segment of the audio signal is speech or noise. Often, spurious speech is recognized in noisy segments where there is no speech at all.
Speech Segmentation
This problem can be avoided if the beginning and ending boundaries of segments of the audio signal containing speech are identified prior to recognition, and recognition is performed only within these boundaries. The process of identifying these boundaries is commonly referred to as endpoint detection, or speech segmentation. A number of speech segmentation methods are known. These can be roughly categorized as rule-based methods and classifier-based methods.
Rule-Based Segmentation
Rule-based methods use heuristically derived rules relating to some measurable properties of the audio signal to discriminate between speech and non-speech segments. The most commonly used property is the variation in the energy of the signal. Rules based on energy are usually supplemented by other information, such as the durations of speech and non-speech events, see Lamel, L., Rabiner, L. R., Rosenberg, A., and Wilpon, J., “An improved endpoint detector for isolated word recognition,” IEEE ASSP magazine, Vol. 29, 777-785, 1981; zero crossings, see Rabiner, L. R. and Sambur, M. R., “An algorithm for determining the endpoints of isolated utterances,” Bell Syst. Tech. J., Vol. 54, No. 2, 297-315, 1975; and pitch, see Hamada, M., Takizawa, Y., and Norimatsu, T., “A noise-robust speech recognition system,” Proceedings of the International conference on speech and language processing ICSLP90, pp. 893-896, 1990.
Other notable methods in this category use time-frequency information to locate segments of the signal that can be reliably tagged and then expanded to adjacent segments, Junqua, J.-C., Mak, B., and Reaves, B., “A robust algorithm for word boundary detection in the presence of noise,” IEEE trans. on Speech and Audio Proc., Vol. 2, No. 3, 406-412, 1994.
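By way of illustration only, a rule of the kind described above, combining an energy threshold with a minimum-duration rule, can be sketched as follows. The function names, the frame length of 160 samples, the threshold of -30 dB, and the minimum run of three frames are assumptions chosen for the example. They are not taken from any of the cited references.

```python
import math


def frame_energies(samples, frame_len):
    """Return the log energy (in dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on digitally silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_endpoints(samples, frame_len=160, threshold_db=-30.0,
                     min_speech_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive.

    Rule 1 (energy): a frame is a speech candidate if its log energy
    exceeds threshold_db.
    Rule 2 (duration): a run of candidate frames is kept only if it lasts
    at least min_speech_frames frames.
    Both parameter values are illustrative assumptions.
    """
    segments = []
    run_start = None
    energies = frame_energies(samples, frame_len)
    # The trailing -inf sentinel closes a run that reaches the last frame.
    for i, energy in enumerate(energies + [float("-inf")]):
        if energy > threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_speech_frames:
                segments.append((run_start, i))
            run_start = None
    return segments
```

Both the threshold and the minimum duration in this sketch are tied to one assumed recording level, so they would have to be chosen again for a signal recorded under different conditions.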
Classifier-Based Segmentation
Classifier-based methods model speech and non-speech events as separate classes and treat the problem of speech segmentation as one of classification. The distributions of classes may be modeled by static distributions, such as Gaussian mixtures, Hain, T., and Woodland, P. C., “Segmentation and classification of broadcast news audio,” Proceedings of the International conference on speech and language processing ICSLP98, pp. 2727-2730, 1998, or the models can use dynamic structures such as hidden Markov models, Acero, A., Crespo, C., De la Torre, C., and Torrecilla, J. C., “Robust HMM-based endpoint detector,” Proceedings of Eurospeech'93, pp. 1551-1554, 1993. More sophisticated versions use the speech recognizer itself as an endpoint detector.
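As a minimal sketch of the static-distribution case, each class can be modeled by a single one-dimensional Gaussian over a per-frame feature such as log energy, with each frame assigned to the class of higher likelihood. The helper names and the use of a single Gaussian on a single feature are simplifying assumptions made for the example. The cited systems use Gaussian mixtures over multi-dimensional features.

```python
import math


def fit_gaussian(values):
    """Estimate (mean, variance) of a 1-D Gaussian from training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # A variance floor keeps the log likelihood finite for constant data.
    return mean, max(var, 1e-6)


def log_likelihood(x, mean, var):
    """Log density of x under a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classify_frames(features, speech_model, nonspeech_model):
    """Label each frame True (speech) or False (non-speech).

    Each model is a (mean, variance) pair trained beforehand, which is the
    a priori information the classifier stores about each class.
    """
    return [
        log_likelihood(x, *speech_model) > log_likelihood(x, *nonspeech_model)
        for x in features
    ]
```

In this sketch the models would be trained once on labeled frames, for example with `fit_gaussian(speech_training_values)`, and then applied unchanged to new frames.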
Generally, these methods use a priori information about the signal, as stored by the classifier, for endpointing. Hence, these methods are not well-suited for real-time implementations. Some endpointing methods do not clearly belong to either of the two categories, e.g., some methods use only the local variations in the statistical properties of the incoming signal to detect endpoints, Siegler, M., Jain, U., Raj, B., and Stern, R. M., “Automatic segmentation, classification and clustering of broadcast news audio,” Proceedings of the DARPA speech recognition workshop February 1997, pp. 97-99, 1997.
Rule-based segmentation has two main problems. First, the rules are specific to the feature set used for endpoint detection, and new rules must be generated for every new feature considered. Due to this problem, only a small set of features for which rules are easily derived is commonly used. Second, the parameters of the applied rules must be fine-tuned to the specific acoustic conditions of the signal, and do not easily generalize to other recording conditions.
Classifier-based segmenters, on the other hand, use feature representations of the entire spectrum of the signal for endpoint detection. Because classifier-based methods use more information, they can be expected to perform better than rule-based segmenters. However, they also have problems. Classifier-based segmenters are specific to the kind of recording environment for which they are trained. For example, classifiers trained on clean speech perform poorly on noisy speech, and vice versa. Therefore, classifiers must be adapted to each specific recording environment, and thus are not well suited for arbitrary recording conditions.
Because feature representations usually have many dimensions, typically 12-40 dimensions, adaptation of classifier parameters requires relatively large amounts of data. Even then, large improvements in speech and non-speech segmentation are not always observed, see Hain et al., above.
Moreover, when adaptation is to be performed, the segmentation process becomes slower and more complex. This can increase the time lag or latency between the time at which endpoints occur and the time at which they are detected, which may affect real-time implementations. When classes are modeled by dynamic structures such as HMMs, the decoding strategies used can introduce further latencies, e.g., see Viterbi, A. J., “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. on Information theory, 260-269, 1967.
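The source of the decoding latency can be seen in a minimal two-state sketch of Viterbi decoding, given here for illustration only. The state labels (0 for non-speech, 1 for speech), the function name, and the transition probabilities of 0.9 and 0.1 are assumptions made for the example. The per-frame log likelihoods are taken as given inputs.

```python
import math


def viterbi_speech_nonspeech(log_likes,
                             log_stay=math.log(0.9),
                             log_switch=math.log(0.1)):
    """Return the most likely 0/1 state sequence for a two-state HMM.

    log_likes is a list of (loglike_nonspeech, loglike_speech) pairs, one
    per frame. Both states are assumed equally likely at the first frame.
    """
    if not log_likes:
        return []
    scores = list(log_likes[0])
    backpointers = []
    for frame in log_likes[1:]:
        new_scores, pointers = [], []
        for state in (0, 1):
            stay = scores[state] + log_stay
            switch = scores[1 - state] + log_switch
            if stay >= switch:
                new_scores.append(stay + frame[state])
                pointers.append(state)
            else:
                new_scores.append(switch + frame[state])
                pointers.append(1 - state)
        scores = new_scores
        backpointers.append(pointers)
    # The backtrace starts from the final frame, so the label of an early
    # frame is not fixed until later frames have been observed.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because the backtrace in this sketch runs over the whole input, a practical real-time system would have to bound the look-ahead in some way, and whatever look-ahead remains adds directly to the endpointing latency.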
Recognizer-based endpoint detection involves even greater latency because a single pass of recognition rarely results in good segmentation and must be refined by additional passes after adapting the acoustic models used by the recognizer. The problems of high dimensionality and higher latency make classifier-based segmentation less effective for most real-time implementations. Consequently, classifier-based segmentation is mainly used in off-line or batch-mode implementations.
Therefore, there is a need for a speech segmentation method that can be applied, in batch-mode and real-time, to a continuous audio signal recorded under varying acoustic conditions.