The present invention relates generally to the field of automatic speech recognition, and more particularly to a method and apparatus for locating speech within a speech signal (i.e., “endpoint detection”).
When performing automatic speech recognition (ASR) on an input signal, it must be assumed that the signal may contain not only speech, but also periods of silence and/or background noise. The detection of the presence of speech embedded in a signal which may also contain various types of non-speech events such as background noise is referred to as “endpoint detection” (or, alternatively, speech detection or voice activity detection). In particular, if both the beginning point and the ending point of the actual speech (jointly referred to as the speech “endpoints”) can be determined, the ASR process may be performed more efficiently and more accurately. For purposes of continuous-time ASR, endpoint detection must be correspondingly performed as a continuous-time process which necessitates a relatively short time delay.
On the other hand, batch-mode endpoint detection is a one-time process which may be advantageously used, for example, on recorded data, and has been advantageously applied to the problem of speaker verification. One approach to batch-mode endpoint detection is described in “A Matched Filter Approach to Endpoint Detection for Robust Speaker Verification,” by Q. Li et al., IEEE Workshop of Automatic Identification, October 1999.
As is well known to those skilled in the art, accurate endpoint detection is crucial to the ASR process because it can dramatically affect a system's performance in terms of recognition accuracy and speed for a number of reasons. First, cepstral mean subtraction (CMS), a popular algorithm used in many robust speech recognition systems and fully familiar to those of ordinary skill in the art, needs an accurate determination of the speech endpoints to ensure that its computation of mean values is accurate. Second, if silence frames (i.e., frames which do not contain any speech) can be successfully removed prior to performing speech recognition, the accumulated utterance likelihood scores will be focused exclusively on the speech portion of an utterance and not on both noise and speech. For each of these reasons, a more accurate endpoint detection has the potential to significantly increase the recognition accuracy.
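The dependence of CMS on the endpoints can be illustrated with a minimal sketch. It assumes each frame is a plain list of cepstral coefficients and that the endpoints are given as a half-open frame range; the function name and data layout are hypothetical and are not taken from any particular recognizer.

```python
def cepstral_mean_subtraction(frames, start, end):
    """Subtract the per-coefficient mean computed over frames[start:end] only.

    frames: list of equal-length lists of cepstral coefficients (assumed layout).
    start, end: detected speech endpoints as a half-open frame range.
    """
    speech = frames[start:end]
    if not speech:
        raise ValueError("no speech frames between the endpoints")
    dim = len(speech[0])
    # The mean is taken over speech frames only, so misplaced endpoints
    # pull non-speech frames into the mean and bias every coefficient.
    mean = [sum(frame[k] for frame in speech) / len(speech) for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in speech]
```

If the range is widened to include the surrounding non-speech frames, the computed mean shifts accordingly, which is the inaccuracy the paragraph above describes.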
In addition, it is quite difficult to model noise and silence accurately. Although such modeling has been attempted in many prior art speech recognition systems, this inherent difficulty can lead not only to less accurate recognition performance, but to quite complex system implementations as well. The need to model noise and silence can be advantageously eliminated by fully removing such frames (i.e., portions of the signal) in advance. Moreover, one can significantly reduce the required computation time by removing these non-speech frames prior to processing. This latter advantage can be crucial to the performance of embedded ASR systems, such as, for example, those which might be found in wireless phones, because the processing power of such systems is often quite limited.
For these reasons, the ability to accurately detect the speech endpoints within a signal can be invaluable in speech recognition applications. Where speech is contained in a signal which otherwise contains only silence, the endpoint detection problem is quite simple. However, common non-speech events and background noise in real-world signals complicate the endpoint detection problem considerably. For example, the endpoints of the speech are often obscured by various artifacts such as clicks, pops, heavy breathing, or dial tones. Similar types of artifacts and background noise may also be introduced by long-distance telephone transmission systems. In order to determine speech endpoints accurately, speech must be accurately distinguishable from all of these artifacts and background noise.
In recent years, as wireless, hands-free, and IP (Internet packet-based) phones have become increasingly popular, the endpoint detection problem has become even more challenging, since the signal-to-noise ratios (SNR) of these forms of communication devices are often quite a bit lower than the SNRs of traditional telephone lines and handsets. And as pointed out above, the noise can come from the background (such as from an automobile, from room reflection, from street noise, or from other people talking in the background) or from the communication system itself (such as may be introduced by data coding, transmission, and/or Internet packet loss). In each of these adverse acoustic environments, ASR performance, even for systems which work reasonably well in non-adverse acoustic environments (e.g., traditional telephone lines), often degrades dramatically due to unreliable endpoint detection.
Another problem which is related to real-time endpoint detection is real-time energy feature normalization. As is fully familiar to those of ordinary skill in the art, ASR systems typically use speech energy as the “feature” upon which recognition is based. However, this feature is usually normalized such that the largest energy level in a given utterance is close to or slightly below a known constant level (e.g., zero). Although this is a relatively simple task in batch-mode processing, it can be a difficult problem in real-time processing since it is not easy to estimate the maximal energy level in an utterance given only a short time window, especially when the acoustic environment itself is changing.
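The contrast between batch-mode and real-time normalization can be sketched as follows. This is an illustration only: it assumes log-energy values per frame and a target peak level of zero, and the streaming variant uses a naive running-peak estimate with a caller-supplied initial guess, which is an assumption made here and not a method described in this document.

```python
def normalize_energy_batch(log_energies):
    """Batch mode: the true peak is known, so the largest value maps to 0."""
    peak = max(log_energies)
    return [e - peak for e in log_energies]


def normalize_energy_streaming(log_energies, initial_peak):
    """Real-time mode (naive, hypothetical): only past frames are available,
    so each frame is normalized against the largest value seen so far,
    starting from a guessed initial peak."""
    peak = initial_peak
    normalized = []
    for e in log_energies:
        peak = max(peak, e)  # the estimate can only be corrected after the fact
        normalized.append(e - peak)
    return normalized
```

With the input `[-5.0, -1.0, -3.0]`, the batch version yields `[-4.0, 0.0, -2.0]`, whereas the streaming version with an initial guess of `-2.0` yields `[-3.0, 0.0, -2.0]`: the first frame is normalized against a peak estimate that later turns out to be too low.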
Clearly, in continuous-time ASR applications, a lookahead approach to the energy normalization problem is required, but in any event, accurate energy normalization becomes especially difficult in adverse acoustic environments. However, it is well known that real-time energy normalization and real-time endpoint detection are actually quite related problems, since the more accurately the endpoints can be detected, the more accurately energy normalization can be performed.
The problem of endpoint detection has been studied for several decades and many heuristic approaches have been employed for use in various applications. In recent years, however, and especially as ASR has found significantly increased application in hands-free, wireless, IP phone, and other adverse environments, the problem has become more difficult because, as pointed out above, the input speech in these situations is often characterized by a very low SNR. In these situations, therefore, conventional approaches to endpoint detection and energy normalization often fail and the ASR performance often degrades dramatically as a result.
Therefore, an improved method of real-time endpoint detection is needed, particularly for use in these adverse environments. Specifically, it would be highly desirable to devise a method of real-time endpoint detection which (a) detects speech endpoints with a high degree of accuracy and does so at various noise levels; (b) operates with a relatively low computational complexity and a relatively fast response time; and (c) may be realized with a relatively simple implementation.
In accordance with the principles of the present invention, real-time endpoint detection for use in automatic speech recognition is performed by first applying a specified filter to a selected feature of the input signal, and then evaluating the filter output with use of a state transition diagram (i.e., a finite state machine). In accordance with one illustrative embodiment of the invention, the selected feature is the one-dimensional short-term energy in the cepstral feature, and the filter may have been advantageously designed in light of several criteria in order to increase the accuracy and robustness of detection. More particularly, in accordance with the illustrative embodiment, the use of the filter advantageously identifies all possible endpoints, and the application of the state transition diagram makes the final decisions as to where the actual endpoints of the speech are likely to be. Also in accordance with the illustrative embodiment, the state transition diagram advantageously has three states and operates based on a comparison of the filter output values with a pair of thresholds. The endpoints which are detected may then be advantageously applied to the problem of energy normalization of the speech portion of the signal.
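The two-stage structure described above (a filter applied to the energy feature, followed by a three-state machine driven by a pair of thresholds) can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the state names, the simple antisymmetric filter (mean of following frames minus mean of preceding frames), the threshold values, and the `hang` frame count are hypothetical and are not the filter or parameters of the illustrative embodiment.

```python
SILENCE, IN_SPEECH, LEAVING_SPEECH = "silence", "in_speech", "leaving_speech"


def edge_filter(energy, half_width=2):
    """Hypothetical filter: positive on rising energy, negative on falling energy."""
    n = len(energy)
    out = []
    for t in range(n):
        # Indices are clamped at the signal boundaries.
        ahead = sum(energy[min(n - 1, t + i)] for i in range(1, half_width + 1))
        behind = sum(energy[max(0, t - i)] for i in range(1, half_width + 1))
        out.append((ahead - behind) / half_width)
    return out


def detect_endpoints(filter_out, upper, lower, hang=3):
    """Three-state machine comparing the filter output with two thresholds.

    Returns a list of (begin_frame, end_frame) pairs (inclusive indices).
    """
    state, start, leave_at = SILENCE, None, None
    segments = []
    for t, f in enumerate(filter_out):
        if state == SILENCE:
            if f >= upper:            # strong rising edge: candidate beginning
                state, start = IN_SPEECH, t
        elif state == IN_SPEECH:
            if f <= lower:            # strong falling edge: candidate ending
                state, leave_at = LEAVING_SPEECH, t
        else:                         # LEAVING_SPEECH
            if f >= upper:            # energy rose again: same utterance continues
                state = IN_SPEECH
            elif f <= lower:          # still falling: move the candidate ending
                leave_at = t
            elif t - leave_at >= hang:
                segments.append((start, leave_at))
                state = SILENCE
    # Close any segment still open when the input runs out.
    if state == IN_SPEECH:
        segments.append((start, len(filter_out) - 1))
    elif state == LEAVING_SPEECH:
        segments.append((start, leave_at))
    return segments
```

In this sketch the filter marks every candidate edge, while the state machine decides which candidates are accepted as endpoints: a brief dip that is followed by a new rising edge within `hang` frames is treated as part of the same utterance rather than as its end.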