The present invention relates generally to speech recognition. In particular, it relates to speech recognition methods and apparatuses that delimit speech in noisy environments.
The automatic recognition of human speech in arbitrary environments is a difficult task. The problem is yet more difficult when the recognition is to be performed in real time, i.e., when the delay between the end of speech and the system response is no longer than the speaker might expect in a typical human conversation.
One of the key requirements of a real-time speech recognition system is the ability to reliably detect the start and end of speech. While the best way to do this would involve a feedback path from the speech recognizer itself, it is not feasible to do this in real time using current technology. Because feedback is not a viable option, there is a need for methods and apparatus to determine the start and end of speech in a computationally efficient manner.
Endpointing is one technique that delimits the start and end of speech. Endpointing is difficult, however, when speech is acquired over a telephone network because of system noise. Additionally, the variety of modes and environments in which conventional as well as cellular, cordless, and hands-free telecommunications devices are used adds to the challenge.
The key difficulty in any telecommunication system is the background noise of a telephone call. The background noise can be due to any number of phenomena, including cars, crowds, music, and other speakers. Moreover, the intensity of this background noise can be constantly changing and is impossible to predict accurately.
Currently, endpointers in telephone-network real-time speech recognition systems are based primarily on the energy in the received signal, which includes the speech and the background noise. They may also use other statistics derived from the received signal, including zero-crossings (for more information on zero-crossings, see U.S. Pat. No. 5,598,466, issued to David L. Graumann on Jan. 28, 1997) or energy variance (for more information on energy variance, see U.S. Pat. No. 5,323,337, issued to Denis L. Wilson et al. on Jun. 21, 1994). The endpointer statistic is fed to a finite state machine, which signals the start and end of speech on the basis of a number of thresholds and timeouts. An example of how such a state machine operates is given in FIG. 1.
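By way of illustration only, two of the statistics named above, frame energy and zero-crossing count, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the decibel scale, the `floor` parameter, and the representation of a frame as a list of samples are all hypothetical and are not taken from the cited patents.

```python
import math


def frame_energy_db(frame, floor=1e-10):
    """Mean-square energy of one frame of samples, in decibels.

    `floor` is an assumed lower bound that keeps log10 defined for
    an all-zero frame.
    """
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))


def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
```

Either value, computed once per frame, could serve as the endpointer statistic that is passed to the finite state machine.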
FIG. 1 is a flow chart showing the operation of a finite state machine. First, the finite state machine receives an endpointer statistic (step 102). Next, the state machine determines whether the current statistic exceeds a first threshold for a first predetermined amount of time (a first timeout) (step 104). If the determination is negative, steps 102 and 104 are repeated. If the determination is positive, the state machine identifies the beginning of speech (step 106). The state machine then enters the in-speech state (step 108). While in the in-speech state, the state machine determines whether the statistic falls below a second threshold for a second predetermined amount of time (step 110). If the determination is negative, steps 108 and 110 are repeated. If the determination is positive, the state machine enters a tentative silence state (step 112). During the tentative silence state, the state machine determines whether the statistic exceeds the first threshold for the first predetermined amount of time (step 114). If the determination is positive, the state machine returns to the in-speech state (step 108). If the determination is negative, the finite state machine determines whether the statistic has remained below the first threshold for a third predetermined amount of time (step 116). If the determination is negative, steps 112 to 116 are repeated. Finally, if the determination is positive, the state machine identifies the end of speech (step 118). Thus, the speech recognition system performs recognition on only that portion of the input signal between the beginning of speech and the end of speech (i.e., while the state machine is in the in-speech state).
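The flow just described can be sketched in code as follows. This is an illustrative sketch only, not the implementation of any particular system: it assumes the statistic arrives once per frame, that each predetermined amount of time is expressed as a count of consecutive frames, and that a statistic equal to the first threshold counts as not exceeding it. The names `State`, `endpoint`, and all parameters are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    SILENCE = auto()            # before speech (steps 102-104)
    IN_SPEECH = auto()          # steps 108-110
    TENTATIVE_SILENCE = auto()  # steps 112-116
    DONE = auto()               # end of speech identified (step 118)


def endpoint(stats, start_thresh, end_thresh,
             start_frames, end_frames, silence_frames):
    """Return (start_index, end_index) of speech; None where not found.

    start_thresh / end_thresh are the first and second thresholds;
    start_frames, end_frames, silence_frames are the first, second,
    and third timeouts, measured in frames (an assumption).
    """
    state = State.SILENCE
    above = below = quiet = 0   # consecutive-frame counters
    start = end = None
    for i, x in enumerate(stats):
        if state is State.SILENCE:
            above = above + 1 if x > start_thresh else 0
            if above >= start_frames:       # beginning of speech (step 106)
                start, state, below = i, State.IN_SPEECH, 0
        elif state is State.IN_SPEECH:
            below = below + 1 if x < end_thresh else 0
            if below >= end_frames:         # enter tentative silence
                state, above, quiet = State.TENTATIVE_SILENCE, 0, 0
        elif state is State.TENTATIVE_SILENCE:
            if x > start_thresh:
                above, quiet = above + 1, 0
                if above >= start_frames:   # speech resumed
                    state, below = State.IN_SPEECH, 0
            else:
                above, quiet = 0, quiet + 1
                if quiet >= silence_frames:  # end of speech (step 118)
                    end, state = i, State.DONE
                    break
    return start, end
```

For example, `endpoint([0, 0, 5, 5, 5, 5, 1, 1, 0, 0, 0, 0], 3, 2, 2, 2, 3)` returns `(3, 10)`: speech is declared at the second consecutive frame above the first threshold, and the end of speech is declared after three quiet frames in the tentative silence state. The indices mark the frames at which the decisions are made; the sketch does not backdate them to the true onset or offset.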
Typically, the effectiveness of an endpointer decreases as the intensity of the background noise increases. Loud background noise may cause the endpointer to signal a start of speech too soon or delay the detection of the end of speech. The latter condition can be quite damaging to the performance of a real-time speech recognition system. Clearly, the endpointer requires some adaptation to compensate for the background noise. Therefore, it would be desirable to provide an endpointer that pre-processes the input signal in real time so that foreground speech delimitation using a fixed-threshold endpointing method is less susceptible to background noise.