The invention relates to automatic speech recognition (ASR) systems and techniques and, more particularly, to automatic speech recognition systems and techniques that allow listeners to interactively barge in and interrupt the interactive messages of such systems.
Because of the widespread use of echo cancellation in speech recognition systems (see U.S. Pat. No. 4,914,692), most ASR systems now allow users to interrupt a prompt and provide speech input at an earlier time. Instead of waiting for an ASR recorded or synthesized audio prompt to finish, it is very desirable that the audio prompt be disabled once the ASR system recognizes that the user has begun speaking in response to the current audio prompt, since it is annoying and confusing to the user to have the prompt continue. However, it is also annoying to the user if the audio prompt is disabled in response to an inadvertent cough, breath, clearing of one's throat or other non-vocabulary input.
A known ASR system and method in this area is described in U.S. Pat. No. 5,155,760. This known ASR system and method uses an energy detector as part of a speech detector to determine the onset of speech to disable the prompt. This system and method has the drawback of not being immune to inadvertent out-of-vocabulary input and is susceptible to falsely turning off the prompt.
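The weakness of a purely energy-based onset detector can be illustrated with a minimal sketch. This is a generic illustration and not the method of U.S. Pat. No. 5,155,760; the function names, the mean-square energy measure, and the threshold value are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame of samples (assumed measure)."""
    return sum(s * s for s in frame) / len(frame)


def energy_onset(frames: List[Sequence[float]], threshold: float) -> Optional[int]:
    """Return the index of the first frame whose energy exceeds the
    threshold, or None if no frame does.

    The detector looks only at energy, so a cough or a loud breath
    triggers it exactly as a valid vocabulary word would.
    """
    for index, frame in enumerate(frames):
        if frame_energy(frame) > threshold:
            return index
    return None


# Hypothetical data: one quiet frame followed by one loud frame.
quiet = [0.0, 0.01, -0.01]
loud = [0.5, -0.6, 0.4]  # could equally be speech or a cough
onset = energy_onset([quiet, loud], threshold=0.05)
```

Because the decision depends only on signal level, nothing in such a detector can distinguish valid speech from contentless sound, which is the false prompt cut-off described above.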
In U.S. Pat. No. 5,956,675, issued to A. Setlur and R. Sukkar, an ASR method was described for smart barge-in detection in the context of connected word recognition. That patent described a method and apparatus for detecting barge-in using a system based on the beam search framework. Barge-in was declared as soon as all viable speech recognition paths in the decoding network had a word other than silence or garbage associated with them. The method operated at the word level and shut off the prompt after the first content word was detected (a contentless word is, for example, silence, coughing or clearing of the throat). While the method described in U.S. Pat. No. 5,956,675 works well for connected digits and short words and is immune to inadvertent out-of-vocabulary speech, it may be impractical for longer duration words, since it would take much longer for the prompt to be turned off.
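The word-level criterion just described can be sketched as follows. This is only an illustration of the stated rule, not the implementation of U.S. Pat. No. 5,956,675; the filler labels `<sil>` and `<garbage>` and the representation of each viable path as a list of word labels are assumptions.

```python
from typing import Iterable, Sequence

# Hypothetical labels for contentless decoder outputs.
FILLERS = {"<sil>", "<garbage>"}


def word_level_barge_in(viable_paths: Iterable[Sequence[str]]) -> bool:
    """Declare barge-in only when every viable path contains at least
    one word that is neither silence nor garbage."""
    paths = list(viable_paths)
    if not paths:
        return False  # nothing decoded yet, so nothing to declare
    return all(any(word not in FILLERS for word in path) for path in paths)


# All surviving hypotheses contain a content word: barge-in is declared.
all_content = word_level_barge_in([["<sil>", "five"], ["<sil>", "nine"]])
# One hypothesis still explains the audio as garbage: no barge-in yet.
one_filler = word_level_barge_in([["<sil>", "five"], ["<sil>", "<garbage>"]])
```

Since a word label only appears on a path once the whole word has been decoded, the decision for a long word cannot be made until that word is nearly complete, which is the latency drawback noted above.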
Hence there is a need for "smart" barge-in detection for more general tasks wherein the ASR system detects the onset of valid speech input before disabling the audio prompt, yet is "smart" enough to ignore contentless sound energy.
Briefly stated, the aforementioned problems are overcome and a technological advance is made by solving the problem of early determination of the onset of valid spoken input by examining sub-word units in a decoding tree. The present invention lends itself well to a wider range of speech recognition tasks since it operates at the sub-word level and does not suffer from the drawback mentioned above of not working effectively on longer duration words. Additionally, the present invention is more efficient in CPU utilization compared with previous systems, since it examines only the best scoring path instead of all viable paths of the decoding network.
In accordance with one embodiment of the invention, the aforementioned problem is solved by providing an ASR method which has the steps of:
a. determining if a speech utterance has started, and if an utterance has not started then obtaining the next frame and re-running step a, otherwise continuing to step b;
b. obtaining a speech frame of the speech utterance that represents a frame period that is next in time;
c. extracting features from the speech frame;
d. computing likelihood scores for all active sub-word models for the present frame of speech;
e. performing dynamic programming to build a speech recognition network of likely sub-word paths;
f. performing a beam search using the speech recognition network;
g. updating a decoding tree of the speech utterance after the beam search;
h. finding the best scoring sub-word path of said likely sub-word paths and determining a number of sub-words in said best scoring sub-word path;
i. determining if said best scoring sub-word path has a sub-word length greater than a minimum number of sub-words, and if it is greater proceeding to step j, otherwise returning to step b;
j. determining if the recorded root is a sub-string of the best path, and if the recorded root is not a sub-string of the best path recording the best path as the recorded root and returning to step b, otherwise proceeding to step k;
k. determining if the recorded root has remained stable for a threshold number of additional sub-words, and if the recorded root has not remained stable for the threshold number returning to step b, otherwise proceeding to step l;
l. declaring barge-in;
m. disabling any prompt that is playing; and
n. backtracking through the best scoring path to obtain a string having a greatest likelihood of corresponding to the utterance, and outputting the string.
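Steps h through l above can be sketched as a small per-frame decision routine. This is a minimal sketch under stated assumptions, not the claimed implementation: the class and function names are hypothetical, the best scoring path is assumed to arrive from the decoder as a sequence of sub-word labels, "sub-string" is read as a contiguous run of sub-words, the number of additional sub-words is taken as the length difference between the best path and the recorded root, and the threshold values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Path = Tuple[str, ...]


def is_substring(root: Path, path: Path) -> bool:
    """True if root occurs as a contiguous run of sub-words in path."""
    n = len(root)
    return any(path[i:i + n] == root for i in range(len(path) - n + 1))


@dataclass
class BargeInDetector:
    min_subwords: int = 2     # placeholder for the minimum in step i
    stable_subwords: int = 2  # placeholder for the threshold in step k
    recorded_root: Optional[Path] = None

    def update(self, best_path: Sequence[str]) -> bool:
        """Process one frame's best path; return True to declare barge-in."""
        path = tuple(best_path)
        # Step i: the path must be longer than the minimum length.
        if len(path) <= self.min_subwords:
            return False
        # Step j: if the root is missing or no longer contained in the
        # best path, record the best path as the new root.
        if self.recorded_root is None or not is_substring(self.recorded_root, path):
            self.recorded_root = path
            return False
        # Step k: the root is stable once enough sub-words have been
        # added to the best path without displacing it (assumed measure).
        additional = len(path) - len(self.recorded_root)
        return additional >= self.stable_subwords  # True corresponds to step l


# Hypothetical sequence of best paths over successive frames.
detector = BargeInDetector(min_subwords=2, stable_subwords=2)
decisions = [
    detector.update(p)
    for p in (
        ["k"],
        ["k", "ao"],
        ["k", "ao", "l"],             # root recorded here
        ["k", "ao", "l", "k"],        # one additional sub-word
        ["k", "ao", "l", "k", "aa"],  # two additional sub-words
    )
]
```

In this sketch the prompt would be disabled (step m) by the caller as soon as `update` returns True, with backtracking (step n) continuing afterwards on the decoder side.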
This embodiment can further have, in parallel with step i, a second branch of steps including the steps of: determining if a number of sub-words in said best path exceeds a maximum number of sub-words, and if said maximum number has been exceeded proceeding to step l, and if said maximum number has not been exceeded returning to step b. Alternatively, this embodiment can further have, in parallel with step i, a third branch of steps including the step of determining if a speech endpoint has been reached, and if said speech endpoint has been reached then beginning backtracking to obtain the recognized string, declaring barge-in and proceeding to step m, and if said speech endpoint has not been reached then proceeding to step b. Yet a further embodiment can have both the second and third branches of steps in parallel with step i.
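The way the parallel branches combine with the root-stability test can be sketched as a single decision function. This is an illustrative sketch only: the function name, the returned reason strings, and the assumption that the outcome of steps i through k is supplied as a boolean are all hypothetical.

```python
from typing import Optional


def barge_in_reason(
    root_stable: bool,
    num_subwords: int,
    max_subwords: int,
    endpoint_reached: bool,
) -> Optional[str]:
    """Return why barge-in should be declared on this frame, or None
    to return to step b and process the next frame."""
    if root_stable:
        return "stable_root"   # main branch, steps i through k
    if num_subwords > max_subwords:
        return "max_subwords"  # second branch: best path has grown too long
    if endpoint_reached:
        return "endpoint"      # third branch: utterance has already ended
    return None


# Hypothetical frames: the root never stabilises, but the caps still fire.
too_long = barge_in_reason(False, num_subwords=9, max_subwords=8, endpoint_reached=False)
ended = barge_in_reason(False, num_subwords=3, max_subwords=8, endpoint_reached=True)
keep_going = barge_in_reason(False, num_subwords=3, max_subwords=8, endpoint_reached=False)
```

The second and third branches act as safeguards, so that the prompt is still turned off when the best path keeps changing or when the speaker finishes before the root has been judged stable.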
In another embodiment of the invention, the aforementioned problem is overcome by providing an automatic speech recognition system supporting barge-in that operates on the sub-word level.