The present invention generally relates to telecommunication systems and methods, as well as automatic speech recognition systems. More particularly, the present invention pertains to keyword spotting within automatic speech recognition systems.
Keyword spotting systems that are currently in use may include: phonetic search, garbage models, and large vocabulary continuous speech recognition (LVCSR). Each of these systems has inherent drawbacks which affect the accuracy and performance of the system.
In phonetic search systems, a “phonetic decoder” converts an audio stream into one or more possible sequences of phonemes, which can then be used to identify words. “John says”, for example, can be broken down into the phoneme string “jh aa n s eh z”. The phonetic decoder hypothesizes a phoneme stream for the audio, and this stream is compared to the expected phoneme sequence for a keyword to find a match. Some systems built on this concept have shown reasonable performance; however, they have many disadvantages for use in a real-time application. Running a phonetic decoder before the keyword search means the processing must be done in two stages, which adds considerable complexity. Such a system would work well for retrieval from stored audio, where real-time processing is not required. Another disadvantage is the error rate of phoneme recognition. State-of-the-art speech recognizers, which incorporate complex language models, still produce accuracies only in the range of 70-80%, and the accuracy decreases further for conversational speech. These errors are compounded by errors in the phonetic search itself, degrading keyword spotting accuracy.
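The second stage of such a phonetic search can be illustrated with a minimal Python sketch. It assumes the decoder has already produced a phoneme stream, and it uses a sliding window with Levenshtein distance as the comparison, which is one plausible matching rule rather than the method of any particular system. The function names, the `max_errors` parameter, and the decoded stream (a hypothetical hypothesis for "hello John" with "aa" misrecognized as "ao") are all illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,           # delete a phoneme from a
                            curr[j - 1] + 1,       # insert a phoneme from b
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def spot_keyword(decoded, keyword, max_errors=1):
    """Return start indices of windows within max_errors edits of the keyword."""
    n = len(keyword)
    hits = []
    for start in range(len(decoded) - n + 1):
        window = decoded[start:start + n]
        if edit_distance(window, keyword) <= max_errors:
            hits.append(start)
    return hits


# Hypothetical stage-one output: one phoneme is wrong ("ao" instead of "aa").
decoded = "hh ah l ow jh ao n".split()
keyword = "jh aa n".split()  # expected phoneme sequence for "John"

exact_hits = spot_keyword(decoded, keyword, max_errors=0)
tolerant_hits = spot_keyword(decoded, keyword, max_errors=1)
```

With exact matching, the single decoder error causes the keyword to be missed. Tolerating one edit recovers it, but at the price of a looser match, which is how phoneme recognition errors carry over into the keyword spotting result.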
Another common technique for keyword spotting uses garbage models, which match any audio other than the keyword. A phoneme network is commonly used to decode non-keyword audio into a sequence of phonemes. One simple way to implement this method is to use a speech recognizer conforming to the Speech Recognition Grammar Specification (SRGS) and write a grammar as follows:
$root = $GARBAGE ("keyword1" | "keyword2") $GARBAGE;
Since most speech recognizers use phonetic decoding to implement a $GARBAGE rule, these methods have the same disadvantages as phonetic search, especially from a resource usage standpoint. Another approach to implementing a garbage model is to treat it as a logical hidden Markov model (HMM) state whose emission probability is either defined as a function of all triphone models in the acoustic model or estimated iteratively. Both approaches are ill-suited to real-time operation, because they require either computing a large number of probabilities or making multiple passes over the data.
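The cost of the logical-state approach can be seen in a toy Python sketch. It assumes, purely for illustration, that each triphone state is a one-dimensional Gaussian and that the garbage emission is the average log-likelihood over all of them. The text above does not specify which function is used, so the averaging, the model parameters, and the function names are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for one triphone state."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def garbage_log_likelihood(x, models):
    """Garbage-state emission as the mean log-likelihood over every model.

    Averaging is an assumed choice; the point is that every model must be
    evaluated for every frame.
    """
    scores = [log_gaussian(x, mean, var) for mean, var in models]
    return sum(scores) / len(scores)


# Hypothetical toy acoustic model: (mean, variance) per triphone state.
models = [(-2.0, 1.0), (0.0, 1.0), (2.0, 1.0)]
frame = 0.0

best = max(log_gaussian(frame, mean, var) for mean, var in models)
garbage = garbage_log_likelihood(frame, models)
evaluations_per_frame = len(models)
```

Here the garbage score never exceeds the score of the best-matching model, so a well-matched keyword state can outscore it. A real acoustic model holds thousands of triphone states, and the per-frame evaluation count grows with that number.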
LVCSR systems rely completely on an LVCSR speech recognition engine to provide a word-level transcription of the audio, and then perform a text-based search of the transcription for the keyword. Given the high computational cost of LVCSR engines, this solution is clearly infeasible for real-time keyword spotting. Furthermore, the accuracy of LVCSR systems is usually tied closely to domain knowledge: the system's vocabulary needs either to be rich enough to contain all possible keywords of interest or to be very domain specific. Spotting keywords from multiple languages would mean running multiple recognizers in parallel. A more effective means of increasing the efficacy of these methods is desired to make keyword spotters more pervasive in real-time speech analytics systems.
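For comparison, the text-search step of the LVCSR approach can be sketched in a few lines of Python. The transcript format (a list of word and start-time pairs) and the function name are assumptions made for illustration and are not the output format of any specific engine.

```python
def find_keywords(transcript, keywords):
    """Return (word, start_sec) pairs whose word is one of the keywords."""
    wanted = {k.lower() for k in keywords}
    return [(word, t) for word, t in transcript if word.lower() in wanted]


# Hypothetical word-level transcription with start times in seconds.
transcript = [("please", 0.0), ("cancel", 0.4), ("my", 0.9), ("account", 1.1)]

hits = find_keywords(transcript, ["cancel", "refund"])
```

The search itself is trivial. The expense lies in producing the transcript, and a keyword outside the recognizer's vocabulary can never appear in it, which is why vocabulary coverage matters so much for this approach.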