1. Field of the Invention
The present invention relates to the field of speech recognition and, more particularly, to a technique for training speaker-dependent grammars.
2. Description of the Related Art
Speech-enabled automated response systems have been increasingly utilized for interfacing with customer support systems, software applications, embedded devices, and other such equipment. The vast majority of these systems attempt to convert received speech input into discrete textual words, thereafter performing programmatic actions based on individual ones of these recognized words. These recognition systems typically lack natural language understanding (NLU) and/or phrase-oriented recognition capabilities, largely due to the difficulty and/or expense of implementing such capabilities.
One way to implement NLU and/or phrase-oriented recognition capabilities is through statistical modeling techniques that analyze probabilities of word sequences. That is, speech utterances are converted from speech to text as a string of words, where the string of words can include alternatives for each word in the string. The words can then be processed by a grammar engine to determine the most likely meaning for the originally provided speech utterance. This approach can involve a large number of computationally expensive operations and a vast quantity of processed data. Accordingly, statistically analyzing word strings via a grammar engine can be difficult to perform on resource-restricted devices. Further, even when sufficient computational resources exist, an adequate response time for real-time processing of speech input using statistical modeling techniques can be difficult to achieve and/or can be achieved only by compromising recognition accuracy.
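As a minimal sketch of the statistical approach described above, the following Python fragment scores every combination of per-word alternatives with a toy bigram model and keeps the most probable string. The bigram table, the UNSEEN fallback probability, and the function names are hypothetical values chosen for illustration and are not taken from any particular recognition system. The exhaustive enumeration is deliberate: the number of candidate strings is the product of the number of alternatives per position, which hints at why such processing can be costly on resource-restricted devices.

```python
import itertools
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAM_PROB = {
    ("<s>", "pay"): 0.5,
    ("<s>", "play"): 0.5,
    ("pay", "my"): 0.6,
    ("play", "my"): 0.2,
    ("my", "bill"): 0.7,
    ("my", "bell"): 0.01,
}
UNSEEN = 1e-4  # assumed fallback probability for word pairs missing from the table


def sequence_log_prob(words):
    """Sum the log bigram probabilities over one candidate word string."""
    previous = "<s>"
    total = 0.0
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN))
        previous = word
    return total


def best_sequence(alternatives):
    """Enumerate every combination of alternatives and keep the most probable."""
    candidates = itertools.product(*alternatives)
    return max(candidates, key=sequence_log_prob)


# One list of alternatives per word position, as a recognizer might emit.
alternatives = [["play", "pay"], ["my"], ["bell", "bill"]]
best = best_sequence(alternatives)
print(" ".join(best))
```

A practical engine would typically replace the exhaustive enumeration with a dynamic programming search, but the scoring principle is the same.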
Another approach to NLU and/or phrase-oriented recognition is to perform automatic speech recognition (ASR) on a phrase-by-phrase basis as opposed to a word-by-word basis. This approach can be extremely useful when the grammar of recognizable phrases is context-specific and each context is associated with a quantifiable and relatively small phrase set. As the phrase set grows, however, the recognition performance of this technique can degrade geometrically.
One means by which speech scientists improve recognition grammar performance when performing phrase-based speech conversions is the addition of weights, which can be called grammar option weights. Grammar option weights can be applied to a recognition grammar to favor selected phrases and/or phrase groupings over others depending upon input phrases. When grammar option weights are intelligently assigned to related phrase sets, searches through large grammars of phrases can be conducted in a significantly more efficient manner, thereby resulting in more acceptable performance for larger grammars.
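The effect of grammar option weights can be illustrated with a minimal Python sketch. The GrammarOption class, the rank_options function, the min_weight pruning threshold, and all numeric values below are hypothetical and serve only to show one simple way a weight could favor a phrase and narrow a search. Actual recognition engines may combine weights and acoustic evidence differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrammarOption:
    phrase: str
    weight: float  # hypothetical grammar option weight; larger values favor the phrase


def rank_options(acoustic_scores, options, min_weight=0.0):
    """Rank phrases by acoustic score multiplied by the grammar option weight.

    Options whose weight falls below min_weight are skipped before scoring,
    which is one assumed way that weights could shrink the search space.
    """
    ranked = []
    for option in options:
        if option.weight < min_weight:
            continue  # pruned: never scored
        score = acoustic_scores.get(option.phrase, 0.0) * option.weight
        ranked.append((option.phrase, score))
    # Highest combined score first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


options = [
    GrammarOption("check balance", 2.0),
    GrammarOption("check balances", 1.0),
    GrammarOption("pay bill", 1.0),
]
# Assumed per-phrase acoustic match scores for a single utterance.
acoustic = {"check balance": 0.40, "check balances": 0.45, "pay bill": 0.05}

ranked = rank_options(acoustic, options)
pruned = rank_options(acoustic, options, min_weight=1.5)
print(ranked[0][0])
```

In this sketch the weight of 2.0 lets "check balance" outrank the acoustically closer "check balances", and raising min_weight to 1.5 leaves only the favored phrase to be scored.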
The assigning of weights to options of a phrase-based grammar, however, has proven to be a very difficult task, requiring subjective interpretations, experimentation, and qualitative determinations by speech experts. That is, assigning grammar option weights is largely a manual, time-consuming, expert-intensive process. Attempts to automate the weight assignment process have not yielded acceptable results.
More specifically, the attempts to date have been oriented towards training large speaker-independent grammars using vast sets of training data. Even should such attempts succeed, the approach is inherently flawed. Optimizing a phrase recognition engine for phrases uttered by a particular person or set of people will de-optimize the same engine for phrases uttered by a different population. This is natural, as different populations speak in different fashions. Accordingly, a conventional approach can yield low accuracy, at least when used by a population of vocally diverse users.
Further, such an approach can require that a vast training store be used to establish grammar option weights. The gathering and processing of large data sets can be expensive in terms of time and computing resources. In addition, a grammar automatically tuned using a vast training store will remain fixed until another tuning stage occurs, which can be a long period, so that adjustments are implemented slowly and are drastic when they occur. Moreover, there is no guarantee that a new tuning stage will result in better performance than a previous stage, as tuning is broadly and indiscriminately applied to the grammar.