1. Field of the Invention
The present invention relates to speech recognition and, more particularly, to a speech recognition decoding enhancement that conserves memory and CPU (Central Processing Unit) usage when performing Viterbi based decoding against an N-gram based language model (N>2), where the language model includes embedded grammars.
2. Description of the Related Art
A Statistical Language Model (SLM) is a probabilistic description of the constraints on word order found in a given language. Most current SLMs are based on the N-gram principle, where the probability of the current word is calculated on the basis of the identities of the immediately preceding (N−1) words. Robust speech recognition solutions using a SLM use an N-gram where N is greater than two, meaning that trigrams and higher-order N-grams are generally used. A SLM is not manually written, but is trained from a set of examples that models expected speech, where the set of examples can be referred to as a speech corpus. SLMs can produce results for a broad range of input, which can be useful for speech-to-text word recognition, for free speech dictation, and for processing input that includes unanticipated and extraneous elements, which are common in natural speech. One significant drawback to SLM use is that the speech corpus needed to generate a SLM can be very large. Another is that SLM based systems are often not sufficiently accurate when confronted with utterances outside of the system's intended coverage.
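As an illustration of the N-gram principle described above, the following is a minimal sketch of a trigram model estimated by simple relative frequency counts. The function names (`train_ngram`, `ngram_probability`), the sentence boundary markers, and the toy corpus are hypothetical and chosen only for illustration; practical SLMs are trained on far larger corpora and typically apply smoothing, which this sketch omits.

```python
from collections import Counter

def train_ngram(corpus, n=3):
    """Count n-grams and their (n-1)-word histories in a tokenized corpus."""
    ngrams, histories = Counter(), Counter()
    for sentence in corpus:
        # Pad so that the first real word also has a full (n-1)-word history.
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return ngrams, histories

def ngram_probability(word, history, ngrams, histories):
    """Relative-frequency estimate of P(word | preceding N-1 words)."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0  # unseen history; a real model would back off or smooth
    return ngrams[history + (word,)] / histories[history]

# Toy speech corpus (assumed for illustration only).
corpus = [
    ["call", "john", "smith"],
    ["call", "john", "doe"],
    ["call", "mary", "smith"],
]
ngrams, histories = train_ngram(corpus, n=3)

# Probability of "smith" given the two preceding words "call john".
p_smith = ngram_probability("smith", ["call", "john"], ngrams, histories)
```

In this toy corpus the history "call john" occurs twice and is followed by "smith" once, so the estimate is 0.5. Any word sequence absent from the corpus receives a probability of zero under this unsmoothed estimate, which hints at why large corpora are needed.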
In many instances, speech recognition grammars are used by speech processing engines instead of, or in addition to, using a SLM. Speech recognition grammars can be used to define what may be spoken and properly speech recognized by a speech recognition engine. Simple speech recognition grammars can specify a set of one or more words, which define valid utterances that can be properly speech recognized. More complex speech recognition grammars can specify a set of rules that are written in a grammar specification language, such as Backus-Naur Form (BNF), a Speech Recognition Grammar Specification (SRGS) compliant language, a JAVA Speech Grammar Format (JSGF) compliant language, and the like.
Speech recognition grammars can be extremely accurate when provided with expected utterances, which are defined by an associated grammar. Unlike SLM based speech recognition, relatively little data is needed to train speech recognition grammars. Additionally, speech recognition grammars can advantageously include lists and classes that can be dynamically changed. Unfortunately, speech recognition grammars tend to fail completely for unexpected inputs.
Hybrid models attempt to combine the robustness of SLMs with the semantic advantages of speech recognition grammars. A hybrid speech recognition system (conforming to a hybrid model) can use a general SLM that includes one or more encapsulated grammars, referred to herein as embedded grammars (EGs). Decoded speech can contain words from the SLM and sequences of words from the respective EGs along with attached semantics. The usage of EGs can permit the use of less training data for the SLM and can increase recognition accuracy for inputs defined by the EGs.
SLM based speech recognition systems, which include the hybrid systems, typically use a Viterbi algorithm for finding a most likely string of text given an acoustic signal, such as a speech utterance. The Viterbi algorithm operates on a state machine assumption. That is, there are a finite number of states, which are each represented as a node. Multiple sequences of states, called paths, can lead to any given state. The Viterbi algorithm examines all possible paths leading to a state and keeps only the most likely path.
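The path pruning described above can be sketched as follows. This is a generic, textbook-style Viterbi search over a small hidden Markov model, not the decoder of any particular speech recognition engine; the state names, observation symbols, and probability tables are hypothetical values chosen for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the most likely path that ends in state s so far.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Examine every path leading into s and keep only the most likely.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1])
                 for prev in states),
                key=lambda candidate: candidate[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    # The overall winner is the best of the surviving per-state paths.
    return max(best.values(), key=lambda candidate: candidate[0])

# Hypothetical two-state model with three observation symbols.
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {
    "S1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.3, "c": 0.6},
}

best_prob, best_path = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
```

Only one surviving path is stored per state at each step, so the working set grows with the number of states rather than with the number of possible paths. The size of the state space is therefore what drives the memory cost of the search.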
When using EGs in a Viterbi search, a single EG is treated as a single word. This means that each EG has to be repeated for each of the different contexts in which it appears in a Viterbi search. The memory consumption for each EG is significant. Whenever N-grams are used where N is greater than two, the complexity of a Viterbi search space that includes EGs can be huge, due to EG repetition for each context within which the EG is used. Because of this complex search space, Viterbi searches can consume tremendous amounts of memory, which in turn can severely degrade speech recognition performance in terms of CPU utilization. Memory is consumed in such quantities that it can be practically impossible to perform the hybrid Viterbi search using embedded speech recognition systems, which typically have limited resources. Even robust computing systems can quickly become overtasked when a search space includes multiple EGs, each having multiple usage contexts.
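To make the repetition cost concrete, the following rough sketch counts the distinct contexts in which an EG appears. It assumes that a "context" is the sequence of (N−1) words preceding the EG and that one full copy of the EG is instantiated per distinct context. The `<NAME>` token, the toy corpus, and the figure of 500 states per EG copy are hypothetical and serve only to show how the count scales.

```python
def count_eg_contexts(corpus, eg_token, n=3):
    """Collect the distinct (n-1)-word histories that precede an EG token."""
    contexts = set()
    for sentence in corpus:
        # Pad so an EG at the start of a sentence still has a full history.
        tokens = ["<s>"] * (n - 1) + sentence
        for i, token in enumerate(tokens):
            if token == eg_token:
                contexts.add(tuple(tokens[i - n + 1:i]))
    return contexts

# Toy corpus in which "<NAME>" stands for an embedded grammar (assumed).
corpus = [
    ["i", "want", "to", "call", "<NAME>"],
    ["please", "call", "<NAME>"],
    ["call", "<NAME>", "now"],
    ["i", "want", "to", "call", "<NAME>", "now"],
    ["dial", "<NAME>"],
]

EG_STATES = 500  # hypothetical number of search states in one EG copy

bigram_contexts = count_eg_contexts(corpus, "<NAME>", n=2)
trigram_contexts = count_eg_contexts(corpus, "<NAME>", n=3)

# One EG copy per distinct context, so state counts scale with contexts.
bigram_states = len(bigram_contexts) * EG_STATES
trigram_states = len(trigram_contexts) * EG_STATES
```

For this corpus the bigram model yields two contexts and the trigram model yields four, so under the stated assumptions the EG would account for 1000 and 2000 search states respectively. Longer histories distinguish more contexts, which is consistent with the growth in search space described above for N greater than two.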