A speech recognition system converts a collection of spoken words (“speech”) into recognized phrases or sentences. A spoken word typically includes one or more phones or phonemes, which are the distinct sounds of the spoken word. Thus, to recognize speech, a speech recognition system must not only model the acoustic nature of the speech, but must also utilize the relationships between the words in the speech. A common way of representing the acoustic nature of the speech is by using an acoustic model such as a hidden Markov model (HMM). A common way of representing the relationships between words in recognizing speech is by using a language model.
Typically, an HMM uses a series of transitions from state to state to model a letter, a word, or a sentence. Each state can correspond to a phone or a triphone (a phone having a left and right context). Each arc of the transitions has an associated probability, which gives the probability of the transition from one state to the next at the end of an observation frame. As such, an unknown speech signal can be represented by an ordered sequence of states with a given probability. Moreover, words in an unknown speech signal can be recognized by using the ordered states of the HMM.
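By way of illustration only, the following Python sketch shows how ordered states with an associated probability can be recovered from a sequence of observation frames using the well-known Viterbi algorithm. The three-state left-to-right model for the word "cat", the symbolic frame labels, and every probability value are assumptions chosen for the example and are not taken from any particular recognizer.

```python
import math

NEG_INF = float("-inf")

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely state sequence."""
    def log(p):
        return math.log(p) if p > 0 else NEG_INF

    # best[s] holds the best-scoring path that ends in state s after the current frame.
    best = {
        s: (log(start_p.get(s, 0.0)) + log(emit_p[s].get(observations[0], 0.0)), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            emit = log(emit_p[s].get(obs, 0.0))
            # Choose the predecessor whose arc into s gives the highest score.
            score, path = max(
                ((best[prev][0] + log(trans_p[prev].get(s, 0.0)), best[prev][1])
                 for prev in states),
                key=lambda c: c[0],
            )
            new_best[s] = (score + emit, path + [s])
        best = new_best
    return max(best.values(), key=lambda c: c[0])

# Hypothetical model: one state per phone of "cat", each with a self-loop arc.
STATES = ["k", "ae", "t"]
START_P = {"k": 1.0}
TRANS_P = {
    "k": {"k": 0.5, "ae": 0.5},
    "ae": {"ae": 0.5, "t": 0.5},
    "t": {"t": 1.0},
}
EMIT_P = {
    "k": {"K": 0.8, "AE": 0.2},
    "ae": {"AE": 0.8, "T": 0.2},
    "t": {"T": 0.9, "AE": 0.1},
}

# Four observation frames, already quantized to symbolic labels (an assumption).
log_prob, path = viterbi(["K", "K", "AE", "T"], STATES, START_P, TRANS_P, EMIT_P)
print(path, math.exp(log_prob))
```

Log probabilities are used in the sketch so that long frame sequences do not underflow. A practical recognizer would typically score continuous acoustic feature vectors rather than symbolic labels.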
To improve speech recognition accuracy, language models can be used in conjunction with an acoustic model such as the HMM. A language model specifies the probability of different word sequences being spoken. Examples of language models are N-gram probabilistic grammars, formal grammars, and N-grams of word classes. Typically, the models are represented as a phonetic tree structure in which all of the words that are likely to be encountered reside at the ends of branches or at nodes in the tree. Each branch represents a phoneme. All words that share the same first phoneme share the same first branch, all words that share the same first and second phonemes share the same first and second branches, and so on.
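The branch sharing described above can be sketched, for illustration only, as a prefix tree keyed on phonemes. The class name `PhoneNode`, the helper functions, and the five-word lexicon with its pronunciations are hypothetical and serve only to make the structure concrete.

```python
class PhoneNode:
    """One node of a phonetic tree; each outgoing branch is labeled with a phoneme."""
    def __init__(self):
        self.children = {}  # phoneme -> PhoneNode
        self.words = []     # words whose pronunciation ends at this node

def build_phonetic_tree(lexicon):
    """Insert every pronunciation so that shared phoneme prefixes share branches."""
    root = PhoneNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, PhoneNode())
        node.words.append(word)
    return root

def count_nodes(node):
    """Count this node and all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())

# Hypothetical lexicon: word -> phoneme sequence.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "can": ["k", "ae", "n"],
    "key": ["k", "iy"],
    "tea": ["t", "iy"],
}

root = build_phonetic_tree(LEXICON)
total_phones = sum(len(phones) for phones in LEXICON.values())
print(count_nodes(root) - 1, "branches for", total_phones, "phonemes in the lexicon")
```

In this sketch, "cat", "cab", and "can" share the branches for "k" and "ae", so the tree needs fewer branches than the total number of phonemes in the lexicon.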
A common type of phonetic tree is a triphone tree. A triphone is a phoneme having a left phoneme context and a right phoneme context. Thus, a triphone assumes that the way a phoneme is pronounced depends on its neighboring phonemes. To recognize large vocabulary continuous speech, a first word can be recognized using a triphone tree. Typically, to recognize a second word while maintaining a relationship with the first word, the triphone tree related to the first word must be copied. Hence, a disadvantage of the triphone tree copying technique is the redundant copying of triphone trees to recognize words, which slows the speech recognition process and consumes large amounts of memory.
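As a minimal sketch of the triphone definition above, the following Python function expands a phoneme sequence into triphones. The `left-center+right` label format, the `sil` silence symbol used as the default boundary context, and the function name `to_triphones` are assumptions made for the example.

```python
def to_triphones(phones, left="sil", right="sil"):
    """Expand a phoneme sequence into triphone labels of the form left-center+right."""
    padded = [left] + list(phones) + [right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# The word "cat" spoken in isolation (silence on both sides).
isolated = to_triphones(["k", "ae", "t"])

# The same word spoken after a word ending in "ah": only the first triphone changes.
in_context = to_triphones(["k", "ae", "t"], left="ah")

print(isolated)
print(in_context)
```

The second call illustrates that the triphones at a word boundary depend on the neighboring word, since the left context of the first phoneme comes from the preceding word.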
Another technique uses a single triphone tree fast match algorithm to avoid copying triphone trees. This technique is typically used as the forward pass of a forward/backward recognition process. That is, a second pass must be employed to obtain higher speech recognition accuracy. In particular, the first pass serves only as a fast match operation, and its purpose is to keep the likely word endings to guide the second pass. Thus, a disadvantage of this technique is that it must be followed by an additional second pass beam search process that requires additional processing time and resources.