A continuous speech recognition system converts a sequence of continuously spoken words (“speech”) into recognized phrases or sentences. A spoken word typically includes one or more phones or phonemes, which are the distinct sounds of the spoken word. Thus, to recognize continuous speech, a speech recognition system must maintain relationships between the words in the continuous speech. A common way of maintaining relationships between words is to use a word graph. A word graph includes a plurality of word nodes that form a net or lattice. Each word node represents a unit word, and the net or lattice maintains the relationships between the unit words.
FIGS. 1A and 1B are prior art word graphs based on a within word acoustical model and a cross-word acoustical model (i.e., a model in which a word is related to the words before and after it), respectively. FIG. 1A is a prior art word graph 100 based on the within word acoustical model. Referring to FIG. 1A, the prior art word graph 100 includes a plurality of word nodes 101 (“A”) through 106 (“F”) connected by edges of a lattice. The lattice maintains the relationships between words. Each word node represents one pronunciation of a word, which is typically referred to as a phonelist. Word nodes can also represent triphone lists, which are pronunciations having a right and left context. A common use of triphones is with hidden Markov models (HMMs), which are common models for speech recognition.
The word graph 100 can be based on a particular task, e.g., a task for describing the weather. For example, node 101 (A) can represent the word “cloudy,” nodes 102 (B) and 103 (C) can represent the words “very” and “partly,” respectively, and nodes 104 (D), 105 (E), and 106 (F) can represent the words “yesterday,” “today,” and “tomorrow,” respectively. Thus, for example, word graph 100 can be used to recognize continuous speech having the words “very cloudy today.” In this example, word graph 100 maintains the relationships between word node 101 (A) and word nodes 102 (B) through 106 (F) to recognize continuous speech.
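By way of illustration only, the weather-task word graph described above might be sketched as a small directed graph. The edge set below (B and C leading into A, and A leading into D, E, and F) is an assumption inferred from the example phrase, since the figure itself is not reproduced here, and the names `WORDS`, `EDGES`, `build_successors`, and `accepts` are hypothetical.

```python
from collections import defaultdict

# Node labels and words follow the weather example in the text.
WORDS = {
    "A": "cloudy",
    "B": "very",
    "C": "partly",
    "D": "yesterday",
    "E": "today",
    "F": "tomorrow",
}

# Assumed topology: B, C -> A -> D, E, F (not confirmed by the figure).
EDGES = [("B", "A"), ("C", "A"), ("A", "D"), ("A", "E"), ("A", "F")]


def build_successors(edges):
    """Map each word node to the list of nodes that may follow it."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    return successors


def accepts(phrase, words, successors):
    """Return True if every adjacent word pair in the phrase is an edge."""
    label_of = {word: label for label, word in words.items()}
    try:
        labels = [label_of[word] for word in phrase.split()]
    except KeyError:
        return False  # the phrase contains a word outside the graph
    if not labels:
        return False
    return all(b in successors.get(a, ()) for a, b in zip(labels, labels[1:]))


SUCCESSORS = build_successors(EDGES)
```

Under these assumptions, `accepts("very cloudy today", WORDS, SUCCESSORS)` follows the path B, A, E through the lattice, whereas a phrase whose word order has no corresponding edges is rejected. A practical recognizer would score acoustic evidence along each path rather than test membership.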
A disadvantage of using word graph 100 based on the within word acoustical model is that it does not account for variations in how a word can be pronounced. That is, pronunciations can vary, e.g., from person to person, from dialect to dialect, or from context to context. Thus, the word graph 100 is prone to a high speech recognition error rate.
FIG. 1B is a prior art word graph 150 based on the cross-word acoustical model. The word graph 150 based on the cross-word acoustical model is derived from the word graph 100 based on the within word acoustical model. The word graph 150 provides better recognition accuracy than prior art word graph 100 by accounting for co-articulation effects between words. Referring to FIG. 1B, word graph 150 includes a plurality of word nodes 101a ((B)A(D)) through 101f ((C)A(F)) and word nodes 102 (B) through 106 (F). Word nodes 101a through 101f are copies of the word node A with varying left contexts and right contexts. For example, word node 101a provides a left context (B) and a right context (D) for the word A.
A disadvantage of using word graph 150 based on the cross-word acoustical model is that it requires multiple copies of a word node. That is, word node (A) must be duplicated in word graph 150 to account for each of its context variations. For example, referring to word graph 150, word node 101a ((B)A(D)) refers to one copy of node A under the left context B and right context D, and word node 101f refers to another copy of node A under the left context C and right context F. If, for example, word node (A) included 5 phones, then 6 copies of word node A would be required according to its context (one for each pairing of left context B or C with right context D, E, or F), and 30 internal phones would be generated for word A (e.g., BA1, A2, A3, A4, A5D, BA1, A2, A3, A4, A5E . . . ). As such, a word graph based on the cross-word acoustical model can consume large amounts of memory. Furthermore, the computation required for continuous speech recognition increases significantly when such large word graphs are used.
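The duplication described above can be illustrated with a short sketch that enumerates the context-dependent copies of word A and counts the resulting phones. The function name `expand_cross_word` and the phone labels A1 through A5 are hypothetical, and the sketch assumes the word has at least one phone and that only its first and last phones carry cross-word context, consistent with the labels BA1 and A5D in the example.

```python
from itertools import product


def expand_cross_word(word, phones, left_contexts, right_contexts):
    """Create one copy of the word's phone list per (left, right) context pair.

    Only the first phone is tagged with the left context and only the last
    phone with the right context; interior phones are copied unchanged.
    """
    copies = {}
    for left, right in product(left_contexts, right_contexts):
        labelled = list(phones)
        labelled[0] = left + labelled[0]       # e.g. "A1" -> "BA1"
        labelled[-1] = labelled[-1] + right    # e.g. "A5" -> "A5D"
        copies[f"({left}){word}({right})"] = labelled
    return copies


copies = expand_cross_word(
    "A",
    ["A1", "A2", "A3", "A4", "A5"],  # assumed 5-phone pronunciation of A
    left_contexts=["B", "C"],
    right_contexts=["D", "E", "F"],
)
total_phones = sum(len(phone_list) for phone_list in copies.values())
```

With two left contexts and three right contexts, `copies` holds 6 entries and `total_phones` is 30, which matches the figures given for word A. The count grows with the product of the numbers of left and right contexts, which is why the memory cost rises quickly for well-connected words.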