1. Field
Embodiments of the invention relate to the field of speech recognition and, more specifically, to improving the robustness of a speech recognizer to environmental changes.
2. Background
Many general purpose speech recognizers are built using a Hidden Markov Model (HMM) and process speech at a speech unit level (e.g., phone, word, function word, syllable, beginning and final syllables, etc.). A phone speech unit is typically a perceptually unique portion of audio (e.g., speech), within a sequence of sounds, that has been decomposed from a word. For example, the phrase “I want” may include five distinct phones (ay, w, ao, n, and t in the TIMIT phone system). Each phone may span multiple features or frames (the number of which typically depends on the length of the phone and typically differs across speakers, speech rates, emotional states, etc.). The HMMs typically include multiple states to process different parts of each phone. For example, a three state HMM processes the beginning, nucleus, and end of each phone in an initial, body, and final state respectively. Left to right HMMs are used in speech recognition, where the initial HMM states are defined as entry model states that are not connected from any other entry states except themselves, the final HMM states are terminal model states that are not connected to any other states except themselves, and the body HMM states are any other intermediate states. The preceding definitions also cover left to right HMMs with state skipping connections.
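By way of illustration only, the three state left to right topology described above can be sketched as a transition matrix. The probability values, the function names, and the optional skip connection from the initial state to the final state are assumptions chosen for the example and are not taken from any particular recognizer.

```python
# Illustrative three-state left-to-right HMM topology for one phone.
# All probability values are assumed for illustration only.
STATES = ["initial", "body", "final"]


def left_to_right_transitions(self_loop=0.6, skip=0.0):
    """Return a matrix A where A[i][j] = P(next state j | current state i).

    self_loop: probability of staying in the same state (hypothetical value).
    skip: probability of jumping from the initial state straight to the
          final state (a state skipping connection); 0.0 disables it.
    """
    advance = 1.0 - self_loop - skip
    if advance < 0:
        raise ValueError("self_loop + skip must not exceed 1")
    return [
        [self_loop, advance, skip],          # initial: stay, advance, or skip
        [0.0, self_loop, 1.0 - self_loop],   # body: stay or advance to final
        [0.0, 0.0, 1.0],                     # final: connected only to itself
    ]


def is_left_to_right(a):
    """True when no transition leads back to an earlier state."""
    return all(a[i][j] == 0.0 for i in range(len(a)) for j in range(i))
```

In this sketch the zero entries below the diagonal express the left to right constraint, and the first column shows that the initial state is entered only from itself within the model.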
Typical speech recognizers use a context independent HMM (e.g., a monophone HMM) or a context dependent HMM (e.g., a biphone (left or right) HMM, demiphone HMM, triphone HMM, etc.). A context independent HMM does not take into consideration neighboring speech units when processing each base speech unit. In contrast, a context dependent HMM takes into account neighboring speech units when processing each base speech unit. For example, a typical biphone HMM takes into account a single neighboring phone (the previous phone is taken into account in left biphone HMMs, and the subsequent phone is taken into account in right biphone HMMs). Each state in a typical triphone HMM takes into account the previous phone and the subsequent phone. The previous definitions of initial state, body state, and final state are valid for all left to right HMM monophones, biphones, and triphones.
Other context dependent HMMs include demiphones, which are two connected sub-phonetic contextual units. A demiphone includes a left demiphone part and a right demiphone part. Each demiphone part models a portion of a phone, has only one contextual dependency, and is a normal HMM. A left demiphone part models the phone beginning and takes into account the previous phone, while a right demiphone part models the phone ending and takes into account the subsequent phone. Demiphones can model the phone area evenly or unevenly. When a demiphone models the phone area unevenly, one of the demiphone parts is dominant and has more states than the other. For example, in a left dominant demiphone, the left demiphone part has more states than the right demiphone part. In a right dominant demiphone, the right demiphone part has more states than the left demiphone part. The initial state of a demiphone is an entry model state in the left demiphone part and is not connected from any other entry states except itself. The final state of a demiphone is a terminal model state in the right demiphone part and is not connected to any other states except itself. The body state(s) of a demiphone are the other states of the demiphone (different from the initial state and final state), and may be included in the left demiphone part, the right demiphone part, or both.
The following table illustrates a phone transcription of the sentence “I want” using typical context independent TIMIT monophones, and typical context dependent left biphones, right biphones, triphones, and demiphones.
TABLE 1
I want
Context Independent Monophones: sil ay w ao n t sil
Context Dependent Left Biphones: sil sil-ay ay-w w-ao ao-n n-t sil
Context Dependent Right Biphones: sil ay+w w+ao ao+n n+t t+sil sil
Context Dependent Triphones: sil sil-ay+w ay-w+ao w-ao+n ao-n+t n-t+sil sil
Context Dependent Demiphones: sil sil-ay ay+w ay-w w+ao w-ao ao+n ao-n n+t n-t t+sil sil
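The transcriptions in Table 1 follow a regular pattern: a previous phone is written before a hyphen and a subsequent phone after a plus sign, with silence (sil) serving as the context at the sentence boundaries. As a minimal sketch, assuming that naming pattern and using hypothetical function names, the context dependent transcriptions can be derived from the monophone sequence as follows.

```python
def _contexts(phones, sil="sil"):
    """Yield (previous, current, subsequent) for each phone, padding with silence."""
    padded = [sil] + list(phones) + [sil]
    for k in range(1, len(padded) - 1):
        yield padded[k - 1], padded[k], padded[k + 1]


def left_biphones(phones, sil="sil"):
    """Each unit depends on the previous phone only."""
    return [sil] + [f"{l}-{p}" for l, p, _ in _contexts(phones, sil)] + [sil]


def right_biphones(phones, sil="sil"):
    """Each unit depends on the subsequent phone only."""
    return [sil] + [f"{p}+{r}" for _, p, r in _contexts(phones, sil)] + [sil]


def triphones(phones, sil="sil"):
    """Each unit depends on both the previous and the subsequent phone."""
    return [sil] + [f"{l}-{p}+{r}" for l, p, r in _contexts(phones, sil)] + [sil]


def demiphones(phones, sil="sil"):
    """Each phone becomes a left part (previous context) and a right part (subsequent context)."""
    units = [sil]
    for l, p, r in _contexts(phones, sil):
        units.append(f"{l}-{p}")  # left demiphone part
        units.append(f"{p}+{r}")  # right demiphone part
    units.append(sil)
    return units


if __name__ == "__main__":
    i_want = ["ay", "w", "ao", "n", "t"]
    for build in (left_biphones, right_biphones, triphones, demiphones):
        print(build.__name__, " ".join(build(i_want)))
```

Run on the five phones of “I want”, each function reproduces the corresponding row of Table 1.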
Each base speech unit can be represented with state transition probabilities {A_ip} and output probability observation distributions {B_ip(O_t)}. The output observation distributions are typically multivariate mixtures of Gaussian distributions and determine the probability of generating observation O_t (or input frame) at time t. The output observation distributions are indexed by the state index i and the speech unit index p, and take the input observation O_t at time t as their argument.
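As a hedged illustration of how the output observation distribution of one state of one speech unit might be evaluated on a single input frame, the following sketch computes the log probability under a mixture of Gaussians. The diagonal covariance structure, the function names, and the layout of the mixture parameters are assumptions made for the example; the description above requires only a multivariate mixture of Gaussian distributions.

```python
import math


def log_gaussian_diag(obs, mean, var):
    """Log density of a diagonal-covariance Gaussian at the observation vector obs."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, mean, var)
    )


def log_output_probability(obs, mixture):
    """Log output probability of one state for one input frame.

    mixture: list of (weight, mean, var) tuples for that state (hypothetical
             layout); the weights are assumed to be positive and sum to 1.
    """
    terms = [math.log(w) + log_gaussian_diag(obs, m, v) for w, m, v in mixture]
    peak = max(terms)
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

A recognizer along these lines would hold one such mixture per state index and speech unit index, for example in a table keyed by the pair (p, i), and would evaluate it for every input frame.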
In a context independent HMM, each speech unit (e.g., phone unit, word unit, function word unit, syllable unit, beginning and final syllables unit, etc.) has a single observation distribution for each state. Thus, for an English speech recognizer using 40 unique phones and a three state context independent HMM per phone, the system uses a total of 120 observation distributions. Since context dependent HMMs take into consideration neighboring speech unit(s), they use more observation distributions than context independent HMMs. It is not unusual for the number of output observation distributions to range from 1,000 to 5,000 in a typical context dependent HMM speech recognizer. The number of observation distributions for context dependent HMMs can be limited by applying a uniform decision tree clustering algorithm or a uniform data driven clustering algorithm; however, these algorithms use a uniform cluster threshold that is the same across each of the states of a phone.
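The arithmetic behind these counts can be sketched as follows. The context independent figure follows directly from the example above; the context dependent figure is only an upper bound, computed under the assumption that every combination of neighboring phones receives its own distributions and that no clustering is applied. The function names are hypothetical.

```python
def context_independent_count(num_phones, states_per_phone):
    """One observation distribution per state of each phone."""
    return num_phones * states_per_phone


def unclustered_upper_bound(num_phones, states_per_phone, num_neighbors):
    """Upper bound when every neighbor combination gets its own distributions.

    num_neighbors: 1 for biphones, 2 for triphones. This is an assumed worst
    case; practical systems observe fewer contexts and cluster the rest.
    """
    return (num_phones ** (1 + num_neighbors)) * states_per_phone


if __name__ == "__main__":
    print(context_independent_count(40, 3))    # 120
    print(unclustered_upper_bound(40, 3, 1))   # 4800
    print(unclustered_upper_bound(40, 3, 2))   # 192000
```

Under these assumptions the unclustered triphone bound is far larger than the 1,000 to 5,000 distributions cited above, which is consistent with the use of clustering to limit the number of distributions.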
Speech recognizers that use context dependent HMMs are typically more accurate than speech recognizers that use context independent HMMs; however, they also generally require more memory and computational resources. In addition, training context dependent HMMs requires significantly more training data than training context independent HMMs, and training triphone HMMs requires more data than training biphone HMMs.
Some speech recognizers are trained in a training environment before the system is released, which reduces or eliminates the need for an end user to train the speech recognition system. Often this training environment is optimal for speech recognition, and high accuracy is typically obtained in it. However, the environment of real commercial scenarios (e.g., environments where the speech recognition system is commercially used) often differs from the training environment (e.g., different noises, etc.), and consequently the accuracy of the speech recognizer decreases. Different environmental variables may be taken into consideration in the training environment (e.g., different noises, reverberation, channel effects, etc.). However, it is possible that the environment in which end users ultimately use the system is different or cannot be taken into consideration during training.
Speech recognizers using typical context dependent biphones are accurate in matched conditions (where the environment is substantially the same during training and usage) but are inaccurate in mismatched conditions (where the environments of training and usage are different). Although accuracy can be improved using a noise robust front-end or back-end technology (e.g., feature transformation and normalization, noise attenuation, speech enhancement, HMM back-end noise compensation, etc.), the accuracy in mismatched conditions may still not be acceptable. In contrast, speech recognizers that use typical context independent HMMs (e.g., monophones), while less accurate in the original training environment, are more robust to environmental changes than speech recognizers using typical context dependent biphone HMMs.