Speech recognition, generally, is a process of converting an acoustic signal into a linguistic message. Automatic speech recognition is typically cast as an instance of information transmission over a noisy channel, which leads to the adoption of a statistical framework.
FIG. 1 is a block diagram illustrating a typical automatic speech recognition framework. As shown in FIG. 1, framework 200 includes a speech production side 101 and a speech recognition side 102. W refers to a sequence of words 103 intended to be produced and A refers to an acoustic realization of the word sequence 107. As shown in FIG. 1, the speech production side 101 involves determination of the probability 105 of the acoustic realization 107 given the intended word sequence 103. Ŵ refers to a recognized sequence of words 113 output to the user. The speech recognition side 102 involves determination of the probability 111 of the recognized sequence 113 given the evidence observed 109. The terms “production” and “recognition” are application-specific variants of the usual information-theoretic terms “encoding” and “decoding.”
As shown in FIG. 1, blocks in framework 200 are assigned a set of parameters of the form Pr(⋅|⋅), indicating the probability of a noisy process characterized by a statistical model. Since the purpose of speech recognition is to recover the word sequence most likely intended by the user, the output sequence Ŵ 113 satisfies the following:
$$\hat{W} = \arg\max_{W} \Pr(W \mid A), \qquad (1)$$
where the maximization is done over all possible word sequences in the language. Using Bayes' rule, (1) is typically re-written as:
$$\hat{W} = \arg\max_{W} \Pr(A \mid W)\,\Pr(W), \qquad (2)$$
which follows because Pr(W|A) = Pr(A|W) Pr(W) / Pr(A) and the denominator Pr(A) does not depend on W, so it can be dropped from the maximization. Expression (2) has the advantage of decoupling the two main aspects of the process: the acoustic model Pr(A|W) 105, which is in evidence on the speech production (or training) side 101 of FIG. 1, and the language model Pr(W), which simply models the prior probability of the word sequence W in the language. Typically, acoustic and language models are Markovian in nature, and commonly involve hidden Markov model (“HMM”)/n-gram modeling.
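The decision rule in (2) can be illustrated with a minimal Python sketch. It assumes a small, fixed list of candidate word sequences and made-up scores; the function name `decode` and all probabilities are hypothetical and serve only to show how the acoustic and language model terms combine. A practical recognizer would search a far larger hypothesis space rather than enumerate candidates.

```python
import math


def decode(acoustic_loglik, lm_logprob):
    """Return the candidate W maximizing log Pr(A|W) + log Pr(W).

    Summing log-probabilities is equivalent to maximizing the product
    Pr(A|W) * Pr(W) in expression (2), and avoids numerical underflow.
    """
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Hypothetical scores for two candidate transcriptions of the same audio A.
acoustic = {  # log Pr(A|W): how well each candidate explains the audio
    "recognize speech": math.log(0.20),
    "wreck a nice beach": math.log(0.25),
}
language = {  # log Pr(W): prior probability of each word sequence
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(acoustic, language)
print(best)
```

With these illustrative numbers, the acoustic model alone slightly favors the second candidate, but the language model prior makes the first candidate the overall maximizer, which is the decoupling that (2) describes.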
In many applications, such as automated dictation or computer data entry, it may be critical that the resulting message represent a verbatim transcription of a sequence of spoken words. Typically, a large vocabulary requires state-of-the-art acoustic modeling to use a large number of parameters. For example, while English has fewer than 50 phonemes (elementary units of sound), acoustic models in state-of-the-art systems commonly comprise tens to hundreds of thousands of parameters, e.g., Gaussian components. Typically, state-of-the-art acoustic models require such high dimensionality because of the extreme variability involved in the acoustic realization of the underlying phoneme sequence. As a result of this over-dimensioning, state-of-the-art acoustic systems consume a large amount of resources, which in turn makes them difficult to deploy on a mobile platform, e.g., the iPhone, without compromising recognition accuracy.
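As a rough illustration of how such parameter counts arise, the following Python sketch tallies the size of a Gaussian mixture acoustic model with diagonal covariances. The configuration values (number of tied HMM states, mixtures per state, feature dimension) are assumptions chosen purely for illustration and are not taken from any particular system; the function name is likewise hypothetical.

```python
def gmm_acoustic_model_size(num_tied_states, mixtures_per_state, feature_dim):
    """Count Gaussian components and scalar values in a diagonal-covariance GMM.

    Each diagonal Gaussian stores a mean vector (feature_dim values),
    a variance vector (feature_dim values), and one mixture weight.
    """
    num_gaussians = num_tied_states * mixtures_per_state
    scalars_per_gaussian = 2 * feature_dim + 1
    return num_gaussians, num_gaussians * scalars_per_gaussian


# Assumed, illustrative configuration (not from the source document).
gaussians, scalars = gmm_acoustic_model_size(
    num_tied_states=5000, mixtures_per_state=16, feature_dim=39
)
print(gaussians, scalars)
```

Under these assumed values the model holds 80,000 Gaussian components, which falls within the range quoted above, and each component carries its own mean, variance, and weight values, so storage and computation grow accordingly.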