The more advanced software speech recognition systems are now approaching their ultimate goal: large-vocabulary, continuous, speaker-independent, real-time speech recognition. However, while accurate, these systems are extremely computationally intensive, requiring the full processing resources of a modern desktop computer to run in real time. Such heavy computational requirements either rule out many applications for speech recognition or require making tradeoffs in accuracy.
A high-level architecture of a modern, state-of-the-art speech recognition system 10 is illustrated in FIG. 1. The speech recognition system 10 is implemented in software and includes a feature extraction stage 12, an acoustic scoring stage 14, and a backend search stage 16. Generally, speech is acquired, digitized by an analog-to-digital converter (ADC), and segmented into a sequence of overlapping windows at roughly millisecond-level granularity. From here, the first step in speech recognition is to extract meaningful information from each speech segment at the feature extraction stage 12. The feature extraction stage 12 uses digital signal processing (DSP) techniques to find the best parameters, or features, to uniquely discriminate different sounds. This involves a set of filtering actions, spectral analysis (via the Fast Fourier Transform (FFT)), nonlinear combination of spectral components in ways consistent with the physiology of the human auditory system, and the calculation of time derivatives of these quantities over several frames of speech to track dynamics. Several common methods have evolved, most notably Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP). At the output of the feature extraction stage 12, the features are assembled into a feature vector and passed to the acoustic scoring stage 14. The feature vector is a unique “fingerprint” for speech heard in one input frame.
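The windowing and time-derivative steps described above can be sketched as follows. This is only an illustrative outline, not the feature extraction stage 12 itself: the function names, the Hamming window, and the two-frame central difference are assumptions chosen for brevity, and the spectral analysis and MFCC or PLP steps are omitted.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len samples, advancing by hop."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def apply_hamming(frame):
    """Taper a frame with a Hamming window (an assumed, commonly used choice)."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def deltas(features):
    """Approximate per-frame time derivatives with a central difference.

    features is a list of per-frame feature lists; edge frames reuse
    themselves as the missing neighbor.
    """
    last = len(features) - 1
    out = []
    for t in range(len(features)):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

As a usage note under assumed values, audio sampled at 16 kHz with a 25 ms window and a 10 ms hop would call `frame_signal(samples, 400, 160)`; none of these figures comes from the description above.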
Next, the acoustic scoring stage 14 receives the feature vector for the speech heard in one input frame, and matches the feature vector against a large library of stored atomic sounds. These atomic sounds are obtained from training over a very large number of speakers, all speaking from a target vocabulary. In the earliest recognizers, these atomic units of speech were phonemes, or phones, where phones are the smallest units of sound that distinguish meaning in different words. There are approximately 50 such phones in the English language, corresponding roughly to the familiar consonant and vowel sounds. For example, “five” has three phones: /f/ /i/ /v/, and “nine” also has three phones: /n/ /i/ /n/. Modern recognizers improve on this idea by modeling phones in context, as illustrated in FIG. 2. For example, the middle vowel /i/ sound in “five” is different from the middle vowel /i/ sound in “nine” because of context. Therefore, as a first complication, for the English language, roughly 50×50×50 sounds, now called triphones, are modeled as the library of recognizable acoustic units. Note that different languages have different basic phones, but the idea works across languages. Each of these ~100,000 sounds is further decomposed into a set of frame-sized sub-acoustic units called senones. A complete model, which is referred to as an acoustic model, at this stage typically has several thousand senones.
Thus, the goal of the acoustic scoring stage 14 is to match the feature vector (a point in feature space) received from the feature extraction stage 12 against a library of atomic sounds (senones, each a complex region in feature space). The most common strategy is to model each senone as a Gaussian Mixture Model (GMM), where a GMM is a weighted sum of Gaussian density functions, each with an appropriately fit mean (center) and variance (radius). For every senone, the acoustic scoring stage 14 calculates a number, the GMM probability, that expresses how well the just-heard feature vector matches that senone. Assuming a diagonal covariance matrix, the GMM probability for each senone is calculated based on the following equation:
$$\mathrm{PROB}_s(X) = \sum_{i=1}^{n(s)} \frac{w_{s,i}}{\sqrt{(2\pi)^{d}\,\left|\Lambda_{s,i}\right|}} \exp\left( \sum_{j=1}^{d} -\frac{1}{2\sigma_{s,i,j}^{2}} \left( x_j - \mu_{s,i,j} \right)^{2} \right),$$

where $n(s)$ is the number of Gaussians in the mixture, $w_{s,i}$ is a weight of the i-th Gaussian for senone s, $|\Lambda_{s,i}|$ is the determinant of covariance matrix $\Lambda_{s,i}$ for the i-th Gaussian for senone s, $\sigma_{s,i,j}^{2}$ is the variance for the j-th dimension of d-dimensional density for the i-th Gaussian for senone s, $x_j$ is the j-th element of d-dimensional feature vector X, and $\mu_{s,i,j}$ is the j-th element of a d-dimensional mean for the i-th Gaussian for senone s.
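As an illustration of the equation above, the following sketch evaluates the logarithm of the GMM probability for a single senone with diagonal covariances, where the determinant of each covariance matrix reduces to the product of its per-dimension variances. The function name and argument layout are assumptions made for this example, and the log-sum-exp step is a standard numerical safeguard rather than something the description specifies.

```python
import math

def log_gmm_prob(x, weights, means, variances):
    """Return log PROB_s(X) for one senone modeled with diagonal covariances.

    x:         feature vector, a list of d floats
    weights:   mixture weights w_{s,i}, one per Gaussian
    means:     one list of d means per Gaussian
    variances: one list of d variances per Gaussian
    """
    d = len(x)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log of 1 / sqrt((2*pi)^d * |Lambda|), with |Lambda| = product of variances
        log_norm = -0.5 * (d * math.log(2 * math.pi)
                           + sum(math.log(v) for v in var))
        # exponent: sum over dimensions of -(x_j - mu_j)^2 / (2 * sigma_j^2)
        exponent = -sum((xj - mj) ** 2 / (2 * vj)
                        for xj, mj, vj in zip(x, mu, var))
        log_terms.append(math.log(w) + log_norm + exponent)
    # log-sum-exp keeps the sum over Gaussians numerically stable
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Working in the log domain avoids underflow when the per-Gaussian densities are very small, which is common in feature spaces of several dozen dimensions.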
In conventional usage, the logarithm (log) of the GMM probability is used for subsequent computational convenience. This log(probability) is calculated for each senone and delivered to the following backend search stage 16. A complex acoustic model can easily have 10,000 senones, each modeled with 64 Gaussians, in a feature space of between 30 and 50 dimensions. The output of the acoustic scoring stage 14 is a vector of scores (10,000 log(probability) numbers, in this case), one per senone. Note that a new feature vector is input to the acoustic scoring stage 14 for each frame of sampled speech. In response, the acoustic scoring stage 14 outputs a vector of scores including one score per senone for each frame of sampled speech based on the corresponding feature vector.
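To give a rough sense of the workload these figures imply, the sketch below counts the per-dimension terms evaluated for one frame. The dimension of 39 is an assumed value inside the stated range of 30 to 50, and the rate of 100 frames per second is likewise an assumption used only for illustration.

```python
def terms_per_frame(senones, gaussians, dim):
    """Number of per-dimension (x_j - mu_j)^2 / sigma_j^2 terms evaluated per frame."""
    return senones * gaussians * dim

# 10,000 senones and 64 Gaussians are the figures given above;
# dim=39 and 100 frames per second are assumed for illustration.
per_frame = terms_per_frame(10_000, 64, 39)   # 24,960,000 terms
per_second = per_frame * 100                  # 2,496,000,000 terms
```

Under these assumptions, exhaustive scoring requires on the order of billions of multiply-accumulate style operations per second before any backend search is performed.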
The backend search stage 16 delivers a set of most-likely-to-be-heard words as its output based on senone scores provided by the acoustic scoring stage 14 for each frame of sampled speech. Specifically, the backend search stage 16 uses a layered recognition model to first assemble features into triphones (each modeled as a sequence of senones), then into words (stored in a dictionary, each modeled as a sequence of triphones). At the lowest, acoustic level of this process, Hidden Markov Models (HMMs) are used to model each triphone, where senones are the states in each HMM. As illustrated in FIG. 3, each triphone is modeled as a linear sequence of states (senones). Looping self-arrows allow an individual senone to extend over more than one frame of sampled speech. Rightward arrows model progression from one senone, or atomic sound, to another in this triphone. Each transition (represented by an arrow) has a transition probability. An ending “null” state allows the triphone to connect to a following triphone. Mechanically, for each frame of sampled speech, the acoustic scoring stage 14 delivers a set of senone scores, or log(probability) numbers, including a score for each senone. The backend search stage 16 then scores triphones based on the senone scores for states of the corresponding HMMs.
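One way to picture how a triphone HMM is scored from senone scores is the best-path recursion sketched below for a linear, left-to-right chain of states. This is a simplified illustration under stated assumptions, not the backend search stage 16: the function name, the argument layout, and the treatment of the final transition as the exit to the null state are all hypothetical.

```python
NEG_INF = float("-inf")

def score_triphone(senone_ids, log_self, log_next, frame_scores):
    """Best-path log score of a left-to-right HMM that exits after the last frame.

    senone_ids:   senone index for each HMM state, in order
    log_self[k]:  log probability of the self-loop on state k
    log_next[k]:  log probability of leaving state k (the last one exits to null)
    frame_scores: per-frame sequences mapping senone index to log(probability)
    """
    n = len(senone_ids)
    best = [NEG_INF] * n
    best[0] = frame_scores[0][senone_ids[0]]  # every path starts in the first state
    for scores in frame_scores[1:]:
        new = [NEG_INF] * n
        for k in range(n):
            stay = best[k] + log_self[k]
            move = best[k - 1] + log_next[k - 1] if k > 0 else NEG_INF
            new[k] = max(stay, move) + scores[senone_ids[k]]
        best = new
    # leaving the last state reaches the null state that links to the next triphone
    return best[-1] + log_next[-1]
```

With fewer frames than states the last state is unreachable and the function returns negative infinity, which reflects the rule that each state must consume at least one frame.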
Each word in a vocabulary of the speech recognition system 10 to be recognized is decomposed into a set of “overlapping” triphones, i.e., the ending context of one triphone is the beginning context of the next. FIG. 4 shows an example of connecting the triphones /h/ and /i/ in “hi,” preceded and followed by silences.
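The overlapping decomposition can be sketched as follows, using the “hi” example. The function name and the `SIL` label for silence are assumptions introduced for illustration; each triphone is written as a (left context, phone, right context) tuple.

```python
def word_to_triphones(phones, left="SIL", right="SIL"):
    """Expand a phone sequence into context-dependent triphones.

    Each result is (left context, phone, right context), so the right
    context of one triphone names the phone of the next.
    """
    padded = [left] + list(phones) + [right]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

hi = word_to_triphones(["h", "i"])
# hi == [("SIL", "h", "i"), ("h", "i", "SIL")]
```

With about 50 phones, this scheme allows up to 50 × 50 × 50 = 125,000 distinct triphones, consistent with the order-of-magnitude figure given earlier.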
At the top layer of the backend search process, a language model provides additional statistical knowledge of likely word sequences to further aid recognition. As illustrated in FIG. 5, an n-gram model stores probabilities for individual words (unigrams), two-word (bigram), and three-word (trigram) sequences. At its most fundamental level, the backend search stage 16 constructs a network in which the entire language is represented as a huge directed graph. This graph is itself a cross-product of three separate sets of graphs: a language model, which represents words in likely context; a phonetic model of each word, i.e., a linear sequence of phones for each word; and an acoustic model, which is a linear sequence of feature-matched senones for each phone. FIG. 6 shows an example of the language, phone, and acoustic layers of the backend search process for a simple example with a two-word vocabulary. There are two components to search: the construction of the graph, and the process of finding the best path through the graph. In practical recognition systems, the graph is vastly too large to be statically constructed, and so is built dynamically, frame by frame. Pruning operations remove unlikely (low probability) nodes as needed. The best path through the graph can be found using strategies such as the Viterbi algorithm. The result of the backend search is a sequence of recognized (i.e., the most likely) words.
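The pruning step can be illustrated with beam pruning, a widely used heuristic that is named here as an assumption, since the description does not say which pruning rule is applied. The sketch keeps only the active graph nodes whose log score lies within a fixed margin of the best score for the current frame; the function and parameter names are hypothetical.

```python
def beam_prune(active, beam):
    """Keep nodes whose log score is within `beam` of the best active score.

    active: dict mapping a node identifier to its best log score so far
    beam:   non-negative margin in log-probability units (an assumed tuning knob)
    """
    if not active:
        return {}
    best = max(active.values())
    return {node: score for node, score in active.items()
            if score >= best - beam}
```

A narrower beam reduces the number of nodes expanded per frame at the risk of discarding the path that would eventually have scored best, which is one form of the accuracy tradeoff noted at the outset.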
Due to its complexity, the software-based speech recognition system 10 requires significant processing resources in order to operate. This is especially true for real-time, high-accuracy speech recognition for relatively large vocabularies. Such processing resources are unavailable on resource-constrained platforms such as mobile phones, which face severe limitations on size, weight, computational processing capability, memory storage, battery lifetime, and power consumption. Power consumption is the most severe resource limitation because it limits overall computational capabilities and thus speech recognition quality and usability. Such power consumption limitations prevent high-accuracy speech recognition from being fully implemented within such resource-constrained devices. As such, there is a need for a high-accuracy, low-power speech recognition system suitable for use in resource-constrained devices such as mobile phones.