Speech recognition tools translate human speech data into searchable text. Whether running on a desktop personal computer (PC) or an enterprise server farm, today's state-of-the-art speech recognizers exist as complex software running on conventional computers. This is profoundly limiting for applications that require extreme recognition speed. Today's most sophisticated recognizers fully occupy the computational resources of a high-end server to deliver results at, or near, real-time speed, where each hour of audio input requires roughly one hour of computation for recognition. Such applications range from homeland security, such as searching through large streams of audio intercepts for threats to national security, to video indexing, such as automatically creating a computer-readable text transcription from the audio component or soundtrack of a recorded video.
The high-level architecture of a modern, state-of-the-art speech recognition system 10 is illustrated in FIG. 1. The speech recognition system 10 includes a feature extraction stage 12, an acoustic scoring stage 14, and a backend search stage 16. FIG. 2 is a graphical illustration of the speech recognition system 10. Generally, speech is acquired, digitized by an analog-to-digital converter (ADC) 18, and segmented into a sequence of overlapping windows at roughly millisecond-level granularity. From here, the first step in recognition is to extract meaningful information from each speech segment at the feature extraction stage 12. The feature extraction stage 12 uses digital signal processing (DSP) techniques to find the best parameters, or features, to uniquely discriminate different sounds. This involves a set of filtering actions, spectral analysis (via Fast Fourier Transform (FFT)), nonlinear combination of spectral components in ways consistent with the physiology of the human auditory system, and the calculation of time derivatives of these quantities over several frames of speech to track dynamics. Several common methods have evolved, most notably Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP). At the output of the feature extraction stage 12, the features are assembled into a feature vector 20 and passed to the acoustic scoring stage 14. The feature vector 20 is a unique “fingerprint” for speech heard in one input frame.
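As a rough illustration of the feature extraction steps described above (windowing, FFT, perceptually motivated combination of spectral components, and time derivatives), the following is a minimal MFCC-style sketch in Python using NumPy. It is not the implementation of the feature extraction stage 12. The function names and all numeric settings (16 kHz sampling, 25 ms windows with a 10 ms hop, 26 mel filters, 13 cepstral coefficients, and simple central-difference derivatives) are assumptions chosen for illustration only.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping, Hamming-weighted windows."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel (perceptual) scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, used to decorrelate log filter energies."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2.0 * n_in))

def extract_features(signal, sample_rate=16000, n_fft=512,
                     n_filters=26, n_ceps=13):
    """Return one feature vector per frame: cepstra plus two derivatives."""
    frames = frame_signal(signal, sample_rate)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)      # small floor avoids log(0)
    ceps = log_mel @ dct_matrix(n_ceps, n_filters).T
    delta = np.gradient(ceps, axis=0)         # first time derivative
    delta2 = np.gradient(delta, axis=0)       # second time derivative
    return np.hstack([ceps, delta, delta2])

# Usage: one second of a synthetic 440 Hz tone stands in for speech.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = extract_features(tone)
```

Under these assumed settings, each row of `features` plays the role of one feature vector 20, with 39 elements per frame (13 cepstra and their first and second derivatives).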
Next, the acoustic scoring stage 14 receives the feature vector 20 for the speech heard in one input frame, and matches the feature vector 20 against a large library of stored atomic sounds. These atomic sounds are obtained from training over a very large number of speakers, all speaking from a target vocabulary. In the earliest recognizers, these atomic units of speech were phonemes, or phones, where phones are the smallest units of sound that distinguish meaning in different words. There are approximately 50 such phones in the English language, corresponding roughly to the familiar consonant and vowel sounds. For example, “five” has three phones: /f/ /i/ /v/, and “nine” also has three phones: /n/ /i/ /n/. Modern recognizers improve on this idea by modeling phones in context, as illustrated in FIG. 3. For example, the middle vowel /i/ sound in “five” is different from the middle vowel /i/ sound in “nine” because of context. Therefore, as a first complication, for the English language, roughly 50×50×50 sounds, now called triphones, are modeled as the library of recognizable acoustic units. Note that different languages have different basic phones, but the idea works across languages. Each of these roughly 100,000 sounds is further decomposed into a set of frame-sized sub-acoustic units called senones. A complete model, which is referred to as an acoustic model, at this stage typically has several thousand senones.
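The context-dependent modeling described above can be sketched as follows. This minimal Python example labels each phone with its left and right neighbors and counts the possible triphones for a 50-phone inventory. The function name `to_triphones`, the `left-center+right` label format, and the `sil` boundary marker are illustrative assumptions and are not taken from the system described here.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right context.

    The 'left-center+right' label format and the 'sil' boundary
    marker are assumptions made for this sketch.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

five = to_triphones(["f", "i", "v"])   # ['sil-f+i', 'f-i+v', 'i-v+sil']
nine = to_triphones(["n", "i", "n"])   # ['sil-n+i', 'n-i+n', 'i-n+sil']

# The middle vowel /i/ becomes a different unit in each word.
assert five[1] != nine[1]

# Upper bound on distinct triphones for an inventory of about 50 phones.
n_phones = 50
n_triphones = n_phones ** 3            # 125000
```

The count of 125,000 is an upper bound, of the same order as the roughly 100,000 units cited above.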
As graphically illustrated in FIG. 4, the acoustic scoring stage 14 operates to match the feature vector 20 (a point in feature space) received from the feature extraction stage 12 against a library of atomic sounds (senones, each a complex region in feature space). The most common strategy is to model each senone as a Gaussian Mixture Model (GMM), where a GMM is a weighted sum of Gaussian density functions, each with an appropriately fitted mean (center) and variance (radius). For every senone, the acoustic scoring stage 14 calculates a number, the GMM probability, that expresses how well the just-heard feature vector matches that senone. Assuming a diagonal covariance matrix, the GMM probability for each senone is calculated based on the following equation:
\[
\mathrm{PROB}_{s}(X) = \sum_{i=1}^{n(s)} \frac{w_{s,i}}{\sqrt{(2\pi)^{d}\,|\Lambda_{s,i}|}} \exp\left( \sum_{j=1}^{d} -\frac{1}{2\sigma_{s,i,j}^{2}} \left( x_{j} - \mu_{s,i,j} \right)^{2} \right),
\]
where $n(s)$ is the number of Gaussians in the mixture, $w_{s,i}$ is a weight of the i-th Gaussian for senone s, $|\Lambda_{s,i}|$ is the determinant of covariance matrix $\Lambda_{s,i}$ for the i-th Gaussian for senone s, $\sigma_{s,i,j}^{2}$ is the variance for the j-th dimension of d-dimensional density for the i-th Gaussian for senone s, $x_{j}$ is the j-th element of d-dimensional feature vector X, and $\mu_{s,i,j}$ is the j-th element of a d-dimensional mean for the i-th Gaussian for senone s.
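For illustration, the GMM probability defined above can be evaluated for a single senone as in the following Python sketch. It returns the logarithm of the probability and evaluates the sum over Gaussians with a log-sum-exp step for numerical stability; that step is a choice made for this sketch and is not prescribed by the text. The function name `gmm_log_prob` and the array layout are assumptions.

```python
import numpy as np

def gmm_log_prob(x, weights, means, variances):
    """Return log PROB_s(X) for one senone with diagonal covariances.

    x:         (d,)   feature vector X
    weights:   (n,)   mixture weights w_{s,i}
    means:     (n, d) means mu_{s,i,j}
    variances: (n, d) variances sigma^2_{s,i,j}
    """
    d = x.shape[0]
    # For a diagonal covariance, |Lambda_{s,i}| is the product of the
    # per-dimension variances, so its log is a sum of logs.
    log_norm = -0.5 * (d * np.log(2.0 * np.pi)
                       + np.sum(np.log(variances), axis=1))
    # Exponent term: sum over j of -(x_j - mu_j)^2 / (2 sigma_j^2).
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the n Gaussians of the mixture.
    peak = np.max(log_terms)
    return peak + np.log(np.sum(np.exp(log_terms - peak)))

# Usage with a small, made-up two-Gaussian senone in two dimensions.
x = np.array([0.5, -1.0])
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [1.0, -1.0]])
variances = np.array([[1.0, 2.0], [0.5, 0.5]])
score = gmm_log_prob(x, weights, means, variances)
```

Repeating this calculation for every senone in the acoustic model yields one score per senone for the current frame.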
In conventional usage, the logarithm (log) of the GMM probability is used for subsequent computational convenience. This log(probability) is calculated for each senone and delivered to the following backend search stage 16. A complex acoustic model can easily have 10,000 senones, each modeled with 64 Gaussians, in a feature space of between 30 and 50 dimensions. The output of the acoustic scoring stage 14 is a vector of scores (10,000 log(probability) numbers, in this case), one per senone. Note that a new feature vector 20 is input to the acoustic scoring stage 14 for each frame of sampled speech. In response, the acoustic scoring stage 14 outputs a vector of scores including one score per senone for each frame of sampled speech based on the corresponding feature vector 20.
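To give a rough sense of scale for the figures above, the following back-of-the-envelope Python sketch counts the per-dimension terms that must be evaluated when every Gaussian of every senone is scored for every frame. The feature dimension of 40 (within the stated range of 30 to 50) and the rate of 100 frames per second (one frame every 10 ms) are assumptions for illustration; the function name `scoring_workload` is hypothetical.

```python
def scoring_workload(n_senones, n_gaussians, dim, frames_per_second):
    """Count per-dimension Gaussian terms per frame and per second."""
    terms_per_frame = n_senones * n_gaussians * dim
    terms_per_second = terms_per_frame * frames_per_second
    # Stored parameters: one mean and one variance per dimension,
    # plus one mixture weight per Gaussian.
    n_parameters = 2 * terms_per_frame + n_senones * n_gaussians
    return terms_per_frame, terms_per_second, n_parameters

per_frame, per_second, n_params = scoring_workload(
    n_senones=10_000,       # from the text
    n_gaussians=64,         # from the text
    dim=40,                 # assumed, within the stated 30 to 50
    frames_per_second=100,  # assumed 10 ms frame step
)
# per_frame  == 25_600_000
# per_second == 2_560_000_000
# n_params   == 51_840_000
```

Each counted term involves a subtraction, a squaring, a scaling by the variance, and an accumulation, so under these assumptions the arithmetic load is on the order of several billion operations per second of audio.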
Returning to FIGS. 1 and 2, the backend search stage 16 uses a layered recognition model to first assemble features into phones (each modeled as a sequence of senones), then into words (stored in a dictionary, each modeled as a sequence of phones). At the lowest, acoustic level of this process, Hidden Markov Models (HMMs) are used to model each phone, where senones are the states in each HMM. At the top layer of this process, a language model provides additional statistical knowledge of likely word sequences to further aid in recognition. At its most fundamental level, the backend search stage 16 constructs a network in which the entire language is represented as a huge directed graph. This graph is itself a cross-product of three separate sets of graphs: the language model, which represents words in likely context; the phonetic model of each word, i.e., a linear sequence of phones for each word; and the acoustic model, which is a linear sequence of feature-matched senones for each phone. FIG. 5 shows an example of the language, phone, and acoustic layers of the backend search process for a simple example with a two-word vocabulary. There are two components to search: the construction of the graph, and the process of finding the best path through the graph. In practical recognition systems, the graph is vastly too large to be statically constructed, and so is built dynamically, frame by frame. Pruning operations remove unlikely (low probability) nodes as needed. The best path through the graph can be found using strategies such as the Viterbi algorithm. The result of the backend search is a sequence of recognized (i.e., the most likely) words.
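The best-path step mentioned above can be sketched with a textbook Viterbi recursion over a single small HMM whose states stand in for senones. This Python example is a simplified illustration only: it works on a tiny, fully constructed graph and omits the dynamic graph construction and pruning described above. The function name `viterbi`, the three-state left-to-right topology, and all probabilities are made up for the example.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init:  (S,)    log initial state probabilities
    log_trans: (S, S)  log transition probabilities, indexed [from, to]
    log_emit:  (T, S)  per-frame log scores, one per state (senone)
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        # cand[i, j]: best score of a path in state i at t-1 moving to j.
        cand = score[:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))

# Usage: a 3-state left-to-right HMM scored over 5 frames.
with np.errstate(divide="ignore"):       # log(0) = -inf is intended
    log_init = np.log(np.array([1.0, 0.0, 0.0]))
    log_trans = np.log(np.array([[0.5, 0.5, 0.0],
                                 [0.0, 0.5, 0.5],
                                 [0.0, 0.0, 1.0]]))
log_emit = np.log(np.array([[0.8, 0.1, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.1, 0.8],
                            [0.1, 0.2, 0.7]]))
best_path, best_score = viterbi(log_init, log_trans, log_emit)
# best_path == [0, 0, 1, 2, 2]
```

In this sketch, each row of `log_emit` corresponds to the vector of per-senone scores that the acoustic scoring stage 14 delivers for one frame.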
Acoustic scoring with a complex acoustic model involves: a high-dimensional feature vector delivered every few milliseconds; a large library of stored atomic sounds called senones; and, for each senone, a numerical GMM comprising a large set of high-dimensional Gaussian densities. For each frame of sampled speech (i.e., every few milliseconds), the acoustic scoring stage 14 is required to calculate a likelihood score, a log(probability), for each senone in the acoustic model and deliver the scores for the senones to the backend search stage 16 for subsequent recognition of word fragments (phones), words, and word sequences (from a language model). Thus, one issue with conventional implementations of the speech recognition system 10 is that the acoustic scoring stage 14 is a bottleneck for applications that require extreme speed. As such, there is a need for a high-speed acoustic scoring stage for a speech recognition system.