1.1. Field of the Invention
The present invention relates to the field of computerized speech recognition.
1.2. Description and Disadvantages of Prior Art
In particular, the present invention relates to a method for operating a large vocabulary speech recognition system, in which a program-controlled recognizer performs the steps of:
1. dissecting a speech signal into short time intervals, i.e., frames, not necessarily of equal length, yielding an extracted feature vector for each frame, e.g., comprising spectral coefficients,
2. labelling frames by characters or groups of them, yielding a plurality of labels per frame,
3. decoding said labels to construct one or more words or fragments of a word,
4. in which method a plurality of recognizers are accessible to be activated for speech recognition, and are combined on an on-demand basis in order to improve the results of speech recognition done by a single recognizer.
More particularly, the above-mentioned continuous speech recognizers capture the many variations of speech sounds by modelling context dependent subword units, e.g., phones or triphones, as elementary Hidden Markov Models, further referred to as “HMM”. Statistical parameters of these models are usually estimated from several hundred hours of labelled training data. While this allows a high recognition accuracy if the training data sufficiently matches the acoustic characteristics of the application scenario, recognition accuracy decreases significantly if the speech recognizer has to cope with acoustic environments with significantly different, and possibly highly dynamically varying, characteristics.
Both online and (un-)supervised batch adaptation techniques tackle the problem by re-estimating the acoustic model parameters, but they are either infeasible if only a very small amount of data is available and/or computational resources are scarce, or, in the case of batch adaptation, cannot properly deal with dynamic changes in the acoustic environment.
Today's large vocabulary continuous speech recognizers employ Hidden Markov Models (HMM) to compute a word sequence w with maximum a posteriori probability from a speech signal.
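For reference, the maximum a posteriori criterion mentioned above is commonly written as the Bayes decision rule below. The symbol X, denoting the sequence of feature vectors derived from the speech signal, is introduced here only for illustration and does not appear in the original text.

$$\hat{w} = \arg\max_{w} P(w \mid X) = \arg\max_{w} \frac{p(X \mid w)\, P(w)}{p(X)} = \arg\max_{w} p(X \mid w)\, P(w)$$

The last step holds because p(X) does not depend on w. In this form, p(X | w) is the term typically supplied by the HMM-based acoustic model, and P(w) is the prior probability of the word sequence.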
A Hidden Markov Model is a stochastic automaton λ=(π, A, B) that operates on a finite set of states S={s1, . . . , sN} and allows for the observation of an output each time t, t=1,2, . . . , T, a state is occupied.
The initial state vector

$$\pi = [\pi_i] = [P(s(1) = s_i)], \quad 1 \le i \le N \tag{1}$$

gives the probabilities that the HMM is in state si at time t=1, and the transition matrix

$$A = [a_{ij}] = [P(s(t+1) = s_j \mid s(t) = s_i)], \quad 1 \le i, j \le N \tag{2}$$

holds the probabilities of a first order time invariant process that describes the transitions from state si to sj. The observations are continuous valued feature vectors x ∈ R derived from the speech signal, and the output probabilities are defined by a set of probability density functions, further referred to herein as pdfs:

$$B: \; [b_i] = [p(x \mid s(t) = s_i)], \quad 1 \le i \le N \tag{3}$$
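As a minimal sketch of how Eqs. 1 and 2 are used, the following Python fragment computes the probability of a given state sequence from an initial state vector and a transition matrix. The function name, the three-state left-to-right topology, and all numerical values are hypothetical and chosen only for illustration.

```python
def state_sequence_probability(pi, A, states):
    """Probability of a state sequence under a first order, time invariant process."""
    if not states:
        raise ValueError("state sequence must not be empty")
    prob = pi[states[0]]                  # Eq. 1: initial state probability
    for prev, nxt in zip(states, states[1:]):
        prob *= A[prev][nxt]              # Eq. 2: transition probability a_ij
    return prob


# Hypothetical three-state left-to-right model (illustrative values only).
pi = [1.0, 0.0, 0.0]
A = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]

p = state_sequence_probability(pi, A, [0, 0, 1, 2])   # 1.0 * 0.6 * 0.4 * 0.3
```

Because the process is time invariant, the same matrix A is applied at every time step; only the first factor comes from π.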
For any given HMM state si the unknown distribution p(x|si) is usually approximated by a mixture of elementary Gaussian pdfs
$$p(x \mid s_i) = \sum_{j \in M_i} \bigl( w_{ji} \cdot N(x \mid \mu_{ji}, \Gamma_{ji}) \bigr) = \sum_{j \in M_i} \Bigl( w_{ji} \cdot \lvert 2\pi\Gamma_{ji} \rvert^{-1/2} \cdot \exp\bigl( -(x - \mu_{ji})^{T}\, \Gamma_{ji}^{-1}\, (x - \mu_{ji}) / 2 \bigr) \Bigr), \tag{4}$$

where Mi is the set of Gaussians associated with state si.
Furthermore, x denotes the observed feature vector, wji is the j-th mixture component weight for the i-th output distribution, and μji and Γji are the mean and covariance matrix of the j-th Gaussian in state si. It should be noted that state and mixture component index of the mean vectors from Eqn. 4 are omitted for simplicity of notation.
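The evaluation of Eq. 4 for a single state can be sketched as follows. To keep the example free of matrix libraries, the sketch assumes diagonal covariance matrices, which is a simplification not stated in the text above; the function names and the two-component mixture are likewise hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """N(x | mean, Gamma) for a diagonal Gamma given by its variances (assumption)."""
    # log |2*pi*Gamma| reduces to a sum over dimensions for a diagonal matrix
    log_det = sum(math.log(2.0 * math.pi * v) for v in var)
    # (x - mu)^T Gamma^{-1} (x - mu) for a diagonal matrix
    maha = sum((xd - md) ** 2 / v for xd, md, v in zip(x, mean, var))
    return math.exp(-0.5 * (log_det + maha))


def state_likelihood(x, mixture):
    """p(x | s_i) as the weighted sum over the Gaussians of state s_i (Eq. 4)."""
    return sum(w * gaussian_diag(x, mu, var) for w, mu, var in mixture)


# Hypothetical mixture for one state: (weight w_ji, mean mu_ji, variances of Gamma_ji)
mixture_i = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [2.0, 2.0], [1.0, 1.0]),
]

p_x = state_likelihood([0.0, 0.0], mixture_i)
```

In a practical recognizer the sum would usually be evaluated in the log domain to avoid numerical underflow for high-dimensional feature vectors.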
State-of-the-art speech recognizers usually consist of the following components:
Feature extraction computes a parametric representation that allows the classification of short portions (frames) of the signal. Frequently used features are either spectral parameters or Mel-Frequency-Cepstrum coefficients (MFCC), which are often enriched by energy values and their time derivatives.
A “labeller” tags each feature vector with a number of labels that represent possible meaningful sub-word units such as context dependent phones or sub-phones. Common techniques for the classification of feature vectors include, for example, statistical classification with Gaussian mixture densities or classification by use of a neural network.
A “decoder” interprets each label as the output of an HMM and computes a word sequence of maximum a posteriori probability. In order to efficiently cope with alternative results from the labelling step, search strategies and pruning techniques are employed. Popular examples are asynchronous stack decoding and time synchronous Viterbi decoding or beam search.
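The decoding step can be illustrated by a minimal sketch of time synchronous Viterbi decoding at the state level. It assumes that the per-frame output probabilities have already been computed, and it omits the lexicon, language model, and pruning that a real decoder would use. All names and numerical values are hypothetical.

```python
import math


def safe_log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(log_pi, log_A, log_B):
    """Return the most likely state path and its log score.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][i] : log output probability of frame t in state i
    """
    n = len(log_pi)
    score = [log_pi[i] + log_B[0][i] for i in range(n)]
    backptr = []
    for t in range(1, len(log_B)):
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_A[i][j])
            new_score.append(score[best_i] + log_A[best_i][j] + log_B[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    state = max(range(n), key=lambda i: score[i])
    best_score = score[state]
    path = [state]
    for ptr in reversed(backptr):         # trace back from the last frame
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Hypothetical two-state model and three frames of output probabilities.
log_pi = [safe_log(p) for p in [0.6, 0.4]]
log_A = [[safe_log(p) for p in row] for row in [[0.7, 0.3], [0.4, 0.6]]]
log_B = [[safe_log(p) for p in row] for row in [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]]

path, best_score = viterbi(log_pi, log_A, log_B)
```

A beam search variant would additionally discard, at each frame, all states whose score falls below the best score by more than a fixed threshold.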
It has been demonstrated recently that a significant reduction in word error rate can be achieved by the combination of (intermediate) results from several base recognizers that run in parallel. Three main approaches can be distinguished:
Feature combination methods compute different sets of features and compose them into a single feature vector that is passed to the labeller.
Likelihood combination methods also compute different feature vectors, but classify them separately. Results from different labelling steps are combined based on their evidence, and for each frame a single vector of alternative labels is passed to the decoder.
ROVER (Recognizer Output Voting Error Reduction) is a post-processing method that uses a dynamic programming technique to merge the outputs from several decoder passes into a single word hypothesis network. At each branching point of the combined network a subsequent voting mechanism selects the word with the highest score for the final transcription.
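The voting stage of the third approach can be sketched in simplified form as follows. The sketch assumes that the dynamic programming alignment has already been performed, so that every recognizer output has the same number of slots and a placeholder marks slots without a word. It uses plain frequency voting, which is only one possible scoring scheme. The function name, the placeholder symbol, and the example hypotheses are hypothetical.

```python
from collections import Counter


def vote(aligned_hypotheses, null="@"):
    """Select the most frequent word per aligned slot; drop slots won by the placeholder."""
    lengths = {len(h) for h in aligned_hypotheses}
    if len(lengths) != 1:
        raise ValueError("hypotheses must be aligned to the same number of slots")
    result = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != null:
            result.append(word)
    return result


# Hypothetical pre-aligned outputs of three base recognizers.
hyps = [
    ["turn", "the", "radio", "on"],
    ["turn", "a", "radio", "on"],
    ["turn", "the", "video", "@"],
]

final = vote(hyps)
```

In this example each slot has a clear majority, so the errors of individual recognizers ("a", "video", and the missing final word) are outvoted. Running all three recognizers to completion is what causes the additional computational load discussed below.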
It is the main goal of the invention proposed here to overcome some problems associated with these methods, while simultaneously maintaining the increased recognition accuracy.
Introduction to the Problem
It is well known in the prior art that the recognition accuracy of a speech recognizer decreases significantly if it is used in an acoustic environment that is not properly represented in the training data. In applications such as desktop dictation this problem can easily be tackled by allowing the end user to enrol to the system in different environments, and methods for the normalization of the incoming feature vectors may be considered as well. However, given the important role of speech as an input medium in pervasive computing, there is a growing number of applications that do not allow an upfront adaptation step. Moreover, if the recognizer has to deal with a potentially large number of dynamically changing acoustic environments, adaptation methods may become infeasible either due to a lack of a sufficient amount of online adaptation data or because of limited computational resources.
A more accurate acoustic model with a very large number of parameters may help to overcome this situation, but is not feasible in typical applications targeted in the invention reported here. These are—amongst others—applications such as interactive voice response solutions, voice driven interfaces for consumer devices (mobile phones, PDAs, home appliances), and low resource speech recognition in the car.
It has been shown in the literature that the combination methods mentioned above can yield significantly better accuracy in noisy environments than a single base recognizer. However, these methods impose an increased computational load on the CPU and also require an increased amount of memory for the storage of several acoustic models and intermediate results; therefore they are not yet suited for low resource speech recognizers.