As is generally known, HMM (Hidden Markov Model) based speech recognition is achieved by a system configuration shown in FIG. 1.
FIG. 1 shows a conventional HMM based one-pass speech recognition system, wherein the system includes an endpoint detector 101, a feature extractor 103, a Viterbi decoder 105 and a storage 107.
The endpoint detector 101 accurately detects a speech signal section of an input signal in a varying background noise environment and provides the speech signal section to the feature extractor 103, wherein the speech signal section is detected with a variety of parameters used for dividing a signal into a speech signal section and a non-speech signal section.
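For illustration only, the division into speech and non-speech sections may be sketched with a single parameter, short-time frame energy compared against a fixed threshold. The function name `detect_endpoints`, the frame length, and the threshold value are assumptions made for this sketch and are not taken from the conventional system; a practical detector would combine several parameters and adapt to the varying background noise.

```python
def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the detected speech section.

    Hypothetical sketch: a frame counts as speech when its mean energy
    exceeds a fixed threshold. Returns None if no frame qualifies.
    """
    energies = []
    # Split the signal into non-overlapping frames and compute mean energy.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)

    # Indices of frames classified as speech.
    speech_frames = [k for k, e in enumerate(energies) if e > threshold]
    if not speech_frames:
        return None

    # The speech section spans from the first to the last speech frame.
    return speech_frames[0] * frame_len, (speech_frames[-1] + 1) * frame_len
```

In this sketch, the returned section is what would be handed to the feature extractor 103; everything outside it is treated as non-speech.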
The feature extractor 103 transforms the speech signal section received from the endpoint detector 101 into feature parameters suitable for speech recognition, mainly by using an MFCC (Mel-Frequency Cepstrum Coefficient) or a PLPCC (Perceptual Linear Prediction Cepstrum Coefficient), and provides the feature parameters to the Viterbi decoder 105.
The Viterbi decoder 105 finds a path of a word or a word phoneme sequence having a maximum likelihood in a search space, wherein the search space includes a linkage structure of within-vocabulary and words, i.e., an HMM based word model 1071, an acoustic model 1073, a pronunciation model 1075, and a word based language model 1077, together with a feature parameter sequence received from the feature extractor 103.
FIG. 2 describes a conventional within-vocabulary model and a memory structure for loading the model. Referring to FIG. 2, the within-vocabulary model, e.g., having two Korean words “goryeogaebal” and “goryeogiwon”, is defined with phoneme nodes, in which the phonemes form the words, and arcs, which represent connection states of the phoneme nodes. Accordingly, in order to load all of the preset within-vocabulary models, the capacity of a memory 201 needs to be the total number of phonemes used for representing the within-vocabulary multiplied by the sum of the memory capacity necessary for representing an HMM and the memory capacity necessary for defining arcs. Equation 1 is a dynamic program for finding a likelihood of an optimal path in the Viterbi decoding algorithm.
$$
\begin{aligned}
&\text{1. Initialization:} && \delta_1(i) = \pi_i \cdot b_i(x_1), && 1 \le i \le N \\
&\text{2. Recursion:} && \delta_t(j) = \max_i \{\delta_{t-1}(i) \cdot a_{i,j}\} \cdot b_j(x_t), && 1 \le i, j \le N,\ 2 \le t \le T \\
&\text{3. Termination:} && P^* = \arg\max_i \{\delta_T(i)\}
\end{aligned}
\qquad \text{Equation 1}
$$
Herein, N indicates the number of states of the HMM forming the within-vocabulary, T represents the number of frames of an input feature vector, $\pi_i$ indicates the initial state distribution, $a_{i,j}$ indicates the transition probabilities, and $b_j(x_t)$ indicates the observation probability.
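The three steps of Equation 1 may be sketched as follows. This is a minimal illustration, not the decoder 105 itself: the function name `viterbi_likelihood` is hypothetical, the observation probabilities are assumed to be precomputed into a table `b[t][j]`, and plain probabilities are multiplied directly, whereas a practical decoder would typically work with log-probabilities to avoid underflow over many frames.

```python
def viterbi_likelihood(pi, a, b):
    """Follow Equation 1 and return (best final state, its delta value).

    pi[i]   : initial state distribution, length N
    a[i][j] : transition probability from state i to state j, N x N
    b[t][j] : observation probability of frame t in state j, T x N
              (assumed precomputed; frame index t is 0-based here)
    """
    n_states = len(pi)
    n_frames = len(b)

    # 1. Initialization: delta_1(i) = pi_i * b_i(x_1)
    delta = [pi[i] * b[0][i] for i in range(n_states)]

    # 2. Recursion: delta_t(j) = max_i{delta_{t-1}(i) * a_ij} * b_j(x_t)
    for t in range(1, n_frames):
        delta = [
            max(delta[i] * a[i][j] for i in range(n_states)) * b[t][j]
            for j in range(n_states)
        ]

    # 3. Termination: select the state i that maximizes delta_T(i)
    best_state = max(range(n_states), key=lambda i: delta[i])
    return best_state, delta[best_state]
```

The recursion visits every state for every frame, which is why the search cost grows with the product of N and T.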
Further, as shown in Equation 2, the amount of operations necessary for finding an optimal path by searching a search space of size N multiplied by T may be defined as C, wherein $C_r$ is the amount of computations necessary for a recursion operation of Equation 1.
$$
\begin{aligned}
C &= \left( N \cdot T - \sum_{n=1}^{N-1} n \right) \cdot C_r \\
  &\approx N \cdot T \cdot C_r
\end{aligned}
\qquad \text{Equation 2}
$$
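As a worked illustration of Equation 2, the exact and approximate operation counts may be computed as below. The function name `search_cost` and the default unit cost are assumptions made for this sketch. The subtracted sum equals N(N-1)/2, so the approximation is closest when T is large relative to N.

```python
def search_cost(n_states, n_frames, c_r=1.0):
    """Return (exact, approximate) operation counts per Equation 2.

    c_r is the cost of one recursion operation; the default of 1.0 is an
    illustrative assumption so that the result counts recursion steps.
    """
    # Sum over n = 1 .. N-1, i.e. the term subtracted in Equation 2.
    subtracted = sum(range(1, n_states))
    exact = (n_states * n_frames - subtracted) * c_r
    approximate = n_states * n_frames * c_r
    return exact, approximate
```

For example, with N = 3 and T = 10 the exact count is 27 recursion operations against an approximation of 30.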
Almost all methods relating to high-speed recognition either reduce only the amount of computations for the observation probability operations $b_j(x_t)$ in the recursion operations, or use a two-stage search method formed with a fast match and a detailed match.
The fast match of the two-stage search method, however, is not a method for reducing the search space, but a method for increasing the overall recognition speed by reducing the observation probability operations $b_j(x_t)$. Therefore, the fast match has a drawback in that the recognition speed decreases drastically as the within-vocabulary increases.