Speech recognition (voice recognition) is a computer technology for converting an acoustic signal, e.g., a voice signal obtained through a microphone or a telephone, into corresponding text, e.g., a word, a set of words or a sentence. Among the variety of speech recognition algorithms, the most widely used is HMM (Hidden Markov Model) based speech recognition. HMM based speech recognition is a stochastic speech recognition algorithm comprising two independent processes, i.e., a training process and a recognition process.
In the training process, acoustic features of a target word are stochastically modeled. In the recognition process, similarities between an input speech signal and the trained models are measured to output, as a recognition result, a word corresponding to the model having the maximum similarity or a word corresponding to a state sequence forming a model.
FIG. 1 illustrates a block diagram of a conventional HMM based speech recognition system 100. The speech recognition system 100 may include a Viterbi decoder 110, a word model management unit 120, an acoustic model unit 130 and a dictionary unit 140.
The acoustic model unit 130 manages trained and mathematically modeled phoneme models which are basic units in speech recognition.
The dictionary unit 140 provides phonetic sequences for recognition target words.
The word model management unit 120 manages, based on the phoneme models, word models corresponding to the recognition target words. The word models are configured with reference to the phonetic sequences of the recognition target words provided by the dictionary unit 140.
The Viterbi decoder 110 measures similarities between an observation vector sequence and the word models managed by the word model management unit 120 and outputs, as a recognition result, the word having the maximum similarity. Here, the Viterbi decoder 110 measures the similarity between a speech signal and a recognition model (trained model) by using the Viterbi algorithm.
The Viterbi algorithm presents a dynamic programming solution for finding the most likely path. A partial maximum likelihood $\delta_t(j)$ of a state j at a time t is recursively calculated using Equation 1:

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \cdot a_{ij} \right] \cdot b_j(o_t), \qquad \text{(Equation 1)}$$

wherein $a_{ij}$ is a transition probability to the state j from a state i, and $b_j(o_t)$ is an observation probability in the state j to output an observation vector $o_t$ at the time t.
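The recursion of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the decoder of FIG. 1: the initial state probabilities `init`, the function name and the list-based layout of the inputs are assumptions made for the example, and a practical decoder would normally work with log probabilities to avoid numerical underflow.

```python
def viterbi_likelihood(init, trans, obs_prob):
    """Return max_j delta_T(j), the best-path likelihood for one model.

    init[j]        : initial probability of state j (assumed; not part of Equation 1)
    trans[i][j]    : transition probability a_ij from state i to state j
    obs_prob[t][j] : observation probability b_j(o_t) of frame t in state j
    """
    n_states = len(init)
    # Initialization: delta_1(j) = init[j] * b_j(o_1)
    delta = [init[j] * obs_prob[0][j] for j in range(n_states)]
    # Recursion of Equation 1 for the remaining frames
    for t in range(1, len(obs_prob)):
        delta = [
            max(delta[i] * trans[i][j] for i in range(n_states)) * obs_prob[t][j]
            for j in range(n_states)
        ]
    return max(delta)


# Illustrative two-state left-to-right model and three frames (made-up values).
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
obs_prob = [[0.8, 0.1], [0.4, 0.6], [0.1, 0.9]]
score = viterbi_likelihood(init, trans, obs_prob)  # approximately 0.216
```

A recognizer would evaluate such a score for every word model and report the word whose model scores highest.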
For a speech signal including impulse noises, observation probabilities for observation vectors including the noises are in general much lower than those for noise-free observation vectors, which results in dispersion of the partial maximum likelihoods and an increase in erroneous recognition results. In order to obtain stable recognition results from a speech signal including impulse noises, modified Viterbi algorithms have been proposed. The partial maximum likelihood $\delta_t(j)$ according to the modified Viterbi algorithms is calculated by using Equation 2:

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \cdot a_{ij} \right] \cdot f_j(t), \qquad \text{(Equation 2)}$$

wherein $f_j(t)$ is a function of the observation probability $b_j(o_t)$.
Among the modified Viterbi algorithms, the most widely used one is the weighted Viterbi algorithm. The function $f_j(t)$ of the weighted Viterbi algorithm is as in Equation 3:

$$f_j(t) = b_j(o_t)^{\gamma_t}, \qquad \text{(Equation 3)}$$

wherein a weight $\gamma_t$ represents the reliability of the observation vector $o_t$. The weight $\gamma_t$ is in a range from 0 to 1.0 and increases in proportion to the observation probability $b_j(o_t)$, thus minimizing the erroneous recognition results due to the noises. In general, the reliability is measured using an SNR (Signal-to-Noise Ratio) of the speech period to which the corresponding observation vector belongs.
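As a rough illustration of Equation 3, the following sketch reads the weight as an exponent applied to the observation probability. The mapping from SNR to weight is not specified in the text above, so `snr_to_weight`, its linear shape and its 0 dB and 20 dB limits are purely hypothetical placeholders.

```python
def snr_to_weight(snr_db, low_db=0.0, high_db=20.0):
    """Hypothetical linear mapping from frame SNR (dB) to a weight in [0, 1].

    The limits low_db and high_db are placeholder values, not taken from the text.
    """
    w = (snr_db - low_db) / (high_db - low_db)
    return min(1.0, max(0.0, w))


def weighted_observation(b, gamma):
    """Equation 3: f_j(t) = b_j(o_t) ** gamma_t."""
    return b ** gamma


# A reliable frame (gamma = 1) is left unchanged, while an unreliable frame
# has its very small probability pulled toward 1, limiting its influence.
clean = weighted_observation(0.3, snr_to_weight(30.0))   # gamma = 1.0
noisy = weighted_observation(1e-6, snr_to_weight(10.0))  # gamma = 0.5
```

With a weight of 0 the frame contributes a factor of 1 and is effectively ignored, so the weight acts as a continuous control between trusting and discarding a frame.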
An alternative form of the function $f_j(t)$ is as in Equation 4:

$$f_j(t) = \begin{cases} b_j(o_t) & \text{if } b_j(o_t) \ge T_l \\ T_l & \text{otherwise,} \end{cases} \qquad \text{(Equation 4)}$$

wherein $T_l$ is a threshold. If the observation probability $b_j(o_t)$ is less than the threshold $T_l$, the observation probability $b_j(o_t)$ is replaced with the threshold $T_l$, thereby preventing an excessive decrease in the observation probability $b_j(o_t)$.
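The effect of the threshold in Equation 4 can be seen in a small sketch. The function names, the frame values and the threshold of 0.01 are assumptions chosen for illustration, and transition probabilities are left out so that only the flooring is visible.

```python
def floor_observation(b, threshold):
    """Equation 4: keep b_j(o_t) unless it falls below the threshold T_l."""
    return b if b >= threshold else threshold


def path_likelihood(obs_probs, threshold=None):
    """Product of per-frame observation terms along one fixed state path.

    Transition probabilities are omitted to isolate the effect of flooring.
    """
    likelihood = 1.0
    for b in obs_probs:
        likelihood *= b if threshold is None else floor_observation(b, threshold)
    return likelihood


# Illustrative values only: the third frame is hit by an impulse noise.
frames = [0.5, 0.4, 1e-9, 0.5]
unfloored = path_likelihood(frames)                  # dominated by the noisy frame
floored = path_likelihood(frames, threshold=0.01)    # noisy frame contributes T_l
```

Without the floor, a single corrupted frame lowers the path likelihood by several orders of magnitude. With the floor, its contribution is bounded by the threshold.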
The above-described Viterbi algorithms, which are based on the observation independence assumption, ensure relatively stable recognition performance even when a speech signal includes noises. However, because consecutive frames in a speech signal are closely correlated and this assumption ignores that correlation, further improvement in recognition performance cannot be achieved with these Viterbi algorithms.