More and more devices and services now use speech input/output as a human-computer interface. For instance, speech input/output is used for operating a portable telephone. Because speech recognition is the basis of speech input/output, the recognition accuracy of speech recognition devices must be as high as possible.
A common technique of speech recognition uses models obtained by statistical machine learning. For example, an HMM (Hidden Markov Model) is used as an acoustic model. In addition, a word pronunciation dictionary is used to calculate the probability that a character sequence generated in the process of speech recognition is obtained from a state sequence of the HMM, and a language model is used to calculate the probability that a word sequence appears in a certain language.
For performing such a process, a conventional speech recognition device includes: a framing unit for dividing the speech signals into frames; a feature generating unit for calculating features such as mel frequency cepstrum coefficients from each frame and forming a sequence of multi-dimensional feature vectors; and a decoder responsive to the sequence of feature vectors for outputting the word sequence having the highest likelihood of providing the sequence of feature vectors, utilizing the acoustic model and the language model. In calculating the likelihood, state transition probability and output probability from each state of HMM forming the acoustic model play important roles. These are both obtained through machine learning. The output probability is calculated by a pre-trained Gaussian mixture model.
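The framing step described above can be illustrated with a minimal sketch. The function name, the frame length, and the frame shift are hypothetical choices made for illustration, and dropping trailing samples that do not fill a whole frame is likewise an assumption; the text does not specify any of these.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Split a sample sequence into overlapping fixed-length frames.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Toy signal of ten samples; real systems use frames of many more samples.
samples = list(range(10))
frames = split_into_frames(samples, frame_length=4, frame_shift=2)
```

Each frame would then be passed to the feature generating unit, which produces one feature vector per frame.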
The basic concept of speech recognition in a conventional speech recognition device will be described with reference to FIG. 1. Conventionally, it is assumed that a word sequence 30 (word sequence W) is influenced by various noises and observed as an observed sequence 36 (observed sequence X), and the word sequence that is expected to have the highest likelihood of generating the finally observed sequence X is output as the result of speech recognition. Let P(W) represent the probability of a word sequence W being generated. Further, let P(S|W) represent the probability of a state sequence S (state sequence 34) of HMM being generated from the word sequence W through a phoneme sequence 32 as an intermediate product. Further, let P(X|S) represent the probability of the observed sequence X being obtained from the state sequence S.
In the process of speech recognition, as shown by the first equation of FIG. 2, when an observed sequence X1:T from the start to a time point T is given, a word sequence that has the highest likelihood of generating such an observed sequence is output as a result of speech recognition. Specifically, the word sequence ˜W as the result of speech recognition is calculated by the equation below. The sign “˜” appearing above a character in the equation is depicted immediately preceding the corresponding character in the texts of this Specification.
$$\tilde{W} = \arg\max_{W} P(W \mid X_{1:T}). \tag{1}$$
By modifying the right side of this equation in accordance with Bayes' theorem, we obtain
$$\tilde{W} = \arg\max_{W} \frac{P(X_{1:T} \mid W)\, P(W)}{P(X_{1:T})}. \tag{2}$$
Further, the first term of the numerator can be calculated by HMM as
$$P(X_{1:T} \mid W) \cong P(X_{1:T} \mid S_{1:T})\, P(S_{1:T} \mid W). \tag{3}$$
Here, the state sequence S1:T represents a state sequence S1, . . . , ST of HMM. The first term of the right side of Equation (3) represents output probability of HMM. From Equations (1) to (3), the word sequence ˜W as the result of speech recognition can be given by
$$\tilde{W} = \arg\max_{W} \frac{P(X_{1:T} \mid S_{1:T})\, P(S_{1:T} \mid W)\, P(W)}{P(X_{1:T})}. \tag{4}$$
In HMM, an observed value xt at time point t depends only on the state st. Therefore, the output probability P(X1:T|S1:T) of HMM in Equation (4) can be calculated by the equation below.
$$P(X_{1:T} \mid S_{1:T}) = \prod_{t=1}^{T} P(X_t \mid S_t). \tag{5}$$
The probability P(xt|st) is calculated by Gaussian Mixture Model (GMM).
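As a minimal sketch of Equation (5), the product over frames can be evaluated as a sum of log probabilities, which avoids numerical underflow for long utterances. The diagonal-covariance form of the GMM, the function names, and all parameter values below are assumptions made for illustration; the text states only that a GMM is used.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, components):
    """log P(x | s) for a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def log_output_probability(observations, states, gmms):
    """log P(X_{1:T} | S_{1:T}) as the sum over t of log P(x_t | s_t)."""
    if len(observations) != len(states):
        raise ValueError("observations and states must have the same length")
    return sum(log_gmm(x, gmms[s]) for x, s in zip(observations, states))


# Hypothetical one-dimensional GMMs for two HMM states.
gmms = {
    0: [(1.0, [0.0], [1.0])],
    1: [(0.5, [1.0], [1.0]), (0.5, [-1.0], [1.0])],
}
observations = [[0.0], [0.0], [1.0]]
states = [0, 0, 1]
total_log_prob = log_output_probability(observations, states, gmms)
```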
Among the other terms of Equation (4), P(S1:T|W) is calculated as a product of the state transition probability of HMM and the pronunciation probability of a word, and P(W) is calculated by the language model. The denominator P(X1:T) is common to all hypotheses and, therefore, it can be ignored when the arg max operation is executed.
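The arg max of Equation (4) can be sketched as follows, assuming the three log probabilities of each hypothesis have already been computed. The dictionary keys, the function name, and the score values are hypothetical and chosen only for illustration. The denominator P(X1:T) is omitted because it is the same for every hypothesis.

```python
def best_hypothesis(hypotheses):
    """Return the word sequence that maximizes the numerator of Equation (4).

    Scores are summed in the log domain; P(X_{1:T}) is left out because it
    does not change which hypothesis attains the maximum.
    """
    def score(h):
        return h["log_p_x_given_s"] + h["log_p_s_given_w"] + h["log_p_w"]

    return max(hypotheses, key=score)["words"]


# Hypothetical log scores for two competing hypotheses.
hypotheses = [
    {
        "words": ["recognize", "speech"],
        "log_p_x_given_s": -120.0,  # acoustic model (output probability)
        "log_p_s_given_w": -8.0,    # transitions and pronunciation
        "log_p_w": -6.5,            # language model
    },
    {
        "words": ["wreck", "a", "nice", "beach"],
        "log_p_x_given_s": -119.0,
        "log_p_s_given_w": -9.5,
        "log_p_w": -11.0,
    },
]
result = best_hypothesis(hypotheses)
```

In this toy example the second hypothesis has a slightly better acoustic score, but the first wins once the pronunciation and language model terms are included.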
Recently, a framework called the DNN-HMM hybrid has been studied, in which the output probability of HMM is calculated by a Deep Neural Network (DNN) in place of GMM. The DNN-HMM hybrid method is attracting attention because it attains higher accuracy than an acoustic model using GMM. A DNN output, however, originally represents the posterior probability P(St|Xt) and, therefore, it does not fit into the conventional HMM framework, which employs the output probability P(Xt|St). As a solution to this problem, Bayes' theorem is applied to the posterior probability P(St|Xt) output from the DNN to convert it into the form of the output probability P(Xt|St).
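One way to carry out this conversion is sketched below, under assumptions that the text does not spell out. By Bayes' theorem, P(Xt|St) = P(St|Xt)P(Xt)/P(St). Since P(Xt) does not depend on the state, it can be dropped when hypotheses are compared, leaving the posterior divided by the state prior P(St). The function name and the numeric values are hypothetical, and the state priors are assumed to be available (for example, estimated from training data).

```python
import math


def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert log P(s | x_t) into log P(x_t | s) up to a state-independent constant.

    Implements log P(s | x_t) - log P(s); the omitted term log P(x_t) is the
    same for every state and so does not affect the arg max.
    """
    if len(log_posteriors) != len(log_priors):
        raise ValueError("posteriors and priors must cover the same states")
    return [lp - lq for lp, lq in zip(log_posteriors, log_priors)]


# Hypothetical DNN posteriors for one frame and hypothetical state priors.
posteriors = [0.7, 0.2, 0.1]
priors = [0.5, 0.3, 0.2]
scaled = scaled_log_likelihoods(
    [math.log(p) for p in posteriors],
    [math.log(q) for q in priors],
)
```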