Large vocabulary continuous speech recognition is realized by using an acoustic model and a language model. The acoustic model is used to calculate a score (referred to as an “acoustic score,” hereinafter) representing the degree to which a speech sound acoustically resembles the utterance of a word. The language model is used to calculate a score (referred to as a “language score,” hereinafter) representing the degree to which words are easily linked to each other. Ideally, the ratio of the acoustic score to the language score is 1:1.
However, as described on page 93 of Non-Patent Document 1, the value obtained by the acoustic model is an approximation based on a probability density distribution such as a normal distribution. Moreover, as described on page 192 of Non-Patent Document 1, the language model is approximated by an N-gram conditioned on the preceding N−1 words.
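The N-gram approximation can be illustrated with a minimal sketch for N = 2 (a bigram), in which each word is conditioned only on the one preceding word. The toy corpus, the add-one smoothing, and the names `train_bigram` and `language_score` are assumptions made for illustration; they are not taken from Non-Patent Document 1.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their contexts from tokenized sentences (toy data)."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"</s>"}
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        vocab.update(words)
        contexts.update(padded[:-1])                   # every word that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))   # (previous word, current word)
    return contexts, bigrams, vocab

def language_score(words, model):
    """Log probability of a word sequence under the bigram approximation."""
    contexts, bigrams, vocab = model
    padded = ["<s>"] + list(words) + ["</s>"]
    score = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        # Add-one smoothing (an assumption) keeps unseen bigrams from scoring log(0).
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        score += math.log(prob)
    return score

model = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen = language_score(["the", "cat", "sat"], model)
shuffled = language_score(["sat", "cat", "the"], model)
print(seen > shuffled)  # word order seen in training scores higher
```

Because the score depends only on adjacent word pairs, it is an approximation of the true sequence probability, which is the point made in the paragraph above.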
Thus, both the acoustic model and the language model are approximated models. Each score is therefore multiplied by a weighting factor to make the bias of the acoustic score consistent with that of the language score. Here, bias means the phenomenon in which an approximated value becomes larger than the original value. In the field of speech recognition, several candidate values are prepared in advance as weighting factors, and one is selected while the recognition rate on test speech data is observed. This method is considered to pose no problem when there is only one pair of an acoustic model and a language model. However, if there are a plurality of such pairs, or if new scores are combined, the number of parameter combinations that must be prepared increases exponentially, and the calculation is therefore considered infeasible.
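The selection of weighting factors from prepared values can be sketched as an exhaustive grid search. This is a minimal illustration under stated assumptions: the hypotheses, log scores, grid values, and the name `grid_search_weights` are all hypothetical. The counter `tried` shows why the approach scales poorly: a grid of G values over K scores requires G^K evaluations.

```python
import itertools

def grid_search_weights(utterances, grid, num_scores):
    """Try every combination of prepared weights; keep the one with most correct results.

    utterances: list of (hypotheses, reference), where hypotheses is a list of
    (word_sequence, scores) and scores holds one log score per model.
    """
    best_weights, best_correct, tried = None, -1, 0
    for weights in itertools.product(grid, repeat=num_scores):
        tried += 1
        correct = 0
        for hypotheses, reference in utterances:
            # Pick the hypothesis with the highest weighted sum of scores.
            top = max(hypotheses,
                      key=lambda h: sum(w * s for w, s in zip(weights, h[1])))
            correct += (top[0] == reference)
        if correct > best_correct:
            best_weights, best_correct = weights, correct
    return best_weights, best_correct, tried

# Toy test data: (acoustic log score, language log score) per hypothesis.
utterances = [
    ([("recognize speech", (-10.0, -4.0)), ("wreck a nice beach", (-8.0, -5.1))],
     "recognize speech"),
    ([("I scream", (-5.0, -6.0)), ("ice cream", (-7.0, -5.5))],
     "I scream"),
]
grid = [0.5, 1.0, 1.5, 2.0]
best_weights, best_correct, tried = grid_search_weights(utterances, grid, 2)
print(best_weights, best_correct, tried)
print(len(grid) ** 3, len(grid) ** 6)  # combinations needed for 3 and 6 scores
```

With two scores the search is trivial, but each additional model pair or score multiplies the number of combinations by the grid size.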
To address such problems, as described in Non-Patent Document 2, a method is widely known in the field of statistical machine translation in which the weighting factors applied to scores obtained from different probabilistic models are adjusted by the maximum entropy method (referred to as the “ME method,” hereinafter).
As described on pages 155 to 174 of Non-Patent Document 3, the ME method maximizes entropy under constraint conditions, and is a learning scheme that estimates a distribution function that is uniform with respect to unknown data. It is known that, under this scheme, if maximum likelihood estimation is used as a constraint condition, the estimated distribution function is a logistic function as shown in the following equation (1):
[Math. 1]
$$P(w \mid o) = \frac{\exp\left\{\sum_{k} \lambda_k f_k(w, o)\right\}}{\sum_{w} \exp\left\{\sum_{k} \lambda_k f_k(w, o)\right\}} \tag{1}$$
where k is a natural number that indexes the models, and w and o are an output sequence and an input sequence, respectively. In the case of Non-Patent Document 2, w and o are a sequence of English words and a sequence of French words, respectively. fk(w, o) is the score calculated by the k-th model. In the case of Non-Patent Document 2, f1(w, o) is the logarithm of the generation probability that an English word appears from a French word, and f2(w, o) is the logarithm of the probability that a sequence of English words appears. λk represents the weighting factor of the score calculated by the k-th probabilistic model, and is optimized so that the posterior probability P(w|o) takes its largest value for the combination of o and the correct-answer w.
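Equation (1) and the optimization of λk can be sketched as follows, assuming that the output sequences w are restricted to a small, finite candidate list so that the denominator can be summed directly. The candidate scores, the learning rate, the step count, the plain gradient-ascent update, and the names `posterior` and `train_weights` are assumptions made for illustration, not the procedure of Non-Patent Document 2.

```python
import math

def posterior(weights, candidates):
    """Equation (1) over a finite candidate set.

    candidates: dict mapping an output sequence w to its scores [f_1(w, o), ...].
    """
    logits = {w: sum(lam * f for lam, f in zip(weights, feats))
              for w, feats in candidates.items()}
    peak = max(logits.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - peak) for v in logits.values())
    return {w: math.exp(v - peak) / norm for w, v in logits.items()}

def train_weights(data, num_scores, lr=0.05, steps=200):
    """Gradient ascent on the log posterior of the correct-answer sequences."""
    weights = [1.0] * num_scores
    for _ in range(steps):
        grad = [0.0] * num_scores
        for candidates, correct in data:
            post = posterior(weights, candidates)
            for k in range(num_scores):
                expected = sum(post[w] * candidates[w][k] for w in candidates)
                # d/d(lambda_k) log P(correct | o) = f_k(correct) - E[f_k]
                grad[k] += candidates[correct][k] - expected
        weights = [lam + lr * g for lam, g in zip(weights, grad)]
    return weights

def log_likelihood(weights, data):
    return sum(math.log(posterior(weights, c)[correct]) for c, correct in data)

# Toy data: two inputs o, each with two candidate outputs w and two log scores.
data = [
    ({"recognize speech": [-10.0, -4.0], "wreck a nice beach": [-8.0, -5.1]},
     "recognize speech"),
    ({"I scream": [-5.0, -6.0], "ice cream": [-7.0, -5.5]},
     "I scream"),
]
initial = [1.0, 1.0]
trained = train_weights(data, num_scores=2)
print(log_likelihood(initial, data), log_likelihood(trained, data))
```

The gradient is the difference between the score of the correct answer and its expectation under the current posterior, so each step raises P(w|o) for the correct-answer pairs.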
Here, the denominator of equation (1) means that the sum is taken over all combinations of the output sequence w. However, as the number of elements constituting the output sequence w increases (in the case of Non-Patent Document 2, the number of different English words), the number of combinations increases, making it impossible to calculate the denominator of equation (1). In the field of statistical machine translation, as in Non-Patent Document 2, several approaches are taken to address this problem, including the following: information about words that do not appear consecutively is used as prior knowledge to narrow the number of combinations of word sequences down to a finite number.

Non-Patent Document 1: S. Young and 10 others, “The HTK Book for HTK version 3.3,” Cambridge University Engineering Department, April 2005, pp. 1-345
Non-Patent Document 2: F. J. Och and one other, “Discriminative Training and Maximum Entropy Models for Statistical Machine Translation,” Proc. ACL, July 2002, pp. 295-302
Non-Patent Document 3: Kita, “Language model and calculation 4: Probabilistic language model,” University of Tokyo Press, 1999
Non-Patent Document 4: Lafferty and two others, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” Proc. ICML, 2001, pp. 282-289