Research in speech recognition has yielded long-established statistical methods in which speech recognition is performed using acoustic models and linguistic models (see, for example, Patent Literature 1 through 7). In speech recognition, the problem has been formulated as finding the word string W having the largest posterior probability P(W|X) with respect to the feature value X of the inputted speech, and this is rewritten as shown in Equation (1) below.
$$
W' = \arg\max P(W \mid X) = \arg\max P_T(W, X) = \arg\max P_L(W)\,P_A(X \mid W) \qquad (1)
$$

In other words, in speech recognition, the word series W′ having the largest product (arg max) of the probability of occurrence PA (X|W) of the feature quantity X when the word string W is uttered and the probability of occurrence PL (W) of the word string W itself is selected as the recognition result.
Here, an acoustic model is used to determine the former probability, PA (X|W), and words with a high probability are selected as recognition candidates. Recognition using an acoustic model is performed by matching the vocabulary to be recognized against a dictionary that defines its pronunciation. A linguistic model, more specifically an N-gram model, is used to approximate the latter probability, PL (W). In this method, the probability of the entire statement, that is, the word string W, is approximated from the probabilities of occurrence of sets of N consecutive words.
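For reference, the N-gram approximation described above is commonly written as shown below. This is a sketch of the standard form, not a formula taken from the source: the word symbols w_1, ..., w_n, the sentence length n, and the index i are notation assumed here for illustration.

```latex
P_L(W) = P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
```

Under this form, each word is conditioned only on the N-1 words that precede it, so the model can be estimated from counts of N consecutive words in the training text.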
Recently, web-based text has come to be used in the training of dictionaries and linguistic models. Such text includes innumerable words with the same pronunciation but different notations (referred to below as “homophones with different notations”). In the linguistic models of ordinary speech recognition systems, a linguistic probability is assigned to each of the notations. When there is a plurality of notations for a semantically equivalent word, the linguistic probability of the word is divided among the notations. As mentioned above, in ordinary speech recognition, the string of notations with the largest product of linguistic probability and acoustic probability is outputted. However, because the linguistic probability of the word is divided among the notations, the linguistic probability of each notation is lower than that of the word as a whole, and the word with the correct pronunciation often cannot be selected when the recognition result is chosen by maximizing the probability product in the manner described above.
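The effect described above can be illustrated with a small sketch. All names and numbers below are hypothetical toy values chosen for illustration, not data from the source: one word is assumed to have three notations sharing one pronunciation, and a competing word is assumed to have a single notation. Summing the scores per pronunciation is done here only to make the division of probability visible; it is not presented as the method of the source.

```python
from collections import defaultdict

# Hypothetical toy candidates: (notation, pronunciation, P_L, P_A).
# The word pronounced "pron_a" has its linguistic probability split
# across three notations; the competitor "pron_b" has one notation.
candidates = [
    ("notation_a1", "pron_a", 0.10, 0.6),
    ("notation_a2", "pron_a", 0.10, 0.6),
    ("notation_a3", "pron_a", 0.10, 0.6),
    ("notation_b", "pron_b", 0.20, 0.4),
]


def best_notation(cands):
    """Return the notation with the largest product P_L * P_A."""
    return max(cands, key=lambda c: c[2] * c[3])[0]


def score_by_pronunciation(cands):
    """Sum P_L * P_A over all notations sharing a pronunciation."""
    totals = defaultdict(float)
    for _notation, pron, p_l, p_a in cands:
        totals[pron] += p_l * p_a
    return dict(totals)


if __name__ == "__main__":
    print("best single notation:", best_notation(candidates))
    print("scores per pronunciation:", score_by_pronunciation(candidates))
```

With these assumed values, each notation of the first word scores lower than the single notation of the competitor, even though the first word has the better acoustic match and the larger combined score across its notations.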
If phoneme coding is used so that speech recognition takes phonemes rather than words as the unit of recognition, recognition results emphasizing the pronunciation can be obtained. However, the recognition accuracy is generally lower because the linguistic models used in phoneme coding are poor.