A statistical method for using an acoustic model and a language model for speech recognition is well known, and has been featured in such publications as: “A Maximum Likelihood Approach to Continuous Speech Recognition,” L. R. Bahl, et al., IEEE Trans. Vol. PAMI-5, No. 2, March, 1983; and “Word based approach to large-vocabulary continuous speech recognition for Japanese,” Nishimura, et al., Information Processing Institute Thesis, Vol. 40, No. 4, April, 1999.
According to an overview of this method, a word sequence W is voiced as a generated sentence and is processed by an acoustic processor, and a feature value X is extracted from the signal that is produced. Then, using the feature value X and the word sequence W, the assumed optimal recognition result W′ is output in accordance with the following equation. That is, the word sequence for which the product of the appearance probability P(X|W) of the feature value X, given that the word sequence W is voiced, and the appearance probability P(W) of the word sequence W is the maximum (argmax) is selected as the recognition result W′.
    W′ = argmax_W P(W|X) = argmax_W P(W)P(X|W)   [Equation 1]
where P(W) is given by a language model, and P(X|W) is given by an acoustic model.
In this equation, the acoustic model is employed to obtain the probability P(X|W), and words having a high probability are selected as proposed words for recognition. The language model is employed to provide an approximation of the probability P(W).
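The selection performed by Equation 1 can be illustrated with a minimal sketch. The candidate word sequences and all probability values below are hypothetical, and a real recognizer would search over a lattice of hypotheses rather than score a fixed list.

```python
# Minimal sketch of Equation 1: choose the candidate word sequence W that
# maximizes P(W) * P(X|W). All sequences and probabilities are hypothetical.
acoustic = {"recognize speech": 0.0004,    # P(X|W): acoustic model score
            "wreck a nice beach": 0.0007}
language = {"recognize speech": 0.0020,    # P(W): language model score
            "wreck a nice beach": 0.0001}

def decode(candidates):
    """Return W' = argmax_W P(W) * P(X|W)."""
    return max(candidates, key=lambda w: language[w] * acoustic[w])

print(decode(list(acoustic)))  # -> recognize speech
```

Here the higher language-model probability outweighs the slightly better acoustic score of the competing hypothesis, which is the role Equation 1 assigns to P(W).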
For a conventional language model, normally, the closest preceding word sequence is used as the history. An example is the N-gram model. With this method, an approximation of the appearance probability of a complete sentence, i.e., of the word sequence W, is produced by using the probabilities of the appearance of N sequential words. This method is expressed in the following established form.
    P(W) = P(w[0])P(w[1]|w[0])P(w[2]|w[0]w[1]) … P(w[n]|w[0]w[1] … w[n−1])
         = P(w[0])P(w[1]|w[0]) ∏_{i=2}^{n} P(w[i]|w[i−2]w[i−1])   [Equation 2]
In the above equation it is assumed that the appearance probability of the next word W[n] is affected only by the immediately preceding N−1 words. Various values can be used for N, but N=3 is frequently employed because of the balance it provides between effectiveness and the amount of learning data that is required; N=3 is employed in this equation, and the above method is therefore called a tri-gram or 3-gram method. Hereinafter, when the n-th word in a word sequence W consisting of n words is represented by W[n], the condition for the calculation of the appearance probability of the word W[n] is that the N−1 (i.e., two) preceding words are given, and the appearance probability for the word sequence W is calculated using P(W[n]|W[n−2]W[n−1]). In this expression, the term to the left of “|” (W[n]) represents the word to be predicted (or recognized), and the terms to the right (W[n−2]W[n−1]) represent the first and the second preceding words that establish the condition. The appearance probability P(W[n]|W[n−2]W[n−1]) is learned for each word W[n] by using text data that have previously been prepared, and is stored as part of a dictionary database. For example, for the probability that “word” will appear at the beginning of a sentence, 0.0021 is stored, and for the probability that “search” will then follow, 0.001 is stored.
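The estimation and use of the conditional probabilities P(W[n]|W[n−2]W[n−1]) described above can be sketched in a few lines. The toy corpus, the sentence-boundary padding tokens, and the maximum-likelihood estimate without smoothing are illustrative assumptions, not the actual training procedure of the prior-art systems.

```python
from collections import defaultdict

# Toy corpus (hypothetical); "<s>" pads the history at the sentence start
# and "</s>" marks the sentence end.
corpus = [["<s>", "<s>", "a", "word", "search", "</s>"],
          ["<s>", "<s>", "a", "word", "appears", "</s>"]]

tri = defaultdict(int)  # counts of (w[i-2], w[i-1], w[i])
bi = defaultdict(int)   # counts of the conditioning pair (w[i-2], w[i-1])
for sent in corpus:
    for i in range(2, len(sent)):
        tri[(sent[i - 2], sent[i - 1], sent[i])] += 1
        bi[(sent[i - 2], sent[i - 1])] += 1

def p(w, u, v):
    """P(W[n] | W[n-2] W[n-1]) by maximum likelihood (no smoothing)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def sentence_prob(words):
    """Appearance probability of a word sequence W under Equation 2."""
    words = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(words)):
        prob *= p(words[i], words[i - 2], words[i - 1])
    return prob

print(p("search", "a", "word"))                  # -> 0.5
print(sentence_prob(["a", "word", "search"]))    # -> 0.5
```

A deployed system would store such probabilities in the dictionary database and apply smoothing so that unseen trigrams do not receive zero probability.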
The tri-gram model will now be described by using a simple phrase. This phrase is “sara-ni sho-senkyoku no (further, small electoral districts)” and is used to predict the following word “donyu (are introduced)”. FIG. 8A is a diagram showing the state before the prediction is performed, and FIG. 8B is a diagram showing the state after the prediction is performed. As is shown in FIG. 8A, the phrase consists of five words, “sara-ni”, “sho”, “senkyo”, “ku” and “no”, while the predicted word is represented by “?”, and the arrows in FIGS. 8A and 8B delineate the modification relationships between the words. As previously described, in the tri-gram model the two immediately preceding words are always employed to predict the following word. Therefore, in this example, “donyu” is predicted from “ku” and “no”, the words enclosed by solid lines in FIG. 8A.
However, depending on the sentence structure, the tri-gram method, which employs the two immediately preceding words to predict the following word, is not the most appropriate. For example, it is not appropriate for the case illustrated in FIG. 9, wherein the phrase “nani-ga ima seiji-no saisei-no tame-ni (at present, for reconstruction of the politics, what)” is used to predict a word. According to the tri-gram method, as is shown in FIG. 9A, “tame” and “ni” are employed to predict “hitsuyo (is required)”. But in addition to these words, other structurally related words, such as “nani” and “ima”, must be taken into account in order to increase the accuracy of the prediction.
Chelba and Jelinek proposed a model that employs the head words of the two immediately preceding partial analysis trees to predict the succeeding word. According to the Chelba & Jelinek model, the words are predicted in order, as they appear. Therefore, when the i-th word is to be predicted, the words up to the (i−1)th word and the structure covering them are established. In this state, the head words of the two immediately preceding partial analysis trees are first employed to predict, in the named order, the following word and its part of speech. At this time, the modification relationship between the head words of the two immediately preceding partial analysis trees and the predicted word is not taken into account. After the word is predicted, the sentence structure that includes the word is updated. Therefore, the accuracy of the prediction can be improved compared with the tri-gram method, which employs the two immediately preceding words to predict the following word. However, in the model proposed by Chelba and Jelinek, a word is predicted by referring to the head words of the two immediately preceding partial analysis trees regardless of how the words modify one another, so that, depending on the sentence structure, the accuracy of the prediction may be reduced. This will be explained by referring to the phrase “sara-ni sho-senkyoku no”, used above for the tri-gram model.
As is shown in FIGS. 10A to 10C, the phrase “sara-ni sho-senkyoku no” is constituted by two partial analysis trees, and the head words of the trees are “sara-ni” and “no”, which are enclosed by solid lines in FIG. 10A. Therefore, according to the method proposed by Chelba and Jelinek, “sara-ni” and “no”, the two immediately preceding head words shown in FIG. 10B, are employed to predict the next word “donyu”. When “donyu” is predicted, as is shown in FIG. 10C, the sentence structure including “donyu” is predicted. In the prediction of the structure, the modification of words, as indicated by the arrows, is included. Since “sara-ni” does not modify “donyu”, it is not only useless for the prediction of the word “donyu”, but also may tend to degrade the prediction accuracy.
For the phrase “nani-ga ima seiji-no saisei-no tame-ni” in FIG. 11, the following prediction process is performed. This phrase is constituted by three partial analysis trees, “nani-ga”, “ima” and “seiji-no saisei-no tame-ni”, and the head words of the trees are “ga”, “ima” and “ni”. As indicated by the solid line enclosures in FIG. 11A, the two immediately preceding head words are “ima” and “ni”. Therefore, as is shown in FIG. 11B, “hitsuyo” is predicted by using “ima” and “ni”. And after “hitsuyo” is predicted, the sentence structure that includes “hitsuyo” is predicted, as is shown in FIG. 11C.
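The prediction step of the Chelba & Jelinek model described above can be sketched as follows. The tree representation, the conditional probability table, and all values in it are hypothetical illustrations; the actual model also predicts the part of speech and updates the tree structure after each word.

```python
# Sketch of head-word-based prediction: the next word is predicted from the
# head words of the two immediately preceding partial analysis trees, with no
# regard to whether those heads actually modify the predicted word.
# Each tree is (list of its words, its head word); values are hypothetical.
trees = [(["sara-ni"], "sara-ni"),
         (["sho", "senkyo", "ku", "no"], "no")]

# Hypothetical table P(next word | head1, head2).
p_next = {("sara-ni", "no"): {"donyu": 0.6, "haishi": 0.4}}

def predict(trees):
    """Pick the most probable next word given the two preceding head words."""
    h1, h2 = trees[-2][1], trees[-1][1]
    dist = p_next[(h1, h2)]
    return max(dist, key=dist.get)

print(predict(trees))  # -> donyu
```

Note that “sara-ni” enters the conditioning context purely by position, which is exactly the weakness discussed in connection with FIG. 10: a head word that does not modify the predicted word is still consulted.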
To predict a word, the modifications performed by words provide useful information. However, the fact that “nani-ga” is a modifier of the predicted word is not taken into account. As is described above, according to the method proposed by Chelba and Jelinek, no consideration is given to modification information that frequently occurs and that is useful for prediction.
A need therefore exists for a word prediction method, and an apparatus employing such a method, that provide improved prediction accuracy, as well as for a speech recognition method and an apparatus therefor.