A high-performance speech recognition apparatus, such as a large vocabulary continuous speech recognition apparatus, calculates the acoustic similarity and the language similarity between an unknown input speech and various hypotheses (recognition candidates) predicted from three sources of knowledge, namely an acoustic model, a word dictionary, and a language model, as an acoustic model score and a language model score, and outputs the most probable hypothesis as the recognition result. Further, in order to limit the number of hypotheses held in the apparatus and thereby reduce the amount of calculation and the memory capacity, the acoustic model score and the language model score at each time are comprehensively evaluated. A hypothesis having a poor score is then pruned as being less probable, which prevents the hypotheses that would follow it from being expanded. This method is called the frame synchronous beam search method (hereinafter simply referred to as the beam search method).
One example of the speech recognition apparatus is shown in FIG. 6. In FIG. 6, speech waveforms that are speech recognition targets are input to a speech input means 301 and are transmitted to an acoustic analysis means 302. The acoustic analysis means 302 calculates an acoustic feature amount for each frame and outputs the acoustic feature amount to a distance calculation means 303. The distance calculation means 303 calculates the distance between the input acoustic feature amount and each model in an acoustic model 304, and outputs an acoustic model score according to the distance to a searching means 305. For all the hypotheses that are to be searched, the searching means 305 obtains an accumulated score by adding the acoustic model score and a language model score that is based on a language model 402 and is obtained from a language model score look-ahead value imparting device 308, and prunes the hypotheses having a poor accumulated score. The remaining hypotheses are processed, and the optimal recognition result is output from a recognition result output means 309.
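The per-frame accumulation and pruning performed by the searching means 305 can be illustrated with a minimal sketch. It assumes that all scores are costs (a smaller value is better) and that the beam width is a cost margin relative to the best hypothesis; the function name `step`, the dictionary layout, and all numeric values are hypothetical and do not appear in FIG. 6.

```python
def step(hypotheses, acoustic_cost, lm_cost, beam_width):
    """Advance every hypothesis by one frame, then prune.

    hypotheses:    dict mapping a hypothesis label to its accumulated cost
    acoustic_cost: dict mapping a label to this frame's acoustic model cost
    lm_cost:       dict mapping a label to the language model cost added
                   at this frame (missing labels add nothing)
    beam_width:    allowed cost margin above the best hypothesis (assumed)
    """
    if not hypotheses:
        return {}
    # Accumulated score = previous score + acoustic score + language score.
    updated = {
        label: acc + acoustic_cost[label] + lm_cost.get(label, 0.0)
        for label, acc in hypotheses.items()
    }
    # Beam pruning: drop hypotheses whose cost exceeds best + beam_width.
    best = min(updated.values())
    return {label: c for label, c in updated.items() if c <= best + beam_width}


# Hypothetical values chosen only for illustration.
hyps = {"akai": 100.0, "akusyu": 100.0, "sakana": 100.0}
acoustic = {"akai": 10.0, "akusyu": 12.0, "sakana": 60.0}
lm = {"akusyu": 30.0}
survivors = step(hyps, acoustic, lm, beam_width=40.0)
```

In this sketch the third hypothesis falls outside the margin and is not carried into the next frame, so none of its successors are ever expanded.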
One example of a part of a word dictionary 403 is shown in FIG. 7. The word dictionary 403 in this example is a tree structure dictionary. Further, in FIG. 7, the language model score given to each word by the language model 402 is also shown. For example, the word “handshake” (Japanese pronunciation: “akusyu”) has a phoneme string of “a-k-u-sy-u”, and its language model score is 80. Further, the word “red” (Japanese pronunciation: “akai”) has a phoneme string of “a-k-a-i”, and its language model score is 50. In this example, a smaller language model score indicates a better score.
When such a tree structure dictionary is used, the root part of the tree structure is connected to the previous hypothesis in an inter-word transition. However, since the connected word cannot be specified at this time, the language model score cannot be added to the accumulated score. If the language model score is added to the accumulated score only when the hypothesis reaches a word end terminal, the scores vary greatly among the hypotheses around the inter-word transition. Accordingly, the beam width needs to be made large so that the correct answer hypothesis is not pruned even when its score varies greatly, which inhibits efficient beam search.
In order to add the language model score as early as possible, the language model score look-ahead value imparting device 308 includes an optimal language model score acquisition means 401 that acquires the optimal value of the language model score of the word corresponding to each branch of the tree structure dictionary as the optimistic language model score in the branch.
More specifically, using the word dictionary 403 and the language model 402, the optimal language model score acquisition means 401 acquires, as the language model score look-ahead value $\pi_h(s)$ of the hypothesis of the phoneme $s$ having the word history $h$, the optimal value of the language model score $-\log p(w \mid h)$ over the words $w$ that belong to the set $W(s)$ of words that can be traced from the phoneme $s$ in the dictionary, as shown in expression (1). When the hypothesis transits to the phoneme $s$ in the search process by the searching means 305, the difference value $\delta_h(s)$ between the language model score look-ahead value of the current phoneme $s$ and that of the previous phoneme $\tilde{s}$, shown in expression (2), is added to the accumulated score of the hypothesis.

$$\pi_h(s) = \min_{w \in W(s)} \{ -\log p(w \mid h) \} \tag{1}$$

$$\delta_h(s) = \pi_h(s) - \pi_h(\tilde{s}) \tag{2}$$
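Expressions (1) and (2) can be illustrated with a minimal sketch over the two words described for FIG. 7. It assumes that each node of the tree structure dictionary is identified by its phoneme prefix, that the supplied scores are the values of −log p(w|h) for one fixed word history h, and that the look-ahead value before the first phoneme is 0; the function names `build_lookahead` and `delta` are hypothetical.

```python
def build_lookahead(word_costs):
    """Expression (1): for every tree node (phoneme prefix), take the
    minimum language model cost over the words reachable through it."""
    pi = {}
    for phonemes, cost in word_costs.items():
        for i in range(1, len(phonemes) + 1):
            prefix = phonemes[:i]
            pi[prefix] = min(pi.get(prefix, float("inf")), cost)
    return pi


def delta(pi, prefix):
    """Expression (2): look-ahead value of the current node minus that of
    the previous node. The value before the first phoneme is assumed 0."""
    previous = pi[prefix[:-1]] if len(prefix) > 1 else 0.0
    return pi[prefix] - previous


# The two words and scores described for FIG. 7.
words = {
    ("a", "k", "u", "sy", "u"): 80.0,  # "akusyu" (handshake)
    ("a", "k", "a", "i"): 50.0,        # "akai" (red)
}
pi = build_lookahead(words)

# Difference values along the path of "akusyu".
path = ("a", "k", "u", "sy", "u")
deltas = [delta(pi, path[:i]) for i in range(1, len(path) + 1)]
```

Under these assumptions the difference values along any path sum to the language model score of the word at its end, so the total added to the accumulated score is unchanged; only the time at which it is added moves earlier.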
An example of the language model score look-ahead value given by the above operation is shown in FIG. 8. The value to the right of each end terminal phoneme indicates the language model score of the corresponding word, and the value in each branch indicates the language model score look-ahead difference value imparted to the branch. In this example, a language model score of 50 can be added to the accumulated score when the root part of the tree structure is connected to the previous hypothesis. Thus, beam search can be performed more efficiently than in the case in which the language model score is added to the accumulated score only when the hypothesis reaches the word end terminal.
The above optimal language model score acquisition means 401 is disclosed in Non-patent document 1. Non-patent document 1 discloses two methods: look-ahead of a unigram language model score and look-ahead of a bigram language model score. The look-ahead of the unigram language model score uses the unigram language model score to obtain the language model score look-ahead difference value. In this method, when the hypothesis reaches the word end terminal of the tree structure dictionary and the word is determined, the unigram language model score that has been used is discarded, and the determined bigram language model score is added. This processing, which is performed when the hypothesis reaches the word end terminal, is called word end processing. On the other hand, the look-ahead of the bigram language model score uses the bigram language model score from the look-ahead step onward. The searching means 305 shown in FIG. 6 includes a word end processing means 307 in addition to an original searching means 306 that performs the original search, and corresponds to the example that uses the look-ahead method of the unigram language model score.
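The word end processing used with the unigram look-ahead can be sketched as a single score correction. The sketch assumes that scores are costs and that, by the time the word end terminal is reached, the full unigram score of the word has been added through the look-ahead difference values; the function name `word_end_processing` and the numeric values are hypothetical and are not taken from Non-patent document 1.

```python
def word_end_processing(accumulated, unigram_cost, bigram_cost):
    """Discard the unigram cost added during look-ahead and add the
    bigram cost that becomes known once the word is determined."""
    return accumulated - unigram_cost + bigram_cost


# Hypothetical values: an accumulated cost of 230 that already contains a
# unigram cost of 80, corrected with a bigram cost of 65.
corrected = word_end_processing(230.0, unigram_cost=80.0, bigram_cost=65.0)
```

With the bigram look-ahead, by contrast, no such correction is needed at the word end, because the bigram score has been in use since the look-ahead step.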