1. Field of the Invention
The present invention relates to a continuous speech recognition apparatus and method, and more particularly to a continuous speech recognition apparatus and method which achieves augmentation in speed and accuracy of recognition.
2. Description of the Related Art
As an example of a conventional continuous speech recognition apparatus, reference is made to a paper by S. Ortmanns, “LANGUAGE-MODEL LOOK-AHEAD FOR LARGE VOCABULARY SPEECH RECOGNITION”, ICSLP, 1996.
The conventional continuous speech recognition apparatus is shown in FIG. 6. Referring to FIG. 6, the conventional continuous speech recognition apparatus shown includes a hypothesis storage section 1, a hypothesis expansion section 3, a tree structure dictionary storage section 4, a language model section 7, and an acoustic model section 8.
In operation, the hypothesis storage section 1 stores hypotheses therein. The tree structure dictionary storage section 4 stores words, which make an object of recognition, as a tree structure dictionary (refer to FIG. 2). The acoustic model section 8 calculates an acoustic model score for each frame. The language model section 7 calculates a language model score.
For each frame, the hypothesis expansion section 3 acquires a structure of arcs from the tree structure dictionary storage section 4 and, taking an acoustic model score from the acoustic model section 8 and a language model score from the language model section 7 into consideration, expands a hypothesis present on an arc to a succeeding arc. Referring to FIG. 2, a tree structure dictionary is structured such that a word is reached by tracing arcs branching in a tree structure from a root to a leaf (terminal arc).
Speech which makes an object of recognition is divided into short-time frames of a predetermined period, and such expansion as described above (that is, expansion of a hypothesis on an arc of a tree structure dictionary to a succeeding arc) is repeated from the speech beginning frame to the speech terminating frame. Then, a word through which a hypothesis which exhibits the highest score has passed in the past (a terminal of the tree structure dictionary) is finally determined as a recognition result.
Here, a hypothesis has position information of an arc on a tree structure dictionary, a history until the position is reached, and a score.
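The tree structure dictionary and the hypothesis described above can be pictured with the following minimal Python sketch. It is an illustration only: the class names Arc and Hypothesis, the helper build_tree, the phoneme labels, and the three-word lexicon are all assumptions introduced here and do not appear in the cited paper or in FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Arc:
    """One arc of a tree structure dictionary (hypothetical layout)."""
    phoneme: str
    children: Dict[str, "Arc"] = field(default_factory=dict)
    word: Optional[str] = None  # set only on a terminal arc (leaf)


def build_tree(lexicon: Dict[str, List[str]]) -> Arc:
    """Share common phoneme prefixes so that words branch from one root."""
    root = Arc(phoneme="<root>")
    for word, phonemes in lexicon.items():
        arc = root
        for p in phonemes:
            if p not in arc.children:
                arc.children[p] = Arc(phoneme=p)
            arc = arc.children[p]
        arc.word = word  # the word is known only at the terminal arc
    return root


@dataclass
class Hypothesis:
    """Position on the tree, history up to that position, and a score."""
    arc: Arc
    history: Tuple[str, ...]
    score: float


# Hypothetical lexicon: "cat" and "cap" share the arcs k -> ae.
lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"], "dog": ["d", "ao", "g"]}
root = build_tree(lexicon)
hyp = Hypothesis(arc=root.children["k"].children["ae"], history=("the",), score=-12.5)
```

On the shared arc "ae" the hypothesis could still end in either "cat" or "cap", which is why the word cannot be identified before a terminal arc is reached.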
In a continuous speech recognition system wherein a plurality of words are represented as one tree structure dictionary (refer to FIG. 2), the word toward which a hypothesis is currently being expanded cannot be identified until the hypothesis reaches a terminal arc.
Therefore, although an acoustic model score is calculated for each frame, a language model score can originally be determined only when a hypothesis reaches a terminal arc of a tree structure dictionary.
Therefore, in order to add a language model score as early as possible, a method employing look-ahead of a unigram language model score and look-ahead of a bigram language model score is disclosed in the document mentioned hereinabove.
According to the look-ahead of a unigram language model score, the highest one of the unigram language model scores of the words settled at terminal arcs in a tree structure dictionary is provided to a predecessor arc, and the unigram language model score provided to the arc is temporarily added as a language model score of the hypothesis present on the arc. Then, when the hypothesis reaches the terminal arc of the tree structure dictionary and the word is settled, the unigram language model score which has been used until then is abandoned and the settled bigram language model score is added.
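As a rough illustration of the unigram look-ahead just described, the following Python sketch gives each arc the best unigram score among the words reachable from it, and replaces that temporary score by a bigram score once the word is settled. The function names, the identification of an arc by its phoneme prefix, and all score values are assumptions made for this sketch; scores are taken to be log probabilities, so a higher value is better.

```python
import math
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]  # an arc is identified here by its phoneme prefix


def unigram_lookahead(lexicon: Dict[str, List[str]],
                      unigram_logprob: Dict[str, float]) -> Dict[Prefix, float]:
    """For every arc, keep the highest unigram score of the words behind it."""
    best: Dict[Prefix, float] = {}
    for word, phonemes in lexicon.items():
        score = unigram_logprob[word]
        for i in range(1, len(phonemes) + 1):
            prefix = tuple(phonemes[:i])
            best[prefix] = max(best.get(prefix, -math.inf), score)
    return best


def settle_word(score_with_lookahead: float,
                lookahead_score: float,
                bigram_score: float) -> float:
    """At a terminal arc, drop the temporary unigram score and add the bigram score."""
    return score_with_lookahead - lookahead_score + bigram_score


# Hypothetical lexicon and scores.
lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"], "dog": ["d", "ao", "g"]}
unigram = {"cat": -2.0, "cap": -5.0, "dog": -3.0}
lookahead = unigram_lookahead(lexicon, unigram)
```

With these values the shared arcs ("k",) and ("k", "ae") carry the score of "cat", the better of the two words behind them, while the terminal arc of "cap" carries its own score.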
On the other hand, according to the look-ahead of a bigram language model score, when a context is determined and a new tree structure dictionary is produced, the bigram language model scores of all words with respect to the context are calculated, the highest one of those language model scores is provided to a predecessor arc, and the bigram language model score provided to the arc is then added as a language model score of the hypothesis present on that arc.
The conventional speech recognition system has the following problems.
The first problem resides in that, when look-ahead of a bigram language model score is performed, a great memory capacity and a large amount of calculation are required.
The reason is that, where look-ahead of a bigram language model score is performed, each time a context is produced and a tree structure dictionary is newly produced, it is required to repeat the processing of producing not part of a tree structure dictionary but an entire tree structure dictionary, calculating all bigram language model scores with respect to the context, and providing the language model scores of all terminal arcs in the tree structure dictionary, at which words are settled, to the predecessor arcs so that the language model scores are propagated to all predecessor arcs.
The second problem resides in that, when look-ahead of a unigram language model score is performed, wasteful calculation is performed.
The reason is that, when look-ahead of a unigram language model score is performed, some arcs of a tree structure dictionary may lead only to words whose connection to the context is not permitted linguistically, and yet the hypothesis is expanded to such arcs as well, which involves wasteful calculation.
The third problem is as follows. If strict look-ahead of a language model score of a bigram or more is not performed in a frame synchronous beam search (for the frame synchronous beam search, refer to, for example, Hermann Ney, “Data Driven Search Organization for Continuous Speech Recognition”, IEEE TRANSACTIONS ON SIGNAL PROCESSING, February, 1992), that is, if the connection possibility according to linguistic restrictions between a context and a word in a tree structure dictionary is not looked ahead, then the hypothesis is expanded also to an arc which is developed to a word whose connection to the context is not permitted linguistically, as described above in connection with the second problem.
If the score of that hypothesis is then much higher than those of the others, all hypotheses on arcs which are developed to words whose connection to the context is permitted linguistically are excluded from the beam and thus eliminated.
As a result, in the succeeding frames, the word cannot be connected to a next word at all, and recognition processing for speech uttered later is disabled. In other words, recognition processing cannot be performed any more and a recognition result cannot be outputted.
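The failure described above can be reproduced with a toy beam pruning step. The sketch below is only an assumed illustration: the tuple layout, the function names, the beam width, and the scores are invented here, and a real frame synchronous beam search would prune far more hypotheses per frame.

```python
from typing import List, Tuple

# (label, leads to a linguistically permitted word, log score); hypothetical layout
Hyp = Tuple[str, bool, float]


def beam_prune(hyps: List[Hyp], beam_width: float) -> List[Hyp]:
    """Keep only hypotheses whose score lies within beam_width of the best score."""
    best = max(score for _, _, score in hyps)
    return [h for h in hyps if h[2] >= best - beam_width]


def connectable(hyps: List[Hyp]) -> List[Hyp]:
    """Hypotheses that can still be connected to a next word."""
    return [h for h in hyps if h[1]]


hyps = [
    ("arc toward a word not permitted after the context", False, -10.0),
    ("arc toward a word permitted after the context", True, -60.0),
]
survivors = beam_prune(hyps, beam_width=30.0)
remaining = connectable(survivors)  # empty: recognition cannot continue
```

Because the only survivor leads to a word that cannot follow the context, no connectable hypothesis is left once the word end is reached, which corresponds to the disabled recognition described in the text.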
It is an object of the present invention to provide a continuous speech recognition apparatus and method by which the recognition speed and the recognition accuracy in continuous speech recognition can be augmented.
In order to attain the object described above, according to an aspect of the present invention, there is provided a continuous speech recognition apparatus, comprising a hypothesis storage section for storing hypotheses therein, hypothesis expansion discrimination means for determining whether or not a hypothesis may be expanded to a succeeding arc, a tree structure dictionary storage section for storing a tree structure dictionary and a context preceding to the tree structure dictionary therein, a succeeding word speech part information storage section for storing information of whether or not speech parts are included in all of succeeding words present behind each of arcs in the tree structure dictionary, a speech part connection information storage section for storing connection information between the speech parts, means for providing a language model score to a hypothesis, means for providing an acoustic model score to a hypothesis, and hypothesis expansion means operable in response to an expansion instruction received from the hypothesis expansion discrimination means for acquiring a structure of an arc from the tree structure dictionary storage section and expanding a hypothesis present on the arc to a succeeding arc taking the acoustic model score and the language model score into consideration and then storing a result of the expansion into the hypothesis storage section.
According to another aspect of the present invention, there is provided a continuous speech recognition method for a continuous speech recognition apparatus which includes a hypothesis storage section for storing hypotheses therein, a tree structure dictionary storage section for storing a tree structure dictionary and a context preceding to the tree structure dictionary therein, a succeeding word speech part information storage section for storing information of whether or not speech parts are included in all of succeeding words present behind each of arcs in the tree structure dictionary, and a speech part connection information storage section for storing connection information between the speech parts, comprising the step of repeating a process for all of hypotheses present at a certain frame time, the process including the steps of acquiring a context of a tree structure dictionary to which a hypothesis belongs from the tree structure dictionary storage section, acquiring speech part connection information of the speech parts of the context from the speech part connection information storage section, acquiring arcs in the tree structure dictionary to which the hypothesis belongs from the hypothesis storage section, and repeating, for all succeeding arcs immediately succeeding the arcs, a process including the steps of acquiring, where an arc selected at present is represented as a first arc and a succeeding arc immediately succeeding the first arc is represented as a second arc, succeeding word speech part information of the second arc from the succeeding word speech part information storage section, discriminating from the acquired speech part connection information and the acquired succeeding word speech part information whether or not the hypothesis may be expanded from the first arc to the second arc and determining that the hypothesis must not be expanded to the second arc if a connectable speech part included in the speech part connection information is not detected behind the second arc, but determining otherwise that the hypothesis may be expanded to the second arc, expanding the hypothesis to the second arc, and discriminating whether or not the loop has been completed for all of the hypotheses and ending, when the loop has been completed for all of the hypotheses, the expanding processing of the hypotheses of the frame in a frame synchronous beam search.
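The expansion discrimination of the method above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the two storage sections are modeled as plain dictionaries of sets, arcs and trees are identified by hypothetical string labels, and the addition of acoustic model scores and language model scores during expansion is omitted.

```python
from typing import Dict, FrozenSet, List

# Assumed stand-ins for the storage sections:
#   connection_info[pos]   -> speech parts that may follow a context of speech part pos
#   succeeding_pos[arc_id] -> speech parts found among all words behind that arc


def may_expand(context_pos: str, second_arc: str,
               connection_info: Dict[str, FrozenSet[str]],
               succeeding_pos: Dict[str, FrozenSet[str]]) -> bool:
    """True if at least one connectable speech part is detected behind second_arc."""
    allowed = connection_info.get(context_pos, frozenset())
    return bool(allowed & succeeding_pos.get(second_arc, frozenset()))


def expand_frame(hypotheses: List[Dict[str, str]],
                 successors: Dict[str, List[str]],
                 context_pos_of_tree: Dict[str, str],
                 connection_info: Dict[str, FrozenSet[str]],
                 succeeding_pos: Dict[str, FrozenSet[str]]) -> List[Dict[str, str]]:
    """One frame of expansion: loop over all hypotheses and all succeeding arcs."""
    expanded = []
    for hyp in hypotheses:
        context_pos = context_pos_of_tree[hyp["tree"]]   # context of the hypothesis' tree
        first_arc = hyp["arc"]                           # arc the hypothesis is on
        for second_arc in successors.get(first_arc, []):
            if may_expand(context_pos, second_arc, connection_info, succeeding_pos):
                expanded.append({"tree": hyp["tree"], "arc": second_arc})
    return expanded


# Hypothetical data: after an article, only nouns and adjectives may follow.
connection_info = {"article": frozenset({"noun", "adjective"})}
succeeding_pos = {"a1": frozenset({"noun"}), "a2": frozenset({"verb"})}
successors = {"a0": ["a1", "a2"]}
result = expand_frame([{"tree": "t0", "arc": "a0"}], successors,
                      {"t0": "article"}, connection_info, succeeding_pos)
```

In this example the hypothesis is expanded to arc "a1" but not to arc "a2", because only verbs lie behind "a2" and a verb is not connectable to the article context in the assumed connection information.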
With the continuous speech recognition apparatus and method, a hypothesis is prevented from being expanded to an arc leading only to words which cannot connect linguistically to a context. Consequently, the number of unnecessary hypotheses is minimized and the speed of continuous speech recognition is augmented accordingly. Further, a situation in which the score of a hypothesis leading to a word which cannot connect linguistically to a context is higher than those of the other hypotheses is prevented from occurring. Consequently, the recognition accuracy in continuous speech recognition is augmented.
The above and other objects, features and advantages of the present invention will become apparent from the following description and the appended claims, taken in conjunction with the accompanying drawings in which like parts or elements are denoted by like reference symbols.