A conventional method and apparatus for speech recognition is described in, e.g., “Hermann Ney: Data Driven Search Organization for Continuous Speech Recognition (IEEE TRANSACTIONS ON SIGNAL PROCESSING Vol. 40 No. 2 p272 1992)”.
FIG. 8 shows the process flow of a speech recognition system according to the related art. The process steps shown in the figure are executed synchronously with the frames of an input utterance. By executing the steps through to the end of the input utterance, a hypothesis approximating the input utterance is obtained as the recognition result. A search employing such a method is referred to as a frame synchronization beam search. Each of the steps is explained below.
Using the one-pass search algorithm, a hypothesis established on the i-th frame of an input utterance is developed in the (i+1)-th frame. If the hypothesis is within a word, an utterance segment is used to express the word. Otherwise, if the hypothesis is at a word end, a following word is joined according to an inter-word connection rule, and the first utterance segment is extended. The hypotheses on the i-th frame are then erased so that only the hypotheses on the (i+1)-th frame are stored (step S801).
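The development of a hypothesis in step S801 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited reference: the lexicon `LEXICON`, the connection rule `FOLLOWERS`, the function name `develop`, and the choice to let a hypothesis remain in its current segment for another frame are all hypothetical.

```python
# Hypothetical lexicon: each word is a sequence of utterance segments.
LEXICON = {"go": ["g", "ow"], "to": ["t", "uw"], "ten": ["t", "eh", "n"]}

# Hypothetical inter-word connection rule: words allowed to follow each word.
FOLLOWERS = {"go": ["to", "ten"], "to": ["go", "ten"], "ten": ["go"]}


def develop(word, pos):
    """Return the (word, segment index) states reachable in the next frame."""
    # Assumption: a hypothesis may stay in its current segment for one more frame.
    successors = [(word, pos)]
    if pos + 1 < len(LEXICON[word]):
        # Within a word: advance to the next segment of the same word.
        successors.append((word, pos + 1))
    else:
        # At a word end: join every word the connection rule allows,
        # entering each at its first segment.
        successors.extend((nxt, 0) for nxt in FOLLOWERS[word])
    return successors
```

In this sketch a hypothesis within a word yields at most two successors, whereas a hypothesis at a word end yields one successor per connectable word.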
Next, among the hypotheses developed in the (i+1)-th frame in step S801, the hypothesis with the highest score accumulated up to the (i+1)-th frame (hereinafter referred to as the cumulative score) is taken as a reference. Only the hypotheses whose scores are within a constant threshold of the reference score are stored, and all other hypotheses are erased. This is referred to as narrowing the candidates. The narrowing prevents the number of hypotheses from increasing exponentially and the computation from becoming intractable (step S802).
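The narrowing of step S802 can be sketched as follows. The function name `prune`, the parameter `beam_width`, and the representation of a hypothesis as a `(state, score)` pair are assumptions made for illustration; scores are assumed to be log-likelihoods, so that a higher score is better.

```python
def prune(hypotheses, beam_width):
    """Keep hypotheses whose cumulative score is within beam_width of the best.

    hypotheses: list of (state, cumulative_score) pairs (hypothetical layout).
    beam_width: the constant threshold, as a non-negative score difference.
    """
    if not hypotheses:
        return []
    # The highest cumulative score in this frame is the reference.
    best = max(score for _, score in hypotheses)
    # Store only hypotheses within the threshold of the reference.
    return [(state, score) for state, score in hypotheses
            if score >= best - beam_width]
```

For example, with scores -1.0, -3.0 and -10.0 and a threshold of 5.0, the reference is -1.0 and the hypothesis scoring -10.0 is erased.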
Next, the process advances from the current frame i to the next frame i+1, and it is determined whether that frame is the last frame. If it is the last frame, the process ends. If it is not the last frame, the process returns to step S801 (step S803).
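Taken together, steps S801 to S803 form a loop over the frames of the utterance. The following is a minimal sketch under stated assumptions: the function `beam_search`, the caller-supplied `expand` callback, and the `(path, score)` hypothesis layout are hypothetical, and scores are assumed to be log-likelihoods.

```python
def beam_search(num_frames, initial, expand, beam_width):
    """Frame-by-frame search; returns the best (path, score) at the last frame.

    initial: non-empty list of (path, score) hypotheses before the first frame.
    expand:  hypothetical callback (hypothesis, frame) -> list of hypotheses.
    """
    hypotheses = list(initial)
    for frame in range(num_frames):
        # S801: develop every hypothesis into the next frame; the previous
        # frame's hypotheses are discarded once they have been developed.
        developed = [h for hyp in hypotheses for h in expand(hyp, frame)]
        if not developed:
            break
        # S802: keep only hypotheses within the threshold of the best score.
        best = max(score for _, score in developed)
        hypotheses = [(p, s) for p, s in developed if s >= best - beam_width]
        # S803: the loop moves to the next frame, or ends after the last one.
    return max(hypotheses, key=lambda h: h[1])


# Toy usage: two labels with made-up per-frame log scores.
FRAME_SCORES = [{"a": -1.0, "b": -2.0}, {"a": -4.0, "b": -0.5}]


def toy_expand(hyp, frame):
    path, score = hyp
    return [(path + (label,), score + s)
            for label, s in FRAME_SCORES[frame].items()]


result = beam_search(2, [((), 0.0)], toy_expand, beam_width=1.5)
```

In the toy run, four hypotheses are developed in the second frame and two of them fall outside the threshold and are erased.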
As described above, the related-art method narrows down the hypothesis candidates depending only upon whether or not the cumulative score is within a threshold.
Incidentally, Japanese Patent Laid-Open No. 6588/1996, for example, discloses a speech recognition method for accurately evaluating hypotheses in the frame synchronization beam search. The speech recognition method described in this publication normalizes scores with respect to time in the frame synchronization beam search. Namely, a likelihood function common to all the hypotheses is subtracted from the score of each hypothesis at time t. Then, the maximum value of the normalized scores is stored, together with the hypotheses whose normalized scores are within a constant threshold of that maximum value.
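The normalization described above can be sketched as follows. This is an illustration only and is not taken from the cited publication: the function name `normalize_and_prune`, the parameter `common_log_likelihood` (the value at time t of the likelihood term common to all hypotheses), and the dictionary layout are assumptions.

```python
def normalize_and_prune(scores, common_log_likelihood, beam_width):
    """Normalize the scores at one time step, then narrow the candidates.

    scores: dict mapping a hypothesis label to its score at time t.
    Returns (maximum normalized score, dict of stored normalized scores).
    """
    # Subtract the term common to all hypotheses from every score.
    normalized = {h: s - common_log_likelihood for h, s in scores.items()}
    # Store the maximum normalized score ...
    best = max(normalized.values())
    # ... and the hypotheses within the threshold of that maximum.
    kept = {h: s for h, s in normalized.items() if s >= best - beam_width}
    return best, kept


best, kept = normalize_and_prune(
    {"a": -10.0, "b": -12.0, "c": -20.0},
    common_log_likelihood=-9.0,
    beam_width=5.0,
)
```

Because the same value is subtracted from every hypothesis, the differences between scores within one frame are unchanged.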
In the related-art speech recognition system, however, whether a hypothesis is within a word or at a word end, the hypothesis with the highest cumulative score is taken as the reference, as noted above, and every hypothesis whose score is within a constant threshold of that score is stored. Consequently, because a number of connectable word candidates follow at a word end, the number of hypotheses increases greatly there. As a result, there has been a drawback in that the computation for selecting hypothesis candidates becomes difficult.
The present invention has been made to solve this problem. It is an object of the invention to provide a method and apparatus for speech recognition capable of effectively reducing the amount of computation in selecting hypothesis candidates while securing the accuracy of speech recognition.