The present invention relates to a speech recognition apparatus, and more particularly to an apparatus capable of stable preliminary selection of candidate words at high speed and with high accuracy.
In a speech recognition apparatus based on Markov models, a preprocessing stage analyzes the input speech over a series of short, constant time intervals, hereinafter called frames (for example, about 12 milliseconds each), and generates a label string corresponding to the input speech. As a preliminary selection method suited to such a system, the Polling Fast Match method using one-state Markov models is well known (see published Japanese Patent Application No. 62-220996 or U.S. Pat. No. 4,718,094). This method determines in advance the probability of producing each label in the label alphabet (a label set) at an arbitrary frame of each word included in a vocabulary, accumulates the probability corresponding to each word in accordance with each label of the label string of the input speech to be recognized, and selects the candidate words from the vocabulary on the basis of the accumulated value for each word. These selected words are then more finely matched with the input speech.
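By way of illustration only, the accumulation and selection steps described above may be sketched in Python as follows. The sketch assumes that the accumulated value is a sum of log-probabilities (the description says only that probabilities are accumulated). The function name `polling_fast_match`, the three-label alphabet, the word names, and all probability values are hypothetical and are not taken from the cited references.

```python
import math

def polling_fast_match(label_string, log_probs, top_k):
    """Rank vocabulary words by an accumulated per-label score.

    label_string: labels produced by preprocessing, one per frame.
    log_probs:    word -> {label -> log-probability of that label at an
                  arbitrary frame of the word} (one-state model per word).
    top_k:        number of candidate words to keep (assumed parameter).
    """
    scores = {}
    for word, table in log_probs.items():
        # Assumption: accumulate log-probabilities, one term per input label.
        scores[word] = sum(table[label] for label in label_string)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Hypothetical toy vocabulary over a three-label alphabet {A, B, C}.
PROBS = {
    "word1": {"A": 0.7, "B": 0.2, "C": 0.1},
    "word2": {"A": 0.1, "B": 0.2, "C": 0.7},
    "word3": {"A": 0.3, "B": 0.4, "C": 0.3},
}
LOG_PROBS = {
    word: {label: math.log(p) for label, p in table.items()}
    for word, table in PROBS.items()
}
LABELS = ["A", "A", "B", "A"]  # hypothetical label string of the input speech

candidates, scores = polling_fast_match(LABELS, LOG_PROBS, top_k=2)
print(candidates)
```

In this toy example the call selects `word1` and `word3` as candidates. Because the accumulated value is a plain sum over the labels, it is the same for any reordering of the label string.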
Since the Polling Fast Match method uses no time information, however, a word whose end portion is similar to the head portion of the uttered word is erroneously judged to be a candidate, which degrades recognition accuracy.
Other prior art is disclosed in papers entitled "Speaker Independent Isolated Word Recognition Using Label Histograms", by O. Watanuki and T. Kaneko (Proceedings of ICASSP '86, pp. 2679-2682, April, 1986), and "Experiments in Isolated Digit Recognition with a Cochlear Model", by Eric P. Loeb and Richard F. Lyon (Proceedings of ICASSP '87, pp. 1131-1134, April, 1987).
In the former technique, each word included in the vocabulary is divided into N blocks, and the probability of producing each label in the label alphabet at an arbitrary frame of each block is determined in advance. The label string of the unknown input speech is likewise divided into N sections. The probability for each word is accumulated in accordance with each label of the label string and the block containing that label. The word having the maximum accumulated value is determined to be the recognized word.
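The block-wise accumulation just described may be sketched, under stated assumptions, as follows. The sketch assumes that the N sections are of equal length and that log-probabilities are summed; neither detail is given in the description above. The names `block_index`, `block_score`, and `recognize`, the two-label alphabet, and the probability values are hypothetical.

```python
import math

def block_index(frame, num_frames, n_blocks):
    # Assumption: equal-length sections. The total number of frames must be
    # known before any frame can be assigned to a section.
    return frame * n_blocks // num_frames

def block_score(label_string, block_log_probs):
    """Accumulate, for one word, the log-probability of each input label
    under the block that contains the label's frame."""
    n_blocks = len(block_log_probs)
    num_frames = len(label_string)
    total = 0.0
    for frame, label in enumerate(label_string):
        block = block_index(frame, num_frames, n_blocks)
        total += block_log_probs[block][label]
    return total

def recognize(label_string, models):
    # The word with the maximum accumulated value is the recognition result.
    return max(models, key=lambda word: block_score(label_string, models[word]))

def to_log(blocks):
    return [{label: math.log(p) for label, p in table.items()} for table in blocks]

# Hypothetical two-word vocabulary with N = 2 blocks per word.
MODELS = {
    "ab": to_log([{"A": 0.8, "B": 0.2}, {"A": 0.2, "B": 0.8}]),
    "ba": to_log([{"A": 0.2, "B": 0.8}, {"A": 0.8, "B": 0.2}]),
}

print(recognize(["A", "A", "B", "B"], MODELS))
print(recognize(["B", "B", "A", "A"], MODELS))
```

With these toy models the first label string is recognized as `ab` and the second as `ba`, so the order of the labels now affects the result.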
In the latter technique, similar processing with N=2 is carried out.
In these techniques, the input speech cannot be divided into N (or 2) sections until the utterance is complete, which makes real-time processing difficult. Furthermore, they are vulnerable to fluctuation in the time direction, since they perform no smoothing in the time direction between the blocks.
It is to be noted here that the two techniques described above relate to speech recognition, and not to a preliminary selection of candidate words.