This is a continuation-in-part application of an earlier application titled "APPARATUS FOR SPEECH RECOGNITION" filed Sept. 4, 1984 by the present applicants (Serial Number being unknown), claiming priority of three Japanese Patent applications filed Sept. 5, 1983, July 27, 1984 and Aug. 16, 1984.
This invention relates generally to a method for speech recognition, and more particularly to such a method for recognizing speech by way of phoneme recognition.
Apparatus for speech recognition which automatically recognizes spoken words is an extremely useful means for supplying computers or other apparatus with various data and instructions. In prior speech recognition apparatus, a pattern-matching method is usually adopted as an operating principle. According to this method, standard patterns are prepared and prestored in a memory for all words to be recognized, and the degree of similarity between an unknown input pattern and each of the standard patterns is computed; the input pattern is determined to be the word whose standard pattern yields the highest similarity. In this pattern-matching method, since it is necessary to prepare standard patterns of all words to be recognized, new standard patterns have to be inputted and stored for each individual speaker. Therefore, to recognize more than several hundred words, it is time-consuming and troublesome to register all these words spoken by each individual speaker. Furthermore, a memory used for storing such a dictionary of spoken words must have an extremely large capacity. Moreover, this method suffers from a drawback that the time required for matching an input pattern against the standard patterns grows as the number of words in the word dictionary increases.
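The pattern-matching principle described above can be illustrated by the following minimal sketch. It is given only for explanation and is not taken from any prior apparatus: the cosine similarity measure, the four-element feature vectors, and the names `cosine_similarity`, `recognize_word` and `standard_patterns` are all assumptions. The input pattern is compared with every stored word pattern, so the amount of computation grows with the size of the word dictionary.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recognize_word(input_pattern, standard_patterns):
    """Return the word whose standard pattern is most similar to the input.

    Every stored pattern is visited once, so the cost is proportional
    to the number of words in the dictionary.
    """
    return max(
        standard_patterns,
        key=lambda word: cosine_similarity(input_pattern, standard_patterns[word]),
    )

# Hypothetical standard patterns registered in advance for one speaker.
standard_patterns = {
    "akai": [0.9, 0.1, 0.8, 0.3],
    "aoi": [0.2, 0.9, 0.1, 0.7],
}

best = recognize_word([0.8, 0.2, 0.7, 0.3], standard_patterns)
print(best)
```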
Another speech recognition method determines similarity between words prestored in a word dictionary in the form of phonemes, and input sounds which are recognized as a combination of phonemes. In the phoneme method, the capacity of the memory required for the word dictionary is reduced, the period of time required for pattern matching comparison is shortened, and the contents of the word dictionary can be readily changed. For instance, since a sound "AKAI" can be expressed by way of a simple form of "a k a i" with three different phonemes /a/, /k/ and /i/ being combined, it is easy to handle a number of spoken words emitted from unspecific speakers.
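As an illustration of the phoneme method, the following sketch stores each word as a short phoneme sequence and selects the word closest to a recognized phoneme string. The use of edit distance as the similarity measure, the three dictionary entries, and the names `edit_distance`, `word_dictionary` and `recognize_from_phonemes` are assumptions made for this example only.

```python
def edit_distance(a, b):
    """Number of phoneme insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical word dictionary: each entry is only a phoneme sequence,
# so entries are small and can be added or changed as plain text.
word_dictionary = {
    "AKAI": ["a", "k", "a", "i"],
    "AOI": ["a", "o", "i"],
    "KAI": ["k", "a", "i"],
}

def recognize_from_phonemes(phonemes, dictionary):
    """Return the dictionary word whose phoneme sequence is closest to the input."""
    return min(dictionary, key=lambda word: edit_distance(phonemes, dictionary[word]))

# The last phoneme is misrecognized as /e/; the nearest entry is still AKAI.
result = recognize_from_phonemes(["a", "k", "a", "e"], word_dictionary)
print(result)
```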
In speech recognition for unspecific speaking persons (speaker independent systems), since the characteristics of sounds change drastically depending on sex and age, a problem to be solved is how to classify the various characteristics of sounds so as to recognize words spoken by unspecific persons. Namely, in the case of recognizing speech in units of phonemes, phoneme standard patterns suffer from dispersion due to differences in the sex and age of speaking persons; for example, the shape of the spectrum of the vowel /a/ changes drastically depending on sex.
Therefore, the most important point to be considered in recognition of speech spoken by unspecific persons is to stably obtain a high speech recognition rate for any speaking person in any acoustic environment. For obtaining such a high speech recognition rate, speaking persons using the system should be prevented from shouldering an excessive burden, while the apparatus for speech recognition should not require high-cost portions. However, these points have not been sufficiently satisfied hitherto in speech recognition apparatus proposed or produced on a trial basis.
In a method of using a predicted error disclosed in "EVALUATION OF LPC DISTANCE MEASURE AIMING RECOGNITION OF VOWELS IN SPEECH IN CONVERSATION" by Shikano and Koda in Transactions of The Institute of Electronic & Communication Engineers of Japan, VOL J-63D, No. 5, May, 1980, a predicted error is obtained by the following formula with the most similar parameter A.sub.ij (j=1, 2, . . . , p, wherein p is the order of analysis) of a phoneme i being obtained by way of linear prediction analysis using speech sounds of a number of speaking persons: ##EQU1## wherein S.sub.j is an autocorrelation coefficient obtained from unknown input speech sounds.
The predicted error N.sub.i is obtained for each phoneme, which is an objective, as a distance measure, and a phoneme causing the smallest value of N.sub.i is determined as the result of recognition.
However, since the most similar parameter A.sub.ij corresponding to a standard pattern of a phoneme is just an average value in this method, it is impossible to deal with sound variation due to co-articulation even though a learning function is provided for producing A.sub.ij again to make it suitable for a present speaking person, and therefore the above-mentioned method suffers from a low recognition rate.
Furthermore, the method has a drawback that the recognition rate cannot be increased because phonemes of vowels and semivowels are determined by way of standard patterns in units of frames so as to effect segmentation and phoneme recognition as combinations of determined results, and therefore time-dependent variation cannot be sufficiently captured.
To compensate for the above-mentioned defects, another method has been tried in which a number of standard patterns corresponding to a number of speaking persons are provided for each phoneme, and similarity calculation is executed against all the standard patterns for input speech sounds so as to determine which standard pattern shows the highest similarity, or similarity to phoneme standard patterns is calculated paying attention to time-dependent variation of the spectrum of consonants, semivowels and contracted sounds of unknown input speech sounds. However, this method requires an enormous amount of calculation and results in a high cost of the speech recognition apparatus.
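The growth in calculation mentioned above can be seen in the following sketch, in which several standard patterns are held for each phoneme and an input frame is compared with every one of them. The Euclidean distance, the two-element feature vectors, and the names `classify_frame` and `templates` are assumptions chosen for illustration and are not taken from the method referred to.

```python
import math

def classify_frame(frame, templates):
    """Return (nearest phoneme, number of distance calculations performed).

    templates maps each phoneme to a list of standard patterns, e.g. one
    per speaker group. Every pattern is compared with the input frame.
    """
    best_phoneme, best_dist, comparisons = None, math.inf, 0
    for phoneme, patterns in templates.items():
        for pattern in patterns:
            comparisons += 1
            d = math.dist(frame, pattern)  # Euclidean distance (assumed measure)
            if d < best_dist:
                best_phoneme, best_dist = phoneme, d
    return best_phoneme, comparisons

# Hypothetical templates: two standard patterns for each of two phonemes.
templates = {
    "a": [[1.0, 0.2], [0.8, 0.4]],
    "i": [[0.1, 0.9], [0.3, 1.0]],
}

phoneme, comparisons = classify_frame([0.75, 0.45], templates)
print(phoneme, comparisons)
```

With P phonemes and K standard patterns per phoneme, each frame requires P times K distance calculations, which is the source of the cost described above.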