The present invention relates to speech recognition and more particularly to a method for adjusting the importance or relative weight to be given to the scores provided by each of a plurality of systems or comparative information sources analyzing continuous speech. The scoring systems or information sources are referred to herein as "experts".
As is understood by those skilled in the art, the accuracy of word recognition by an acoustic speech analyzer can be improved by the application or utilization of contextual knowledge. In other words, selection of a most likely candidate from a predetermined vocabulary of word models to match an unknown speech segment can be improved by considering whether it is likely that a given candidate would properly follow a previous word or words in the sequence of speech being analyzed. Further, multiple linguistic systems or analyses may be applied, e.g., a first utilizing only the preceding word and another utilizing two or three preceding words. Each of these systems may, in the context of the present invention, be considered to be an "expert". Likewise, since there are various ways of performing acoustic analysis of input speech, there may be more than one acoustic expert, e.g., the three so-called codebooks utilized by the well-known Sphinx recognizer system. In general, it is conventional for each such expert to return a score which represents the likelihood that the unknown speech matches a given model in the vocabulary of the recognizer. These scores are commonly presented as minus log probabilities, so that a lower score indicates a more likely match.
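By way of illustration only, a linguistic expert that utilizes only the preceding word might be sketched as follows. The function name `train_bigram_expert`, the toy corpus, and the add-one smoothing are assumptions made for this sketch and are not taken from the described system; the sketch shows only how such an expert could return a score as a minus log probability.

```python
import math
from collections import Counter

def train_bigram_expert(sentences):
    """Return a function giving -log P(word | previous word).

    Hypothetical helper for illustration; not part of the described system.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    vocab = set()
    for words in sentences:
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            pair_counts[(prev, word)] += 1
            prev_counts[prev] += 1

    def score(prev, word):
        # Add-one smoothing (an assumption) keeps unseen word pairs finite.
        p = (pair_counts[(prev, word)] + 1) / (prev_counts[prev] + len(vocab))
        return -math.log(p)  # lower score means a more likely successor

    return score

# Toy training text, invented for this example.
corpus = [
    ["call", "the", "office"],
    ["call", "the", "doctor"],
    ["the", "office", "called"],
]
bigram_score = train_bigram_expert(corpus)
likely = bigram_score("call", "the")       # word pair seen in the corpus
unlikely = bigram_score("call", "office")  # word pair never seen
```

In this toy corpus the pair seen in training receives the lower (better) score, which is the behavior expected of a contextual expert.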
In order to combine the scores from the several experts so as to arrive at an overall likelihood of match, some estimate of the relative importance or weighting of the different experts must be provided. Often, this weighting is determined essentially arbitrarily, based upon the system designer's experience. If the scores were each good, independent probability estimates, they could be combined simply by multiplication (equivalently, by addition of the corresponding minus log probabilities). Typically, however, the scores are not independent but rather are interrelated.
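The combination of expert scores may be illustrated by the following minimal sketch, which assumes that each expert reports a minus log probability and that the combined score is a weighted sum. The function name `combine_scores` and the numerical values are hypothetical. With unit weights, the sum corresponds to multiplying independent probabilities; other weights reduce or increase the influence of an individual expert.

```python
import math

def combine_scores(expert_scores, weights):
    """Weighted sum of minus log probability scores (illustrative only)."""
    if len(expert_scores) != len(weights):
        raise ValueError("one weight is required per expert score")
    return sum(w * s for w, s in zip(weights, expert_scores))

# Hypothetical probabilities from an acoustic and a linguistic expert.
p_acoustic, p_language = 0.5, 0.25
scores = [-math.log(p_acoustic), -math.log(p_language)]

# Unit weights: equivalent to multiplying the two probabilities.
unweighted = combine_scores(scores, [1.0, 1.0])

# Halving the linguistic weight reduces that expert's influence.
weighted = combine_scores(scores, [1.0, 0.5])
```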
In the case of an isolated word recognizer, statistical methods can be applied to improve the relative weighting, provided there exists a training database containing acoustic samples of the various individual words to be recognized. As will be understood, however, spoken samples generated in an isolated word context are not well suited to continuous speech recognition since, in continuous speech, there is a substantial degree of co-articulation or interaction between the words spoken in a phrase. It is not, however, feasible to build a vocabulary of phrases, since the database needed to encompass all reasonable permutations of the individual words contemplated in a useful vocabulary would be almost boundless. Thus, the recognizer's vocabulary must of necessity be made up of individual words.
The present invention relates to a method of adjusting the relative weighting to be applied to scores generated by multiple experts in a continuous speech recognizer where training data for this purpose is obtained in the form of multiple word phrases rather than isolated words. The application of numerical methods to this problem, however, is not straightforward or obvious since, in the context of a multi-word phrase, a correct individual word can be part of both correct multi-word hypotheses and incorrect multi-word hypotheses.
As indicated previously, it is not feasible to build a vocabulary of phrases to be recognized. Rather, even in continuous speech recognition, the recognizer is essentially constrained to proceed on a word-by-word or sound-by-sound basis, and each expert employed in the overall system provides scores representing a likelihood of match between each unknown speech segment and the word models in the system's vocabulary. In this context, the term "word" is used in a generic sense so as to encompass subword fragments such as the morae which make up the Japanese spoken language or the phonemes or syllables of the English language.
The method of the present invention is based in large part on the recognition that an objective function can be derived which can compare or analyze the cumulative scores of a plurality of multi-word hypotheses propounded by the multiple experts and that the relative weighting coefficients for the several experts can be systematically adjusted to maximize the performance of the objective function and thereby improve the recognition accuracy of the system.
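The general idea of adjusting weights against an objective function may be illustrated by the following sketch. The passage above does not specify the form of the objective function or the adjustment procedure, so both are assumptions here: the objective counts the training phrases for which the correct multi-word hypothesis obtains the lowest weighted cumulative score, and the weights are chosen by exhaustive search over a small set of candidate values. All names and numerical values are hypothetical.

```python
from itertools import product

def combined(scores, weights):
    """Weighted sum of one hypothesis's cumulative per-expert scores."""
    return sum(w * s for w, s in zip(weights, scores))

def objective(training_set, weights):
    """Count phrases whose correct hypothesis has the lowest combined score.

    Assumed objective for illustration; the source does not define one here.
    """
    wins = 0
    for hypotheses, correct_index in training_set:
        totals = [combined(h, weights) for h in hypotheses]
        if totals.index(min(totals)) == correct_index:
            wins += 1
    return wins

def tune_weights(training_set, candidates, n_experts):
    """Exhaustive search over candidate weights (a stand-in for any optimizer)."""
    best_weights, best_value = None, -1
    for weights in product(candidates, repeat=n_experts):
        value = objective(training_set, weights)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value

# Each entry: (hypotheses, index of the correct hypothesis). Each hypothesis
# holds cumulative minus log scores from (acoustic expert, linguistic expert).
training_set = [
    ([(10.0, 8.0), (9.5, 12.0)], 0),
    ([(7.0, 3.0), (6.0, 6.0)], 1),
]

baseline = objective(training_set, (1.0, 1.0))
best_weights, best_value = tune_weights(training_set, [0.25, 0.5, 1.0], 2)
```

With these invented scores, equal weights rank the correct hypothesis first for only one of the two phrases, whereas the search finds a weighting under which both are ranked correctly. A practical system would presumably use far more training phrases and a more efficient numerical method than exhaustive search.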