The invention relates to a method for the speaker adaptive recognition of speech. Among other things, an efficient speech recognition method must meet the following requirements: isolated words as well as continuous speech must be recognized. Even with very large vocabularies, recognition should take place in real time if possible. Fast adaptation to a new speaker is necessary. It should be possible to generate reference words arbitrarily and to expand the vocabulary without (possibly repeated) sample-speaking of the added words. It must be possible to generate variations in the pronunciation of individual words automatically and without explicit sample-speaking of these variants. In continuous speech, an analysis of overlapping word hypotheses should make it possible to recognize the spoken phrase.
The known methods of recognizing speech from a large vocabulary (IBM, Dragon, AT&T, BBN, Carnegie Mellon University (CMU)/Pittsburgh; overview article by F. Fallside, entitled "Progress in Large Vocabulary Speech Recognition," Speech Technology, Vol. 4, No. 4 (1989), pages 14-15) primarily employ hidden-Markov models based on phonemes. None of these systems provides automatic vocabulary generation or expansion from written text. In the IBM and Dragon recognizers, the words must be spoken separately, while the AT&T, BBN and CMU recognizers do not operate in a speaker adaptive manner.
Conventionally, each word--in the case of speaker dependent recognition--must be pronounced once or repeatedly by the user and--in the case of speaker independent recognition--must additionally be pronounced at least once by a very large number of speakers (on the order of 100 to 1000). Such a complicated training procedure can be avoided if speaker adaptive methods are employed. With increasing vocabulary sizes it becomes necessary, if recognition is to take place close to real time, to compile quickly and without extensive computation a short list of probable "word candidates". From this sub-vocabulary of word candidates, the spoken words are then determined in the course of a fine analysis. Such a preselection is based on the classification of coarse features in word subunits, for example in individual feature vectors, phonemes or diphones. For separately spoken words--also from large vocabularies--and for sequences of digits (see F. R. Chen, "Lexical Access And Verification In A Broad Phonetic Approach To Continuous Digit Recognition", IEEE ICASSP (1986), pages 21.7.1-4; H. Lagger and A. Waibel, "A Coarse Phonetic Knowledge Source For Template Independent Large Vocabulary Word Recognition", IEEE ICASSP(2), (1985), pages 23.6.1-4; D. Lubensky and W. Feix, "Fast Feature-Based Preclassification Of Segments In Continuous Digit Recognition", IEEE ICASSP, (1986), pages 21.6.1-4), this constitutes a practicable method. For continuously spoken speech and a larger vocabulary, however, it leads to an unmanageable flood of hypotheses even for average vocabulary sizes since, in principle, a new word may start at any one of these small units and the entire vocabulary would have to be searched for each unit. Two- or three-dimensional dynamic programming is known from G. Micca, R. Pieraccini and P. Laface, "Three-Dimensional DP For Phonetic Lattice Matching", Int. Conf. on Dig. Signal Proc., (1987), Florence, Italy; and from G. Ruske and W. Weigel, "Dynamische Programmierung auf der Basis silbenorientierter Einheiten zur automatischen Erkennung gesprochener Sätze" [Dynamic Programming Based On Syllable Oriented Units For The Automatic Recognition Of Spoken Sentences], NTG-Fachberichte 94, (1986), Sprachkommunikation [Speech Communication], pages 91-96.
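By way of illustration only, the coarse preselection of word candidates discussed above for separately spoken words may be pictured with the following minimal sketch. It is not taken from any of the cited works. The phoneme-to-class table, the toy lexicon and the function names (coarse_key, build_index, preselect, naive_hypothesis_count) are all assumptions chosen for the example. The last function merely counts how many word hypotheses would arise if, as described for continuous speech, every subunit were treated as a possible word start and the whole vocabulary were searched at each one.

```python
from collections import defaultdict

# Hypothetical coarse phonetic classes (an assumption for illustration):
# S = stop, F = fricative, N = nasal, L = liquid/glide, V = vowel.
COARSE_CLASS = {
    "p": "S", "t": "S", "k": "S", "b": "S", "d": "S", "g": "S",
    "f": "F", "s": "F", "z": "F", "v": "F",
    "m": "N", "n": "N",
    "l": "L", "r": "L", "w": "L",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}

# Toy lexicon with made-up phoneme transcriptions (assumption).
LEXICON = {
    "two": ["t", "u"],
    "do": ["d", "u"],
    "sun": ["s", "a", "n"],
    "fan": ["f", "a", "n"],
    "man": ["m", "a", "n"],
    "six": ["s", "i", "k", "s"],
}


def coarse_key(phonemes):
    """Map a phoneme sequence to its sequence of coarse classes."""
    return tuple(COARSE_CLASS[p] for p in phonemes)


def build_index(lexicon):
    """Group all words of the lexicon by their coarse class sequence."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[coarse_key(phonemes)].append(word)
    return index


def preselect(index, observed_classes):
    """Return the short list of word candidates for one isolated word."""
    return sorted(index.get(tuple(observed_classes), []))


def naive_hypothesis_count(num_units, vocabulary_size):
    """Word hypotheses if every unit may start any word of the vocabulary."""
    return num_units * vocabulary_size


INDEX = build_index(LEXICON)

if __name__ == "__main__":
    # Isolated word: one lookup yields a small sub-vocabulary for fine analysis.
    print(preselect(INDEX, ["F", "V", "N"]))
    # Continuous speech: the count grows with both utterance length and vocabulary.
    print(naive_hypothesis_count(num_units=200, vocabulary_size=1000))
```

In this sketch an isolated word requires a single table lookup, whereas for continuous speech the number of hypotheses grows with the product of the number of subunits and the vocabulary size, which is the difficulty noted above.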
In the prior art methods, the above-mentioned requirements are met only in part and, in some cases, not satisfactorily.