The present invention relates to a method for recognizing speech according to the preamble of claim 1, and in particular to a method for recognizing speech which avoids over-adaptation to certain words during online speaker adaptation.
Nowadays, methods and devices for automatic speech recognition implement a so-called online speaker adaptation process to make them more flexible with respect to the large variability in the speaking behaviour of different speakers.
In conventional methods for recognizing speech, a current acoustic model is used for the recognition process, in particular for a set of given speech phrases to be recognized within an incoming speech flow. The current acoustic model contains the information relevant to the recognition process per se, in particular for all potential speakers (speaker-independent recognition). To increase the recognition rate, the acoustic model is adapted during the recognition process on the basis of at least one recognition result already obtained. Adaptation means extracting the specific information needed to focus on the particular voice characteristics of the current speaker. The process of adapting said current acoustic model is therefore based on an evaluation of speech phrase subunits contained in a speech phrase currently being processed and/or recently recognized. Not only observed units but also unobserved units can be adapted. That is, a speech phrase subunit is evaluated with respect to the acoustical neighbourhood appearing in the evaluated utterance.
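The adaptation step described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the adaptation scheme of any particular recognizer: the acoustic model is reduced to one mean vector per speech phrase subunit, and the names `adapt_subunit` and `rate` as well as the interpolation rule are hypothetical.

```python
from typing import Dict, List

# Hypothetical, heavily simplified acoustic model:
# one mean feature vector per speech phrase subunit.
AcousticModel = Dict[str, List[float]]


def adapt_subunit(model: AcousticModel, subunit: str,
                  observation: List[float], rate: float = 0.1) -> None:
    """Shift the stored mean of `subunit` towards the observed features.

    `rate` is an assumed interpolation weight between 0 and 1; real
    systems use more elaborate update schemes.
    """
    mean = model[subunit]
    model[subunit] = [m + rate * (x - m) for m, x in zip(mean, observation)]


# Usage: adapt the subunit "a" after it was recognized in an utterance.
model: AcousticModel = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
adapt_subunit(model, "a", [1.0, 2.0], rate=0.5)
```

In this sketch only the observed subunit is moved; the adaptation of unobserved units mentioned above is not modelled.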
In applications of common methods and devices for recognizing speech, the speech input often contains, depending on the specific context in which the methods and devices have to work, distinct speech phrases, words or sounds much more often than most other words. For example, when a method for recognizing speech is applied in a traffic information system, phrases and words specific to distinct locations, ways to travel, means of transport, certain commands or the like occur much more often than other words in the vocabulary.
Conventional methods and devices for recognizing speech have the major drawback that, when adapting the current acoustic model, they treat each received speech phrase or word in an equivalent manner. Therefore, received speech phrases or words which occur frequently influence the modification and adaptation of the current acoustic model much more than words or phrases which occur infrequently.
As a result, after conventional methods for adaptation have been applied, these frequently occurring speech phrases or words are recognized with a very small error rate, but the recognition rate for the remaining vocabulary is worse.
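The imbalance described above can be made concrete with a small simulation. It is a sketch only: the one-dimensional model, the lexicon, the word stream and the function name `simulate_equal_weight_adaptation` are all hypothetical. It counts how often each subunit is adapted when every received word is treated equally; it does not model recognition errors themselves.

```python
from collections import Counter
from typing import Dict, List, Tuple


def simulate_equal_weight_adaptation(
    utterances: List[str],
    lexicon: Dict[str, List[str]],
    rate: float = 0.2,
) -> Tuple[Counter, Dict[str, float]]:
    """Adapt every subunit of every received word with the same weight.

    Assumed toy setting: each subunit starts at 0.0 and the current
    speaker's true value is 1.0 for all subunits.
    Returns the number of adaptation steps per subunit and the
    remaining mismatch to the speaker's value.
    """
    means = {unit: 0.0 for units in lexicon.values() for unit in units}
    counts: Counter = Counter()
    for word in utterances:
        for unit in lexicon[word]:
            means[unit] += rate * (1.0 - means[unit])
            counts[unit] += 1
    mismatch = {unit: 1.0 - mean for unit, mean in means.items()}
    return counts, mismatch


# Usage: "train" dominates the input, "zoo" is rare.
lexicon = {"train": ["t", "r", "ey", "n"], "zoo": ["z", "uw"]}
utterances = ["train"] * 9 + ["zoo"]
counts, mismatch = simulate_equal_weight_adaptation(utterances, lexicon)
```

With this input the subunits of the frequent word receive nine adaptation steps each, while those of the rare word receive one, so the adapted model is shaped almost entirely by the frequent word.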