This invention is related to automatic speech recognition (ASR), in particular to methods to perform an unsupervised or on-line adaptation of an automatic speech recognition system and to a speech recognition system being able to carry out the inventive methods.
State of the art speech recognizers consist of a set of statistical distributions modeling the acoustic properties of certain speech segments. These acoustic properties are encoded in feature vectors. As an example, one Gaussian distribution can be taken for each phoneme. These distributions are attached to states. A (stochastic) state transition network (usually Hidden Markov Models) defines the probabilities for sequences of states and sequences of feature vectors. Passing a state consumes one feature vector covering a frame of e.g. 10 ms of the speech signal.
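As an illustration only, the following minimal sketch scores one feature vector against one Gaussian distribution per phoneme and picks the best-matching phoneme. The diagonal covariance, the two-dimensional feature vectors, the phoneme labels and the names `log_gaussian`, `best_phoneme` and `MODELS` are assumptions made for the example and are not taken from the description; a real recognizer would additionally weigh the state transition probabilities over a sequence of frames.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def best_phoneme(x, models):
    """Return the phoneme whose Gaussian gives x the highest log-likelihood."""
    return max(models, key=lambda p: log_gaussian(x, *models[p]))


# Hypothetical models: phoneme label -> (mean vector, variance vector).
MODELS = {
    "a": ([1.0, 0.0], [1.0, 1.0]),
    "i": ([-1.0, 2.0], [1.0, 1.0]),
}
```

For example, `best_phoneme([0.9, 0.1], MODELS)` selects the phoneme whose mean lies closest to the feature vector, since both hypothetical models share the same variances.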
The stochastic parameters of such a recognizer are trained using a large amount of speech data either from a single speaker yielding a speaker dependent (SD) system or from many speakers yielding a speaker independent (SI) system.
Speaker adaptation (SA) is a widely used method to increase recognition rates of SI systems. State of the art speaker dependent systems yield much higher recognition rates than speaker independent systems. However, for many applications, it is not feasible to gather enough data from a single speaker to train the system. In the case of a consumer device this might not even be wanted. To overcome this mismatch in recognition rates, speaker adaptation algorithms are widely used in order to achieve recognition rates that come close to those of speaker dependent systems while using only a fraction of the speaker dependent data that such systems require. These systems initially take speaker independent models that are then adapted so as to better match the speaker's acoustics.
Usually, the adaptation is performed in a supervised manner. That is, the words spoken are known and the recognizer is forced to recognize them. In this way a time alignment of the segment-specific distributions is achieved. The mismatch between the actual feature vectors and the parameters of the corresponding distribution forms the basis for the adaptation. The supervised adaptation requires an adaptation session to be done with every new speaker before he/she can actually use the recognizer.
FIG. 5 shows a block diagram of such an exemplary speech recognition system according to the prior art. The spoken utterances received with a microphone 51 are converted into a digital signal in an A/D conversion stage 52 that is connected to a feature extraction module 53 in which a feature extraction is performed to obtain a feature vector e.g. every 10 ms. Such a feature vector is used either for training of a speech recognition system or, after training, for adaptation of the initially speaker independent models and, during use of the recognizer, for the recognition of spoken utterances.
For training, the feature extraction module 53 is connected to a training module 55 via the contacts a and c of a switch 54. The training module 55 of the exemplary speech recognition system working with Hidden Markov Models (HMMs) obtains a set of speaker independent (SI) HMMs. This is usually performed by the manufacturer of the automatic speech recognition device using a large data base comprising many different speakers.
After the speech recognition system loads a set of SI models, the contacts a and b of the switch 54 are connected so that the feature vectors extracted by the feature extraction module 53 are fed into a recognition module 57 so that the system can be used by the customer and adapted to him/her. The recognition module 57 then calculates a recognition result based on the extracted feature vectors and the speaker independent model set. During the adaptation to an individual speaker the recognition module 57 is connected to an adaptation module 58 that calculates a speaker adapted model set to be stored in a storage 59. In the future, the recognition module 57 calculates the recognition result based on the extracted feature vector and the speaker adapted model set. A further adaptation of the speaker adapted model set can be repeatedly performed to further improve the performance of the system for specific speakers. There are several existing methods for speaker adaptation, such as maximum a posteriori adaptation (MAP) or maximum likelihood linear regression (MLLR) adaptation.
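As an illustration of the MAP approach named above, the following minimal sketch shows one commonly used form of the MAP update for a single Gaussian mean: the speaker independent mean acts as a prior and is interpolated with the sum of the feature vectors aligned to that Gaussian. The function name `map_adapt_mean`, the prior weight `tau` and its default value are assumptions made for the example; the hard assignment of frames to one Gaussian is a simplification, and the description does not prescribe this particular formula.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP-style update of one Gaussian mean (illustrative sketch).

    prior_mean: speaker independent mean vector.
    frames:     feature vectors aligned to this Gaussian (may be empty).
    tau:        assumed prior weight; larger values adapt more slowly.
    """
    n = len(frames)
    dim = len(prior_mean)
    # Sum of the adaptation frames, per feature dimension.
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Interpolate between the prior mean and the adaptation data.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]
```

With few frames the result stays close to the speaker independent mean; as more frames from the speaker accumulate, it moves toward the mean of that speaker's data.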
Usually, the speaker adaptation techniques modify the parameters of the Hidden Markov Models so that they better match the new speaker's acoustics. As stated above, normally this is done in batch or off-line adaptation. This means that a speaker has to read a pre-defined text, which is then processed to do the adaptation, before he/she can use the system for recognition. Once this is finished the system can be used for recognition. This mode is also called supervised adaptation, since the text is known to the system and a forced alignment of the corresponding speech signal to the models corresponding to the text is performed and used for adaptation.
However, an unsupervised or on-line method is better suited for most kinds of consumer devices. In this case, adaptation takes place while the system is in use. The recognized utterance is used for adaptation and the modified models are used for recognizing the next utterance and so on. In this case the spoken text is not known to the system, but the word(s) that were recognized are taken instead.
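The order of operations in such an on-line scheme can be sketched as follows. This is a schematic sketch only: `online_adaptation` is a hypothetical name, and `recognize` and `adapt` are placeholder callables supplied by the caller that stand in for the recognition and adaptation steps; none of them is defined in the description.

```python
def online_adaptation(utterances, models, recognize, adapt):
    """Recognize each utterance, then adapt the models with the result.

    recognize(features, models) -> hypothesis
    adapt(models, features, hypothesis) -> updated models
    Both callables are placeholders supplied by the caller.
    """
    results = []
    for features in utterances:
        # The spoken text is unknown, so the recognized hypothesis is used.
        hypothesis = recognize(features, models)
        results.append(hypothesis)
        # The modified models are used for the next utterance.
        models = adapt(models, features, hypothesis)
    return results, models
```

Because the hypothesis is fed back without any check, a misrecognized utterance also changes the models in this sketch.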
EP 0 763 816 A2 proposes to use confidence measures as an optimization criterion for HMM training. These confidence measures are additional knowledge sources used for the classification of a recognition result as “probably correct” or “probably incorrect”. Here, confidence measures are used for verification of the n best recognized word strings, and the result of this verification procedure, i.e. the derivative of the loss function, is used as the optimization criterion for the training of the models. In this case, all utterances are used for training and the method is used to maximize the difference in the likelihood of confusable words. However, this document relates only to HMM training prior to system use.
On the other hand, the EP 0 793 532 A2 discloses a method to correct misrecognition by uttering a predefined keyword “oops” whereafter the user might correct the misrecognized words by typing or the system tries to correct the error itself. In any case, the system only trains/adapts the speech models when a (series of) word(s) has been misrecognized.
The present invention is concerned with the adaptation of speaker independent Hidden Markov Models in speech recognition systems using unsupervised or on-line adaptation. In these systems the HMMs have to be steadily refined after each new utterance or even after parts of utterances. Furthermore, the words that come into the system are not repeated several times and are not known to the system. Therefore, only an incremental speaker adaptation is possible, i.e. only very little adaptation data is available at a time. Additionally, the problem arises that misrecognitions occur depending on the performance of the speaker independent system, because the output of the recognition module has to be assumed to be the correct word. These words are then used for adaptation and, if a word was misrecognized, the adaptation algorithm will modify the models in a wrong way. The recognition performance might decrease drastically when this happens repeatedly.
Therefore, it is the object underlying the present invention to propose a method and a device for unsupervised adaptation that overcome the problems described above in connection with the prior art.
The inventive methods are defined in independent claims 1 and 17 and the inventive device is defined in independent claim 23. Preferred embodiments thereof are respectively defined in the following dependent claims.
According to the invention, a kind of measurement indicates how reliable the recognition result was. The adaptation of the system is then based on the grade of the reliability of said recognition result. Therefore, this method according to the present invention is called semi-supervised speaker adaptation, since no supervising user or fixed set of vocabulary for adaptation is necessary.
In case of a reliable recognition an utterance can be used for adaptation to a particular speaker, but in case of an unreliable recognition the utterance is discarded to avoid a wrong modification of the models. Alternatively, depending on the grade of the reliability a weight can be calculated that determines the strength of the adaptation.
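Both alternatives can be illustrated with the following minimal sketch, in which the reliability is assumed to be available as a confidence value between 0 and 1. The names `adaptation_weight` and `adapt_mean`, the threshold of 0.5, the base rate of 0.2 and the simple interpolation of a mean vector are assumptions made for the example; the description does not fix how the reliability is measured or how the weight is derived from it.

```python
def adaptation_weight(confidence, threshold=0.5):
    """Map a confidence in [0, 1] to an adaptation weight.

    Below the (assumed) threshold the utterance is discarded (weight 0);
    otherwise the confidence itself is used as the weight.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < threshold:
        return 0.0
    return confidence


def adapt_mean(mean, observed_mean, confidence, rate=0.2, threshold=0.5):
    """Move a model mean toward the observed mean, scaled by reliability."""
    w = rate * adaptation_weight(confidence, threshold)
    return [m + w * (o - m) for m, o in zip(mean, observed_mean)]
```

An utterance recognized with a confidence of 0.3 leaves the mean unchanged under these assumptions, whereas a confidence of 1.0 applies the full base rate.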