The present invention relates to advantageous aspects of an improved speech recognizer. More particularly, such a speech recognizer can be used for voice operated name or digit dialing in a telephone device such as a cellular or cordless phone, or also for voice controlled operation of the telephone device generally or any other suitable voice controlled device. The speech recognizer can be trained for speaker independent or speaker dependent operation. In case of a telephony system, the speech recognizer can also be incorporated at the network side of the system, in the network itself or at a particular service providing node of the network.
The article "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Laurence R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286 discloses basic principles of speech recognition, and general implementations of speech recognizers for recognition of continuous speech, or for recognition of isolated words or clusters of connected words. This article is incorporated by reference herein in its entirety. Speech recognition of continuous speech, typically involving a large vocabulary of words, is a complicated process, whereas recognition of isolated words, used for small vocabulary word recognition in voice driven control of apparatuses, or, in telephony applications, for name or digit dialing and/or control, for instance, is much simpler to implement.
In said article, methods are disclosed for modeling speech signals to be recognized into statistical models, and, after training of these models, for recognizing unknown speech signals by matching such unknown signals to the models and by taking a statistically most likely model as a recognition result. The recognition result can be used for many applications, including voice command driven operation of apparatuses and, in telephones, name dialing. As described in said article, the speech recognition process is based upon statistically matching feature vectors, representing samples of a speech signal to be recognized, with speech signal models. Such models are statistical models of speech signals, i.e., are models which characterize the statistical properties of the speech signals, e.g., in the form of so-called Hidden Markov Models, HMMs, as described in detail in said article.
Hidden Markov Models are probabilistic functions of so-called Markov chains of states, representing a real-world speech signal as a parametric stochastic process, the parameters initially being obtained by training the model using, for instance, a succession of repetitions of the same known utterance. In such an HMM, used in an isolated word speech recognizer, for instance, an observation representing a speech signal is a probabilistic function of the states of the Markov chain, i.e., the resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not observable (hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations. Observations can represent spectral and/or temporal characteristics of a speech signal. A spectral method for obtaining an observation sequence or feature vector sequence from speech samples is so-called LPC/Cepstral feature analysis, as described on page 227 of said article. Typically, a feature vector comprises some 25 different characteristics characterizing the speech signal to be recognized. In the speech recognizer, given the models and an observation sequence, or vector sequence, derived from an unknown input utterance, the probability of that observation sequence is computed for every model, i.e., scores are determined. The model with the best score, in terms of likelihood, is selected as a tentative recognition result, which can either be accepted or rejected. While scores are determined using a Viterbi algorithm, model state transitions are saved for later use in a back tracking process that determines the state sequence best explaining the observation sequence. For retraining or re-estimating the model, the observation sequence and the saved model state transitions are used. The models can be retrained during normal operation of a speech recognizer, i.e., based upon speech signals to be recognized and recognition transcripts.
Thus, the models can be improved in a statistical sense so as to improve speech recognition. A transcript, sometimes called a label, is the verbal content of an utterance, or an index designation of such a content.
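By way of illustration only, the scoring and back tracking described above can be sketched as follows. The sketch assumes discrete observation symbols and log-domain probabilities, which is a simplification of the continuous feature vectors discussed in said article. The names viterbi, recognize and the (log_pi, log_a, log_b) model layout are hypothetical and serve only this example.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(obs, log_pi, log_a, log_b):
    """Score one model against an observation sequence.

    Returns (best log-likelihood, best state sequence). The per-frame
    back pointers are the saved model state transitions used later for
    back tracking.
    """
    n = len(log_pi)
    delta = [log_pi[s] + log_b[s][obs[0]] for s in range(n)]
    backptr = []  # one list of predecessor states per frame after the first
    for o in obs[1:]:
        prev, step, delta = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log_a[i][j])
            step.append(best_i)
            delta.append(prev[best_i] + log_a[best_i][j] + log_b[j][o])
        backptr.append(step)
    # Back tracking: start at the best final state and follow the pointers.
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return delta[last], path

def recognize(obs, models):
    """Pick the model with the best score as the tentative recognition result."""
    scores = {name: viterbi(obs, *params) for name, params in models.items()}
    best = max(scores, key=lambda name: scores[name][0])
    return best, scores[best][0], scores[best][1]

def make_model(pi, a, b):
    """Convert plain probabilities into the log-domain layout used above."""
    return ([log(p) for p in pi],
            [[log(p) for p in row] for row in a],
            [[log(p) for p in row] for row in b])

# Two toy left-to-right word models over the symbol alphabet {0, 1}.
transitions = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": make_model([1.0, 0.0], transitions, [[0.9, 0.1], [0.1, 0.9]]),
    "no": make_model([1.0, 0.0], transitions, [[0.1, 0.9], [0.9, 0.1]]),
}
name, score, path = recognize([0, 0, 1, 1], models)
```

In this toy setup the tentative recognition result is the word whose model best explains the symbol sequence, and path is the state sequence that a retraining step could reuse.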
In the AT&T Owner's Manual "AT&T 3050/3450 Cellular Telephone", pages 59-65, published 1993, Voice Recognition Features of a Voice Dialer are described, such as training the voice dialer before use, both for voice commands for operating the voice dialer and for constructing a voice dial name list.
In the PCT Patent Application WO 98/25393, a voice dialer is disclosed. A telephone directory of names and telephone numbers which a user frequently dials is maintained. A speech recognizer is used to select records from the directory. The speech recognizer performs similarity measurements by comparing confidence metrics derived from an input representation of a spoken utterance and from a previous representation stored in a record of the directory. As described on page 22, lines 5-14, the user's own dialing habits are used to improve selection of the desired record, and to organize and update a telephone directory. In this respect, the user's calling habits are registered by recording in a frequency field the number of times the user calls a particular number. Furthermore, adding and deleting of records is controlled by frequently prompting the user, to speak again or to cancel, for instance, so as to avoid possible errors in the telephone directory.
In other speech recognition systems, the reliability of the tentative recognition result, obtained by selecting the statistically most likely model in the matching process, is tested. If in a number of tests most or all tests are passed, the tentative recognition result, which is rejected otherwise, is accepted as the final recognition result. One known test is the testing of a so-called anti-model, an anti-model being a model which represents a recognition outcome which can easily be statistically confused with a model of interest. If in such a test the anti-model does not score significantly worse than the model of interest, a user input utterance under test is rejected. In some other systems, this step is optional, and the tentative recognition result becomes the final recognition result without any reliability testing.
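As a rough sketch of such reliability testing, the anti-model test can be expressed as a comparison of log-likelihood scores, and the overall decision as a count of passed tests. The margin value, the function names and the rule "at least min_passed tests" are assumptions made for this example; the text above does not specify them.

```python
def passes_anti_model_test(model_score, anti_model_score, margin=2.0):
    """Pass only if the anti-model scores significantly worse than the model.

    Scores are log-likelihoods; margin is a hypothetical threshold on
    their difference.
    """
    return (model_score - anti_model_score) >= margin

def accept_tentative_result(test_outcomes, min_passed):
    """Accept the tentative result if enough reliability tests passed."""
    return sum(1 for ok in test_outcomes if ok) >= min_passed

# Example: the anti-model is clearly worse, and two of three tests pass.
outcomes = [
    passes_anti_model_test(-10.0, -15.0),  # anti-model 5.0 worse: pass
    passes_anti_model_test(-10.0, -11.0),  # anti-model only 1.0 worse: fail
    True,                                  # outcome of some other test
]
accepted = accept_tentative_result(outcomes, min_passed=2)
```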
Furthermore, as regards training of speech recognition models, it is well-known that in order to get good performance of a set of models, both for sets trained with utterances of many speakers, so-called speaker independent training, and for sets trained with utterances of an individual user, so-called speaker dependent training, the training set should be sufficiently large and preferably should include the conditions under which the models are to be used during normal operation of the speech recognizer, such conditions including ambient noise, transducer and channel distortion, the user's emotional state, and so on. In speaker independent training, the training set should include utterances from speakers having various pronunciations of a particular language. In practice, it is difficult to fulfill all of the above requirements. Collecting training sets for speaker independent training is expensive, tedious and time consuming. Collecting more than some two utterances per word for speaker dependent recognition is considered by many users as an imposition. Furthermore, the pronunciation of a single user may vary with time. Moreover, it is difficult to foresee all the different noise and channel conditions in which the speech recognizer will be used in practice, and to include such conditions in the training set. In addition, the emotional state of people training the system, while being requested to repeat words and sentences, is different from their emotional state when making a call.
In order to mitigate the above problems, some speech recognizers apply model adaptation in which a speech model is adapted or retrained using additional training speech material, obtained under normal operational conditions of the speech recognizer. Model adaptation has been used in voice dictation applications, in which the user's dictation input becomes the retraining material, and the recognized text, as modified by the user via a keyboard using standard text processor techniques, is used as a transcript, e.g., an index in a word table entry. A key point in this method is the fact that, after the user's editing, the text is a correct transcript of the user's utterance. If such a modification step, done by the user, is omitted, and the recognition results are used as the transcript, then, if the recognition results were wrong, the retraining process would degrade the models rather than improve them. Therefore, for effective retraining, it is essential to obtain a reliable transcript of the speech material to be used for retraining.
It is an object of the invention to provide a speech recognition method in which a reliable transcript of a user utterance is available when retraining a speech model during normal use of a speech recognizer.
It is another object of the invention to provide a speech recognition method in which weighting factors of model parameters and estimates of model parameters are selectable.
It is still another object of the invention to provide a speech recognizer for use in a telephone or in a network for voice dialing by a user of the telephone.
It is yet another object of the invention to provide a speech recognizer for use in a telephone for voice command purposes.
In accordance with one aspect of the invention, a method for automatic retraining of a speech recognizer during its normal operation is provided, in which speech recognizer a plurality of trained models is stored, the method comprising:
a) extracting a first feature vector sequence from a sampled input stream of a first user utterance,
b) statistically matching the first feature vector sequence with the stored models so as to obtain likelihood scores for each model, while storing model state transitions,
c) identifying the model with the highest likelihood score as a first tentative recognition result,
d) storing the first feature vector sequence and the first tentative recognition result,
e) informing the user, upon acceptance of the first tentative recognition result, about the first tentative recognition result,
f) determining from the user's behavior, when the user successively operates the speech recognizer, whether the first tentative recognition result was correct, and
g) retraining a model corresponding to the first recognition result, using the stored first feature vector sequence, if the first tentative recognition result was correct.
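Purely as an illustrative sketch, steps b) to g) can be arranged around a small session object that holds the stored feature vector sequence until the user's behavior has been observed. Feature extraction (step a) is assumed to happen upstream. The names RetrainingSession, PendingResult, on_utterance and on_user_behavior are hypothetical, and how "correct" is inferred from the user's behavior is left to the caller here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class PendingResult:
    features: Sequence  # stored first feature vector sequence (step d)
    label: str          # stored first tentative recognition result (step d)

@dataclass
class RetrainingSession:
    # recognize(features) -> (label, score); covers steps b) and c).
    recognize: Callable[[Sequence], Tuple[str, float]]
    # retrain(label, features) updates the model for that label (step g).
    retrain: Callable[[str, Sequence], None]
    pending: Optional[PendingResult] = None

    def on_utterance(self, features: Sequence) -> str:
        """Recognize, store the result, and return it so the user can be informed."""
        label, _score = self.recognize(features)
        self.pending = PendingResult(features, label)
        return label  # step e): the caller presents this to the user

    def on_user_behavior(self, result_was_correct: bool) -> bool:
        """Step f) outcome arrives; retrain only on a confirmed transcript."""
        pending, self.pending = self.pending, None
        if pending is None or not result_was_correct:
            return False
        self.retrain(pending.label, pending.features)
        return True

# Usage with stand-in recognizer and retrainer functions.
retrained = []
session = RetrainingSession(
    recognize=lambda feats: ("alice", -12.5),
    retrain=lambda label, feats: retrained.append((label, list(feats))),
)
shown = session.on_utterance([0.1, 0.2, 0.3])
did_retrain = session.on_user_behavior(result_was_correct=True)
```

Keeping the pending result separate from the retraining call means a wrong tentative result is simply discarded and never reaches the model.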
In part, the invention is based upon the insight that speech recognition results themselves cannot be considered to be reliable. Had they been so, there would have been no need for adaptation. So, it was recognized that there was a need for a supervised adaptation based on the determination of a reliable transcript, whereby a tentative recognition result is accepted or rejected on the basis of the user's subsequent behavior.
A substantial advantage of this approach is that no or only minimal involvement on the part of the user is required. Moreover, a tentative recognition result which was first apparently rejected by the user, still gets a chance of being used to update a speech model, dependent on the subsequent behavior of the user upon the initial rejection. This leads to a great overall improvement of the speech models because of improvement of statistical recognition properties of the models. In terms of variance, this means that the variance of a model becomes larger so that an input utterance has a smaller chance of an unjustified rejection.
In another aspect, the method for automatic retraining of a speech recognizer preferably comprises weighted retraining of the model, a first weighting factor being applied to stored model parameters and a second weighting factor being applied to an estimate of the model parameters obtained from the first or second tentative recognition result. If many utterances have already been used for retraining, the stored model parameters are given a much higher weighting factor than the estimate used for updating the model. This weighting is done because the model is already closer to optimum so that no severe retraining is needed. Also, when uncertainty exists about a transcript because of an initial rejection of a recognition result, the estimate is given a lower weighting factor.
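One possible reading of this weighted retraining is a normalized weighted average of the stored parameters and the new estimate. The normalization by the sum of the weights, the use of the number of prior utterances as the first weighting factor, and the value 0.5 for an uncertain transcript are assumptions of this sketch and are not taken from the text above.

```python
def weighted_update(stored, estimate, w_stored, w_estimate):
    """Blend stored model parameters with a new estimate, element by element."""
    total = w_stored + w_estimate
    return [(w_stored * s + w_estimate * e) / total
            for s, e in zip(stored, estimate)]

def choose_weights(n_prior_utterances, transcript_uncertain):
    """Hypothetical weighting policy.

    The stored parameters gain weight as more utterances have been used;
    the estimate loses weight when the transcript follows an initial rejection.
    """
    w_stored = float(max(n_prior_utterances, 1))
    w_estimate = 0.5 if transcript_uncertain else 1.0
    return w_stored, w_estimate

# Example: a model built from 3 utterances is updated with a trusted estimate.
w_s, w_e = choose_weights(3, transcript_uncertain=False)
updated = weighted_update([0.0, 2.0], [4.0, 6.0], w_s, w_e)
```

With these weights the update moves each parameter one quarter of the way toward the estimate; a larger utterance count or an uncertain transcript makes the step smaller.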
Preferably, for user independently trained and retrained models, such models contain a mixture of parameters describing joint probability distributions for each model. In a country with many dialects, use of such models greatly improves the chance of proper recognition.
The invention can advantageously be used for voice dialing or for voice command recognition in a telephone.