The present invention relates to a method for recognizing speech commands, in which a group of command words selectable by speech commands are defined, a time window is defined, within which the recognition of the speech command is performed, and a first recognition stage is performed, in which the recognition result of the first recognition stage is selected.
The present invention also contemplates a speech recognition device in which a vocabulary of selectable command words is defined. The device includes means for measuring the time used for recognition and comparing it with a predetermined time window, and means for selecting a first recognition result.
The present invention also includes a wireless communication device to be controlled by speech comprising means for recognizing speech commands, in which a vocabulary of selectable command words is defined, the means for recognizing speech commands comprising means for measuring the time used for recognition and comparing it with a predetermined time window, and means for selecting a first recognition result.
For facilitating the use of wireless communication devices, so-called hands free devices have been developed, whereby the wireless communication device can be controlled by speech. Thus, speech can be used to control different functions of the wireless communication device, such as turning on/off, transmission/reception, control of sound volume, selection of telephone number, or answering a call, whereby particularly in the use in a vehicle, it is easier for the user to concentrate on the driving.
One drawback of a speech-controlled wireless communication device is that speech recognition is not fully reliable. The recognition capability of present speech recognizers is not particularly good, especially under difficult conditions such as in a moving car, where loud background noise substantially hampers the reliable recognition of words. Due to this unreliability, users of wireless communication devices have so far shown relatively little interest in control by speech. Incorrect recognition decisions usually cause the most problems in the implementation of the user interface, because they may start undesired functions, such as terminating a call in progress, which is naturally particularly disturbing to the user. An incorrect recognition decision may also result in a call being connected to an incorrect number. For this reason, the user interface is usually designed in such a way that the user is asked to repeat a command if the speech recognizer is not sufficiently certain of the word uttered by the user.
Almost all speech recognition devices are based on the functional principle that a word uttered by the user is compared, by a usually rather complicated method, with a group of reference words previously stored in the memory of the speech recognition device. Speech recognition devices usually calculate a figure for each reference word to describe how much the word uttered by the user resembles the reference word. The recognition decision is finally made on the basis of these figures, so that the reference word which the uttered word resembles most is selected. The best known methods for the comparison between the uttered word and the reference words are dynamic time warping (DTW) and the statistical hidden Markov model (HMM).
In both the DTW and the HMM methods, an unknown speech pattern is compared with known reference patterns. In dynamic time warping, the speech pattern is divided into several frames, and the local distance between the speech segment included in each frame and the corresponding speech segment of the reference pattern is calculated. This distance is calculated by comparing the two speech segments with each other, and it is thus a kind of numerical value for the differences found in the comparison. For speech segments close to each other, a smaller distance is usually obtained than for speech segments further from each other. On the basis of the local distances obtained this way, a minimum path between the beginning and end points of the word is sought by using a DTW algorithm. Thus, by dynamic time warping, a distance is obtained between the uttered word and the reference word. In the HMM method, the generation of speech patterns is modelled with a state change model according to the Markov method. The state change model in question is thus the HMM. In this case, speech recognition on received speech patterns is performed by defining an observation probability on the speech patterns according to the hidden Markov model. In speech recognition using the HMM method, an HMM model is first formed for each word to be recognized, i.e. for each reference word. These HMM models are stored in the memory of the speech recognition device. When the speech recognition device receives the speech pattern, an observation probability is calculated for each HMM model in the memory, and as the recognition result, the counterpart word of the HMM model with the greatest observation probability is obtained. Thus, for each reference word, the probability is calculated that it is the word uttered by the user.
The above-mentioned greatest observation probability describes the resemblance of the received speech pattern and the closest HMM model, i.e. the closest reference speech pattern.
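The dynamic time warping comparison described above can be illustrated with a minimal sketch. It assumes that each frame is a short tuple of numerical features and that the local distance is the Euclidean distance between frames; the function names `dtw_distance` and `recognize_dtw` are hypothetical and are not taken from any particular recognizer.

```python
import math


def dtw_distance(utterance, reference):
    """Accumulated distance along the cheapest alignment path between two
    sequences of feature frames (each frame is a sequence of floats)."""
    n, m = len(utterance), len(reference)
    # cost[i][j] is the cheapest accumulated distance for aligning the first
    # i utterance frames with the first j reference frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between one utterance frame and one reference
            # frame (Euclidean distance is an assumption of this sketch).
            local = math.dist(utterance[i - 1], reference[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = local + min(
                cost[i - 1][j],      # utterance frame repeated against same reference frame
                cost[i][j - 1],      # reference frame repeated against same utterance frame
                cost[i - 1][j - 1],  # both sequences advance by one frame
            )
    return cost[n][m]


def recognize_dtw(utterance, references):
    """Return the reference word whose pattern has the smallest DTW distance."""
    return min(references, key=lambda word: dtw_distance(utterance, references[word]))
```

In this sketch the reference word with the smallest accumulated distance is taken as the recognition result, which corresponds to selecting the reference word that the uttered word resembles most.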
Thus, in present systems the speech recognition device calculates a certain figure for the reference words on the basis of the word uttered by the user. In the DTW method, the figure is the distance between the words, and in the HMM method, the figure is the probability for the equality of the uttered word and the HMM model. When the HMM method is used, the speech recognition device is usually set a certain threshold probability which the most probable reference word must achieve for the recognition decision to be made. Another factor affecting the recognition decision can be e.g. the difference between the probabilities of the most probable and the second most probable word, which must be sufficiently great for the recognition decision to be made. Thus, it is possible that when the background noise has a high volume, on the basis of a command uttered by the user, the reference word in the memory, e.g. the reference word “yes”, obtains at each attempt the greatest probability in relation to the other reference words, e.g. the probability 0.8. If the threshold probability is for example 0.9, the recognition is not accepted and the user may have to utter the command several times until the recognition probability threshold is exceeded and the speech recognition device accepts the command, even though the probability may have been very close to the acceptable value. This is very disturbing to the user.
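The prior-art acceptance rule described above can be sketched as follows. The function name `accept_recognition` and the parameters `threshold` and `min_margin` are hypothetical, and treating the threshold as "greater than or equal" is an assumption of this sketch.

```python
def accept_recognition(probabilities, threshold=0.9, min_margin=0.0):
    """Return the most probable reference word if it passes both checks,
    otherwise None (the user would be asked to repeat the command).

    probabilities: dict mapping each reference word to its probability.
    """
    # Rank the reference words from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_word, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    # Check 1: the best word must reach the threshold probability.
    # Check 2: it must lead the second most probable word by a sufficient margin.
    if best_p >= threshold and best_p - second_p >= min_margin:
        return best_word
    return None
```

With the figures used in the text, a word "yes" that repeatedly obtains the probability 0.8 against a threshold of 0.9 is rejected on every attempt, although it is always the most probable word.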
Furthermore, the speech recognition is hampered by the fact that different users utter the same words in different ways, wherein the speech recognition device works better when used by one user than when used by another user. In practice, it is very difficult with the presently known techniques to adjust the certainty levels of speech recognition devices to consider all users. When adjusting the required certainty level e.g. for the word “yes” in speech recognition devices of prior art, the required threshold is typically set according to so-called worst speakers. Thus, the problem emerges that words close to the word “yes” also become incorrectly accepted. The problem is aggravated by the fact that in some situations, mere background noise may be recognized as command words. In speech recognition devices of prior art, the aim is to find a suitable balance in which a certain part of the users have great problems in having their words accepted and the number of incorrectly accepted words is sufficiently small. If the speech recognition device is adjusted in a way that a minimum number of users have problems in having their words accepted, this means in practice that the number of incorrectly accepted words will increase. Correspondingly, if the aim is set at as faultless a recognition as possible, an increasing number of users will have difficulties in having commands uttered by them accepted.
In speech recognition, errors are generally classified in three categories:
Insertion Error
The user says nothing but a command word is recognized in spite of this, or the user says a word which is not a command word and still a command word is recognized.
Deletion Error
The user says a command word but nothing is recognized.
Substitution Error
The command word uttered by the user is recognized as another command word.
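The three error categories above can be expressed as a small classification routine. This is only an illustrative sketch: the names `RecognitionError` and `classify_error` are hypothetical, `None` stands for "nothing said" or "nothing recognized", and `recognized` is assumed to be either `None` or a word of the vocabulary.

```python
from enum import Enum
from typing import Optional, Set


class RecognitionError(Enum):
    INSERTION = "insertion"
    DELETION = "deletion"
    SUBSTITUTION = "substitution"


def classify_error(
    spoken: Optional[str], recognized: Optional[str], vocabulary: Set[str]
) -> Optional[RecognitionError]:
    """Classify one recognition attempt; return None when there is no error."""
    # A word outside the vocabulary counts the same as no command at all.
    spoken_command = spoken if spoken in vocabulary else None
    if spoken_command == recognized:
        return None  # correct recognition, or correct rejection
    if spoken_command is None:
        return RecognitionError.INSERTION  # command recognized, none uttered
    if recognized is None:
        return RecognitionError.DELETION  # command uttered, nothing recognized
    return RecognitionError.SUBSTITUTION  # one command taken for another
```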
In a theoretical optimum solution, the speech recognition device makes none of the above-mentioned errors. However, in practical situations, as was already presented above, the speech recognition device makes errors of all the said types. For the usability of the user interface, it is important to design the speech recognition device in a way that the relative shares of the different error types are optimal. For example, in speech activation, where a speech-activated device may wait for hours for a certain activation word, it is important that the device is not erroneously activated at random. Furthermore, it is important that the command words uttered by the user are recognized with good accuracy. In this case, however, it is more important that no erroneous activations take place. In practice, this means that the user must repeat the uttered command word more often for it to be recognized correctly with sufficient probability.
In the recognition of a numerical sequence, almost all errors are equally significant. Any error in the recognition of the numbers in a sequence results in a false numerical sequence. Also the situation that the user says nothing and still a number is recognized, is inconvenient for the user. However, a situation in which the user utters a number indistinctly and the number is not recognized, can be corrected by the user by uttering the numbers more distinctly.
The recognition of a single command word is presently a very typical function implemented by speech recognition. For example, the speech recognition device may ask the user: “Do you want to receive a call?”, to which the user is expected to reply either “yes” or “no”. In such situations where there are very few alternative command words, the command words are often recognized correctly, if at all. In other words, the number of substitution errors in such a situation is very small. The greatest problem in the recognition of single command words is that an uttered command is not recognized at all, or an irrelevant word is recognized as a command word. In the following, there are three different alternative situations of this example:
1) A speech-controlled device asks the user: “Do you want to receive a call?”, to which the user replies indistinctly: “Yes . . . ye-”. The device does not recognize the user's reply and asks the user again: “Do you want to receive a call? Say yes or no.” Thus the user may be easily frustrated, if the device often asks the user to repeat the command word uttered.
2) The device asks the user again: “Do you want to receive a call?”, to which the user responds distinctly “yes”. However, the device did not recognize this for certain and wants a confirmation: “Did you say yes?”, to which the user replies again “yes”. Even now, no reliable recognition was made, so the device asks again: “Did you say yes?”. The user must repeat again the reply “yes”, for the device to complete the recognition.
3) Still in a third example situation, the speech-controlled device asks the user, if s/he wants to receive a call. To this, the user mumbles something vague, and in spite of this, the device interprets the user's utterance as the command word “yes” and informs the user “All right, the call will be connected”. Thus, in this situation, the interpretation of the device of the user's vague speech is closer to the word “yes” than to the word “no”. Consequently, in this situation, words that resemble a command word begin to be incorrectly accepted.
In speech recognition methods according to prior art, it is typical to use in the recognition of the command word a time window of fixed length, during which the user must utter the command word. In another speech recognition method of prior art, the recognition probability is calculated for the command word uttered by the user, and if this probability does not exceed a predetermined threshold value, the user is requested to utter the command word again, after which a new calculation of the recognition probability is performed by utilizing the probability calculated in the previous recognition time. The recognition decision is made, if the threshold probability is achieved considering the previous probabilities. In this method, however, the utilization of repetition will easily result in an increase in the chance of the above-mentioned insertion error, wherein upon repeating a word outside the vocabulary, it is more easily recognized as a command word.
It is an aim of the present invention to provide an improved speech recognition method as well as a speech-controlled wireless communication device in which speech recognition is more reliable than in prior art. The invention is based on the idea that the recognition probability calculated for an uttered command word is compared with the probability of background noise, wherein the confidence value thus obtained is used to deduce whether the recognition was positive. If the confidence value remains below a determined threshold for a positive recognition, the time window used in the recognition is extended and a new recognition is performed for the repeated utterance of the command word. If the repeated command word is not recognized at a sufficient confidence value, a comparison between the command words uttered by the user is still performed; thus, in case the recognitions of the words uttered by the user indicate that the user has uttered the same command word two times in succession, the recognition is accepted.
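One possible way of comparing the recognition probability with the probability of background noise is sketched below. The text above does not fix a formula at this point, so the logarithmic ratio used here, the function name `confidence_value` and the parameter names are assumptions made purely for illustration.

```python
import math


def confidence_value(p_command: float, p_noise: float) -> float:
    """Illustrative confidence value: logarithm of the ratio between the
    probability of the best command word and the probability of background
    noise. Both probabilities must be greater than zero.

    The result is positive when the command word explains the utterance
    better than background noise does, and negative in the opposite case.
    """
    return math.log(p_command) - math.log(p_noise)
```

A confidence value obtained this way could then be compared with a threshold to deduce whether the recognition was positive; other ways of forming the comparison are equally conceivable.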
A method according to the present invention includes the following: a first confidence value is determined for the recognition result of the first recognition stage and a first threshold value is determined. The first confidence value is compared with the first threshold value, and if the first confidence value is greater than or equal to the first threshold value, the recognition result of the first recognition stage is selected as the recognition result of the speech command.
If the first confidence value is smaller than the first threshold value, a second recognition stage is performed for the speech command, where the time window is extended and a second confidence value is determined for the recognition result of the second recognition stage. The second confidence value is compared with the first threshold value, and if the second confidence value is greater than or equal to the first threshold value, the command word selected at the second stage is selected as the recognition result for the speech command.
If the second confidence value is smaller than the first threshold value, a comparison stage is performed, where the first and second recognition results are compared to each other to determine a probability that they are substantially the same. If the probability exceeds a predetermined value, the command word selected at the second stage is selected as the recognition result for the speech command.
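The decision flow of the method described above can be outlined with a minimal sketch. The callables `recognize` and `similarity`, the parameter names, and the return value `None` for the case where the comparison stage also fails are hypothetical; the text above does not state what happens in that last case, so rejecting the command there is an assumption of this sketch.

```python
from typing import Callable, Optional, Tuple

# A recognition result: (selected command word, confidence value).
Result = Tuple[str, float]


def recognize_command(
    recognize: Callable[[float], Result],     # hypothetical: runs one recognition stage within a time window
    similarity: Callable[[str, str], float],  # hypothetical: probability that two results are substantially the same
    threshold: float,                         # the first threshold value
    similarity_threshold: float,              # the predetermined value of the comparison stage
    window: float,                            # initial time window, in seconds
    extension: float,                         # amount by which the window is extended
) -> Optional[str]:
    # First recognition stage.
    first_word, first_confidence = recognize(window)
    if first_confidence >= threshold:
        return first_word

    # Second recognition stage with an extended time window, giving the
    # user the possibility to repeat the command word.
    second_word, second_confidence = recognize(window + extension)
    if second_confidence >= threshold:
        return second_word

    # Comparison stage: accept the second result if the two recognition
    # results are substantially the same with sufficient probability.
    if similarity(first_word, second_word) > similarity_threshold:
        return second_word

    # Assumption of this sketch: no command is accepted.
    return None
```

For example, two successive recognitions of "yes" that each stay slightly below the threshold would still be accepted at the comparison stage, whereas "yes" followed by "no" would not.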
The speech recognition device according to the present invention includes means for calculating a first confidence value for the first recognition result, and means for comparing the first confidence value with a predetermined first threshold value, where the first recognition result is arranged to be selected as a final recognition result if the first confidence value is greater than or equal to the first threshold value.
The speech recognition device also includes means for performing the recognition stage of a second speech command, if the first confidence value is smaller than the first threshold value. The means for performing the recognition stage of the second speech command has means for extending the time window, means for selecting a second recognition result of the recognition stage of the second speech command, and means for calculating a second confidence value for the second recognition result. The means for performing the recognition stage of the second speech command also includes means for comparing the second confidence value with the predetermined first threshold value, where the second recognition result is arranged to be selected as the final recognition result, if the second confidence value is greater than or equal to the first threshold value.
In addition, the means for performing the recognition stage of the second speech command further includes means for performing a comparison stage, the comparison stage being arranged to be performed if the second confidence value is smaller than the first threshold value, where the means for performing a comparison stage includes a device for comparing the first and second recognition results to each other to determine a probability that they are substantially the same. If the probability exceeds a predetermined value, the second recognition result is selected as the final recognition result.
Furthermore, the wireless communication device according to the present invention has means for calculating a first confidence value for the first recognition result, and means for comparing the first confidence value with a predetermined first threshold value, where the first recognition result is arranged to be selected as a final recognition result if the first confidence value is greater than or equal to the first threshold value.
The wireless communication device includes means for performing the recognition stage of a second speech command if the first confidence value is smaller than the first threshold value. The means for performing the recognition stage of the second speech command has means for extending the time window, means for selecting a second recognition result of the recognition stage of the second speech command, and means for calculating a second confidence value for the second recognition result. The means for performing the recognition stage of the second speech command also has means for comparing the second confidence value with the predetermined first threshold value, where the second recognition result is arranged to be selected as the final recognition result if the second confidence value is greater than or equal to the first threshold value.
The means for performing the recognition stage of the second speech command still further includes means for performing a comparison stage, the comparison stage being arranged to be performed if the second confidence value is smaller than the first threshold value, where the means for performing a comparison stage includes a device for comparing the first and second recognition results to each other to determine a probability that they are substantially the same, where if the probability exceeds a predetermined value, the second recognition is selected as the final recognition result.
The present invention provides significant advantages over speech recognition methods and devices of prior art. With the method according to the invention, a smaller probability of insertion errors is obtained than is possible to achieve with methods of prior art. In the method of the invention, when the recognition is not certain, the time for interpreting the command word is extended, wherein the user has a possibility to repeat the command word given. According to the invention, it is additionally possible, if necessary, to effectively utilize the repetition of the command word by making a comparison with the command word uttered earlier by the user, which comparison improves the recognition of the command word significantly. Thus, the number of incorrect recognitions can be significantly reduced. The probability of situations in which a command word is recognized although the user did not utter a command word is also reduced significantly. The method of the invention makes it possible to use confidence levels at which the number of incorrectly recognized command words is minimal. Users who do not have their speech commands easily accepted in solutions of prior art can, by repeating the command word according to the invention, significantly improve the probability that their speech commands are accepted.