The invention finds a particularly advantageous application in the field of automatic speech recognition, in particular in multimedia terminals such as new-generation mobile terminals, personal digital assistants (PDAs), and remote controls incorporating a microphone.
In theory, when communicating with a voice server equipped with an automatic speech recognition system, a user of a mobile terminal, for example, does not have to perform any particular action to tell the recognition system that he is about to utter a voice sequence. In fact, the system is either always listening to the user or is able to determine, from the structure of the dialogue between the server and the user, when the user is going to speak.
If it is always listening to the user, the recognition system searches the continuous sound stream that it receives for time periods that might correspond to voice sequences uttered by the user. This search is effected by means of a voice activity detector, as is known in the art. Of course, for this system to work correctly, voice activity detection must not generate too many false alarms or, failing this, the automatic speech recognition mechanism must reject the false alarms.
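By way of illustration only, the search of a continuous sound stream for candidate voice periods can be sketched with a simple frame-energy detector. This is a minimal sketch and not the detector of any particular system: the function name `detect_voice_segments`, the frame length, the energy threshold, and the hangover count are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def detect_voice_segments(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed: 20 ms frames at 8 kHz
    threshold: float = 0.01,   # assumed mean-square energy threshold
    hangover: int = 2,         # assumed: quiet frames tolerated inside a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of periods that might be voice."""
    segments: List[Tuple[int, int]] = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive low-energy frames seen inside the segment
    n_frames = len(samples) // frame_len

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i * frame_len
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                # Close the segment at the end of the last loud frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                start = None
                quiet = 0

    if start is not None:
        # The stream ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments
```

A detector of this kind reports any sufficiently loud period, whether speech or noise, which is why false alarms arise and must either be kept rare or be rejected by the recognition mechanism.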
This is why voice activity detection gives the best results in a close miking context, with the microphone close to the mouth of the speaker, favoring reception of the voice of the user over background noise that interferes with speech recognition.
At present, however, with the growth of multimedia terminals, “hands-free” miking is becoming increasingly widespread, in order to enable the user to listen to voice messages while simultaneously reading information displayed on the screen of the terminal. This makes automatic speech recognition more difficult, because the level of the wanted voice signal decreases while the background noise remains constant.
Moreover, as the user now has available media other than voice, it is becoming difficult for a recognition system to determine when the speaker is going to utter a voice sequence from the structure of the dialogue alone.
It is to remedy these drawbacks that some terminals are equipped with means enabling the user to trigger voice recognition processing, for example by pressing a key of a device known as a “push-to-talk” device. When the speaker starts to utter a voice sequence, he presses the key of this device to indicate to the server that the subsequent sound signal is a voice sequence that the speech recognition system must process. The speaker releases the key on finishing said voice sequence. Thus the system attempts to recognize speech only while the user is pressing the key of the “push-to-talk” device, which prevents false alarms during periods in which the key is not pressed.
However, the “push-to-talk” device has the drawback that if the user begins to utter a voice sequence before pressing the key of the device, or continues said sequence after releasing the key, then the recognition system will not use the real sequence but will rather use a sequence that has been truncated in time.
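The truncation described above can be illustrated numerically. The following is a minimal sketch under stated assumptions: the function name `push_to_talk_window` is hypothetical, times are expressed in milliseconds, and the recognition system is assumed to process only the signal received while the key is held.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms)


def push_to_talk_window(
    utterance: Interval, key_press: Interval
) -> Tuple[Optional[Interval], int, int]:
    """Return (captured interval, ms lost at the start, ms lost at the end).

    The captured interval is the part of the utterance that coincides with
    the key press; it is None if the two intervals do not overlap at all.
    """
    u_start, u_end = utterance
    k_start, k_end = key_press

    cap_start = max(u_start, k_start)
    cap_end = min(u_end, k_end)
    captured = (cap_start, cap_end) if cap_start < cap_end else None

    # Speech uttered before the key was pressed.
    lost_at_start = max(0, min(k_start, u_end) - u_start)
    # Speech uttered after the key was released.
    lost_at_end = max(0, u_end - max(k_end, u_start))
    return captured, lost_at_start, lost_at_end


# The speaker talks from 1000 ms to 3000 ms but holds the key only
# from 1200 ms to 2800 ms: 200 ms is cut from each end of the sequence.
print(push_to_talk_window((1000, 3000), (1200, 2800)))
```

In this example the recognition system would receive only the interval from 1200 ms to 2800 ms, that is to say a sequence truncated at both ends.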
Thus the technical problem to be solved by the present invention is to propose a method of synchronizing an operation of processing, by automatic recognition, speech in a voice sequence uttered by a speaker and an action by said speaker intended to trigger said processing, in such a manner as to reduce recognition errors that could arise because of imperfect synchronization between the triggering action performed by the speaker and the start and the end of the uttered voice sequence.