Hands-free voice control of equipment is useful in many places, e.g. in industrial environments or in the operating rooms of hospitals, for reasons of hygiene, safety or convenience. For adequate performance of voice control or speech control of equipment, speech recognition systems are incorporated. For such speech recognition systems, it is important that the captured voice or speech signals be of very good quality. Other sound and noise sources have a large impact and may render a speech recognition system useless. In order to improve the quality of the speech signals, a variety of signal processing techniques may be used, e.g. filtering, noise suppression and beamforming. In the case of beamforming techniques, the beams can be steered using the captured audio signals or, in more advanced systems, by using additional video signals. The steering is only possible if the location or position of the controlling user with respect to the system is known. Audio localization techniques provide the location of sound sources. Persons can be identified using computer vision techniques. The two techniques may be combined to determine the controlling or desired user. Sometimes feedback from the speech recognizer is used to determine who should be controlling the system, for example when a user says an activation command.
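The dependence of beam steering on a known user position can be illustrated with a minimal delay-and-sum sketch. It is not taken from any of the systems discussed here. The function names `steering_delays` and `delay_and_sum`, the microphone coordinates, the whole-sample rounding and the speed of sound of 343 m/s are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed approximate value for air at room temperature


def steering_delays(mic_positions, source_position, speed=SPEED_OF_SOUND):
    """Return per-microphone delays (in seconds) that time-align one source.

    Sound reaches the closer microphones first, so each channel is delayed
    by the difference between the longest path and its own path.
    """
    distances = [math.dist(mic, source_position) for mic in mic_positions]
    farthest = max(distances)
    return [(farthest - d) / speed for d in distances]


def delay_and_sum(channels, delays, sample_rate):
    """Average the channels after delaying each by a whole number of samples."""
    shifts = [round(d * sample_rate) for d in delays]
    length = min(len(ch) for ch in channels)
    output = []
    for n in range(length):
        total = 0.0
        for ch, shift in zip(channels, shifts):
            k = n - shift  # index of the sample that lands at n after the delay
            if 0 <= k < length:
                total += ch[k]
        output.append(total / len(channels))
    return output


# Hypothetical three-microphone line array (metres) and a user 2 m in front of it.
mics = [(-0.1, 0.0), (0.0, 0.0), (0.1, 0.0)]
user = (0.0, 2.0)
delays = steering_delays(mics, user)
```

Signals arriving from the assumed user position add coherently after the delays are applied, while sound from other directions is attenuated. If the position estimate is wrong, the delays are wrong as well, which is why the steering relies on localization.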
US 2006/0104454 A1 describes a system for selectively picking up a speech signal that focuses on a speaker, within a group of speakers, who wishes to communicate something to the system. An image analysis algorithm identifies, based on a recognition feature, the position of at least one person who wishes to give the system voice commands. The detected position is used to adapt a directional microphone to the at least one person.
In clinical settings, the users of voice control may be doctors, such as cardiologists or surgeons. In general, they use voice control during diagnosis or intervention. The environment is often sterile, and the doctors typically wear a mouth cap. In industrial settings, the technicians often wear a complete mask. Finding the persons who are speaking in such settings may be a hard task. Audio localization techniques alone are not sufficient to track or locate sound sources because of the noisy environment and the many persons talking. Computer vision may also fail in the case where the face or a part thereof is covered.
Therefore, an improved system and method for localizing the position of a person controlling equipment by voice would be advantageous. In particular, a more reliable system and method for localizing that position would be advantageous in the case where the face or a part thereof is covered.