1. Field of the Invention
The present invention relates to the technical field of man-machine interfaces. In particular, the present invention relates to an apparatus that allows electric home appliances, such as a TV and a video recorder, and a computer to be operated by a voice and/or a gesture, without using an input apparatus based on a button operation such as a remote controller, a mouse, a keyboard, or the like.
2. Description of the Related Art
At present, an input apparatus based on a button operation, such as a remote controller, a mouse, a keyboard, or the like, is widely used for operating electric home appliances such as a TV and a video recorder, and a computer. An apparatus that allows electric home appliances and a computer to be operated by a voice and/or a gesture, without using an input apparatus based on a button operation, is also being developed. JP 2000-326274 A describes a technique of identifying a person and inputting a command with a voice and/or a gesture of a user in a man-machine interface.
According to the technique described in JP 2000-326274 A, visual information for identifying a person is obtained by a plurality of cameras. In this case, capture control such as search control of the position of a subject is conducted using only information obtained by the cameras. Furthermore, voice information used for voice recognition is obtained by a plurality of microphones. In this case, voice input control such as search control of the direction of a voice is conducted using only information obtained by a plurality of microphones disposed on the front, back, left, and right sides of a robot.
Regarding voice input control, the technique described in JP 1(1989)-195499 A is also known. According to the technique described in JP 1(1989)-195499 A, as in a security door, the position of a mouth of an entering person is found based on object detection results obtained by an ultrasonic sensor and picture data captured by a camera, and a microphone is adjusted in the direction of the mouth.
However, the above-mentioned conventional techniques have the following problems.
The conventional technique described in JP 2000-326274 A uses capture information from a camera that corresponds to an eye and voice information from a microphone that corresponds to an ear of an apparatus or a robot; however, the two kinds of information are used independently. The block diagram of FIG. 10, disclosed in JP 2000-326274 A, does not show any exchange of information between picture information processing and voice information processing. Therefore, the technique described in JP 2000-326274 A has a problem in that a picture of a person or a mannequin may be recognized as a human being, and voice information from a loudspeaker of acoustic equipment may be recognized as a human voice. Such recognition is not intended in a man-machine interface. A picture of a person, a mannequin, and a sound other than a human voice may act as noise for picture recognition and voice recognition, which decreases the recognition ratio. Furthermore, unnecessary information processing is conducted on picture information and voice information obtained from an undesired target, which decreases the processing speed.
According to the technique described in JP 1(1989)-195499 A, as shown in FIG. 11, positional information on a search target from an ultrasonic sensor and a camera is used for controlling the direction of a microphone; however, processing results of voice information are not used for that control. Furthermore, processing results of voice information from the microphone are not used for position detection control of the search target by the ultrasonic sensor and the camera. According to the technique described in JP 1(1989)-195499 A, in the case where a person enters an area (e.g., a door position of a room) where sensing and capturing are conducted by the ultrasonic sensor and the camera for the purpose of detecting an object, a voice can be efficiently obtained by adjusting the direction of the microphone. However, this technique is effective only in the case where a narrow search area, such as a door position of a room, is set in advance. Generally, in the case where there is no such limited search area, it may often be assumed that a person stands away from the ultrasonic sensor and the camera and inputs a command through a voice. The technique described in JP 1(1989)-195499 A cannot flexibly handle such a situation.
Therefore, with the foregoing in mind, it is an object of the present invention to select appropriate information as input information in a man-machine interface, thereby preventing a malfunction of the man-machine interface and enhancing a recognition ratio and a processing speed.
In order to solve the above-mentioned problem, a human interface system using a plurality of sensors according to the present invention includes: at least two kinds of sensors, each determining a range of a detection target and a detection sensitivity and acquiring a particular detection signal from the detection target at the detection sensitivity, the detection signals acquired by the sensors being of different types; a total analyzing part for investigating whether or not there is inconsistency among signal detection results detected by the respective sensors, and generating control information to the respective sensors; an application utilizing the signal detection results acquired by the respective sensors; and communication units for communicating data and control information between the respective sensors, between the respective sensors and the total analyzing part, and between the total analyzing part and the application, wherein each of the sensors uses either of or a combination of the signal detection results or control information obtained from the other sensors, and the control information obtained from the total analyzing part, thereby determining a range of a detection target and a detection sensitivity at a time of subsequent signal acquisition, each of the sensors outputs its signal detection results and control information used by the other sensors for determining a range of a detection target and a detection sensitivity at a time of subsequent signal acquisition, to the other sensors through the communication units, and the total analyzing part outputs control information used by each of the sensors for determining a range of a detection target and a detection sensitivity at a time of subsequent signal acquisition through the communication units.
Because of the above-mentioned configuration, an excellent human interface system can be provided, in which recognition results of a plurality of different kinds of sensors can be cross-referenced, and signal acquisition control can be conducted so as not to cause inconsistency among the sensors, whereby a command inputted by a user can be recognized more exactly.
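By way of illustration only, the role of the total analyzing part in the above-mentioned configuration can be sketched as follows. The sketch is not taken from the specification; it assumes that each sensor reports the direction of its detection target as a single bearing in degrees, that inconsistency means the bearings differ by more than a tolerance or that some sensor detected nothing, and that control information consists of a search range and a sensitivity. The names `Detection`, `Control`, `analyze`, and `tolerance_deg`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Detection:
    sensor: str                   # hypothetical sensor name, e.g. "image" or "voice"
    bearing_deg: Optional[float]  # direction of the detected target; None if nothing detected


@dataclass
class Control:
    center_deg: float   # center of the range to search at the next signal acquisition
    width_deg: float    # width of that range
    sensitivity: float  # relative detection sensitivity (1.0 is the assumed default)


def analyze(detections: List[Detection],
            tolerance_deg: float = 15.0) -> Tuple[bool, Dict[str, Control]]:
    """Check the detections for inconsistency and build control information per sensor."""
    bearings = [d.bearing_deg for d in detections if d.bearing_deg is not None]
    all_detected = len(bearings) > 0 and len(bearings) == len(detections)
    consistent = all_detected and (max(bearings) - min(bearings) <= tolerance_deg)

    if consistent:
        # The sensors agree: narrow the range around the agreed direction
        # and raise the sensitivity for the subsequent acquisition.
        center = sum(bearings) / len(bearings)
        width, sensitivity = 2 * tolerance_deg, 1.5
    else:
        # The sensors disagree or one detected nothing: fall back to a wide
        # search at the default sensitivity (assumed values).
        center, width, sensitivity = 0.0, 180.0, 1.0

    controls = {d.sensor: Control(center, width, sensitivity) for d in detections}
    return consistent, controls


if __name__ == "__main__":
    ok, ctrl = analyze([Detection("image", 10.0), Detection("voice", 20.0)])
    print(ok, ctrl["voice"])
```

In this sketch, a sound source reported by the voice sensor in a direction where the image sensor sees no person is treated as inconsistent, so no sensor narrows its range toward it.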
Furthermore, in the above-mentioned configuration, it is preferable that the detection target is a human being, and the sensors include at least an image sensor, a voice sensor, and an auxiliary sensor, a detection signal of the image sensor is human picture recognition information, the image sensor includes an action recognizing part for interpreting an action of the detection target based on picture recognition results, and recognizing a command inputted through a gesture, a detection signal of the voice sensor is human voice recognition information, the voice sensor includes a voice recognizing part for interpreting a voice of the detection target based on voice recognition results and recognizing a command inputted through a voice, and a detection signal of the auxiliary sensor is information useful for detecting human position information.
Because of the above-mentioned configuration, an excellent human interface system can be provided, in which action recognition results of the image sensor, voice recognition results of the voice sensor, and results of person's position information detected by the other sensors (i.e., auxiliary sensor) are referred to, whereby a command inputted by a user can be recognized more exactly without inconsistency.
In addition to a combination of action recognition results of an image sensor, voice recognition results of a voice sensor, and person's position information from the other sensors, the following combinations of sensors and recognition results are also possible: a combination of action recognition results of an image sensor and voice recognition results of a voice sensor; a combination of action recognition results of an image sensor and person's position detection results of the other sensors; and a combination of voice recognition results of a voice sensor and person's position detection results of the other sensors.
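The combinations described above can be illustrated by a minimal sketch of how a command might be accepted from whichever results are available. The sketch is an assumption made for illustration, not the method of the specification: the function name `fuse_command`, the rejection rules, and the command strings are all hypothetical. A value of `None` stands for a sensor that is not part of the combination or that recognized nothing.

```python
from typing import Optional


def fuse_command(gesture: Optional[str],
                 voice: Optional[str],
                 person_present: Optional[bool]) -> Optional[str]:
    """Return the accepted command, or None when the results are inconsistent.

    gesture:        command recognized by the image sensor, if any
    voice:          command recognized by the voice sensor, if any
    person_present: result of the auxiliary sensor, if one is used
    """
    # The auxiliary sensor reports that nobody is there, so the picture or
    # sound is assumed to come from a photograph, a mannequin, or a loudspeaker.
    if person_present is False:
        return None
    # Both recognizers produced a command, but the commands disagree.
    if gesture is not None and voice is not None and gesture != voice:
        return None
    # Otherwise accept whichever command is available (None if neither is).
    return gesture if gesture is not None else voice


if __name__ == "__main__":
    print(fuse_command("power_on", "power_on", True))  # image + voice + auxiliary
    print(fuse_command("volume_up", None, None))        # image sensor alone reporting
    print(fuse_command(None, "power_on", False))        # voice + auxiliary, no person
```

Because every argument is optional, the same function covers the three-sensor combination and each of the two-sensor combinations listed above.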
These and other advantages of the present invention will become apparent to those skilled in the art upon reading and understanding the following detailed description with reference to the accompanying figures.