The present invention relates to a multi-modal interface apparatus and a method for effective interaction between a user and a computer system by detecting the user's gaze object.
Recently, computer systems such as personal computers have become able to input and output multi-media information such as sound and images, in addition to input by a keyboard or a mouse and output of characters and images on a display. In this situation, with the development of natural language analysis, speech recognition, and speech synthesis techniques, a speech interaction system that interacts with the user through speech input/output is required. Furthermore, in addition to speech input/output, visual information about the user can be inputted by a camera. A multi-modal interface system that interacts with the user by using various kinds of input/output devices, such as a touch panel, a pen, a tablet, a data glove, a foot switch, a head-mounted display, and a force display, is also required.
In short, by using a multi-modal interface system that includes various kinds of input/output devices, the user can interact naturally with the computer system. The multi-modal interface system is an effective way to realize a natural and useful human interface for the user.
Concretely speaking, in a dialogue between two persons, communication is not carried out through only one medium (for example, speech). A person naturally speaks while using non-verbal messages such as gestures and looks (facial expressions). Therefore, in order to realize a natural and useful human interface, interaction is expected to use, in addition to speech input/output, non-verbal messages conveyed through various kinds of input/output media such as the touch panel, the pen, the tablet, the data glove, the foot switch, the head-mounted display, and the force display.
However, the analysis accuracy of input from each medium is low, and the characteristics of the input/output media are not sufficiently clear. Therefore, a multi-modal interface apparatus that effectively uses a plurality of input/output media and reduces the user's load has not yet been realized. For example, in recognition processing of speech input, recognition errors are caused by ambient noise. In recognition processing of gesture input, the signal that the user intends as an input message is detected incorrectly from the signals sequentially inputted through the input media. As a result, erroneous operations occur and the user's load increases.
The user inputs speech or gestures to the multi-modal interface apparatus as intentional input. However, the user often speaks or gestures to another person near the interface apparatus. In this case, the interface apparatus mistakenly recognizes the user's speech or gesture as an input signal to the interface apparatus. As a result, an erroneous operation of the interface apparatus occurs, and the user must cancel it. In order to avoid this situation, the user must constantly pay attention to the input of the interface apparatus.
Even in the above case, where recognition of the input signal is unnecessary, the multi-modal interface apparatus continuously recognizes the input signal. Therefore, the processing speed of other services and the utilization efficiency fall because of the processing load of the recognition. In order to solve this problem, the user often performs a special operation to select a recognition mode, i.e., a button push or a menu selection. However, this special operation is not originally necessary in conversation between humans and is an unnatural interface operation. In this case, the user's load also increases. For example, if the speech input mode is selected by a button operation, the merit of the speech medium is not fully exploited. In short, speech input originally allows the user to communicate by mouth only: even if the user is working with his hands, his work is not disturbed while he speaks to the interface apparatus. If the user must operate a button to select the speech input mode, this original merit of the speech medium is lost.
In communication between humans, non-verbal messages such as eye contact, gaze position, gestures, and looks are very important for smooth communication. However, in the multi-modal interface apparatus of the prior art, non-verbal messages are not used at all. For example, in the multi-modal interface apparatus, output information such as a dynamic image or characters spanning a plurality of screens changes continuously on the display. In this case, if the user does not pay attention to the display, he cannot receive all or part of the presented information. Alternatively, when changing information is presented to the user, only a predetermined volume of information that the user can receive at one time is presented, and the next volume is presented when the user indicates an update operation. However, in this case, the user's load also increases because of the confirmation operation. The user is often puzzled by the operation, and the operational efficiency of the system falls.
As another problem of the multi-modal interface apparatus of the prior art, processing of touch sensor input, image input, and distance sensor input is explained. When the user inputs a pointing gesture through the touch sensor, the pointed object is identified from the output information of the touch sensor, i.e., coordinate data, time series data, input pressure data, or input time interval. In the case of image input, for example, an image of the user's hand is inputted by using one or more cameras. The shape or action of the hand is analyzed by a method disclosed in "Uncalibrated Stereo Vision With Pointing for a Man-Machine Interface" (R. Cipolla, et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Application, pp. 163-166, 1994). In short, the user can indicate an object in the real world or on the display by gesture input. In the case of a distance sensor using infrared rays, the position, shape, and action of the user's hand are analyzed and recognized. Therefore, in the same way as with image input, the user can indicate the object by gesture input.
However, in the interface method that detects the shape, position, or movement of the user's hand by using a camera, sufficient detection accuracy is not obtained. In short, a gesture that the user intends to input is not correctly recognized, or a shape or movement of the hand that the user does not intend as input is mistakenly extracted as a gesture. As a result, the erroneous activation caused by the erroneous recognition must be corrected. When a gesture that the user intends to input is not correctly inputted to the system, the user must input the gesture again, and the user's load greatly increases.
The user's gesture input is passed to the system only after the gesture has been completely analyzed by the system. Therefore, while the user is making the gesture, he cannot tell whether the system is correctly recognizing the gesture input. For example, if the start timing of the gesture is not correctly detected by the system, erroneous recognition results and the user must input the gesture again. Alternatively, even if the user is not inputting a gesture, the system may mistakenly recognize the user's unconscious action as a gesture and execute an erroneous activation. In this case, the user must correct the effect of the erroneous activation.
Furthermore, in the gesture recognition method using a contact-type input device such as the touch sensor, the user indicates a part of the input device itself. Therefore, he cannot input a pointing gesture to indicate an object in the real world other than the input device. On the other hand, in the recognition method for pointing gesture input using a non-contact system such as a camera or an infrared ray sensor, the user can indicate an object or a place in the real world. However, the system cannot properly inform the user of the indicated object or place while he is inputting the gesture to point at it.
It is an object of the present invention to provide a multi-modal interface apparatus and a method for smooth communication between the user and the apparatus by using the user's gaze object.
It is another object of the present invention to provide a multi-modal interface apparatus and a method for the user to effectively operate the apparatus by using the user's looks.
According to the present invention, there is provided a multi-modal interface apparatus, comprising: gaze object detection means for detecting a user's gaze object; media input means for inputting at least one medium of sound information, character information, image information and operation information from the user; personified image presentation means for presenting a personified image to the user based on the user's gaze object; and control means for controlling a reception of inputted media based on the user's gaze object.
Further in accordance with the present invention, there is also provided a multi-modal interface method, comprising the steps of: detecting a user's gaze object; inputting at least one medium of sound information, character information, image information and operation information from the user; presenting a personified image to the user based on the user's gaze object; and controlling a reception of inputted media based on the user's gaze object.
Further in accordance with the present invention, there is also provided a computer readable memory containing computer readable instructions, comprising: instruction means for causing a computer to detect a user's gaze object; instruction means for causing a computer to input at least one medium of sound information, character information, image information and operation information from the user; instruction means for causing a computer to present a personified image to the user based on the user's gaze object; and instruction means for causing a computer to control a reception of inputted media based on the user's gaze object.
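The four steps recited above (detecting the gaze object, inputting media, presenting the personified image, and controlling reception) can be illustrated by the following minimal Python sketch. It is not the implementation of the invention: the class `GazeGatedInterface`, the label `AGENT`, and the expression names "listening" and "idle" are hypothetical names introduced only for this illustration, and the sketch assumes that reception is enabled exactly while the detected gaze object is the personified image.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label for the case where the user looks at the personified image.
AGENT = "personified_image"


@dataclass
class GazeGatedInterface:
    """Toy model: media input is received only while the user gazes at the agent."""
    accepted: List[str] = field(default_factory=list)
    agent_expression: str = "idle"
    receiving: bool = False

    def on_gaze(self, gaze_object: Optional[str]) -> None:
        # Control step: reception of input follows the detected gaze object.
        self.receiving = (gaze_object == AGENT)
        # Presentation step: the personified image shows whether input is received.
        self.agent_expression = "listening" if self.receiving else "idle"

    def on_media_input(self, utterance: str) -> bool:
        # Input step: media arrives; it is kept only if reception is enabled.
        if self.receiving:
            self.accepted.append(utterance)
            return True
        return False


ui = GazeGatedInterface()
ui.on_gaze("colleague")                 # the user is talking to a neighbor
ui.on_media_input("let's get lunch")    # ignored: gaze is not on the agent
ui.on_gaze(AGENT)                       # the user looks at the personified image
ui.on_media_input("open the report")    # accepted
print(ui.accepted)
```

In this sketch, speech directed at a neighboring person is discarded without any button push or menu selection, which corresponds to the problem described in the background.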
Further in accordance with the present invention, there is also provided a multi-modal interface apparatus, comprising: image input means for inputting a user's image; recognition means for extracting a gesture input from the user's image, and for recognizing the gesture input as the user's action status information; control means for determining at least one of looks and action of a personified image based on the user's action status information, and for generating the personified image including the at least one of looks and action; and personified image presentation means for presenting the personified image to the user as feedback information.
Further in accordance with the present invention, there is also provided a multi-modal interface method, comprising the steps of: inputting a user's image; extracting a gesture input from the user's image; recognizing the gesture input as the user's action status information; determining at least one of looks and action of a personified image based on the user's action status information; generating the personified image including the at least one of looks and action; and presenting the personified image to the user as feedback information.
Further in accordance with the present invention, there is also provided a computer readable memory containing computer readable instructions, comprising: instruction means for causing a computer to input a user's image; instruction means for causing a computer to extract a gesture input from the user's image; instruction means for causing a computer to recognize the gesture input as the user's action status information; instruction means for causing a computer to determine at least one of looks and action of a personified image based on the user's action status information; instruction means for causing a computer to generate the personified image including the at least one of looks and action; and instruction means for causing a computer to present the personified image to the user as feedback information.
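As an illustration of the gesture feedback recited above, the following Python sketch maps the user's action status to the looks and action of the personified image, so that the user can see how the gesture is being interpreted while it is still in progress. The status names, the `FEEDBACK` table, and the function `recognize`, which takes already-extracted cues instead of a camera image, are all assumptions made for this example and are not taken from the specification.

```python
from enum import Enum
from typing import Dict


class ActionStatus(Enum):
    NONE = "none"              # no gesture candidate in the image
    STARTED = "started"        # a gesture appears to have begun
    RECOGNIZED = "recognized"  # the gesture was recognized
    FAILED = "failed"          # analysis ended without a recognized gesture


# Hypothetical mapping from action status to (looks, action) of the personified image.
FEEDBACK = {
    ActionStatus.NONE: ("neutral", "stand"),
    ActionStatus.STARTED: ("attentive", "follow_hand_with_eyes"),
    ActionStatus.RECOGNIZED: ("smile", "nod"),
    ActionStatus.FAILED: ("puzzled", "tilt_head"),
}


def recognize(hand_visible: bool, hand_moving: bool, matched: bool) -> ActionStatus:
    """Stand-in for image analysis: classify one frame from already-extracted cues."""
    if not hand_visible:
        return ActionStatus.NONE
    if hand_moving:
        return ActionStatus.STARTED  # the gesture is still in progress
    return ActionStatus.RECOGNIZED if matched else ActionStatus.FAILED


def personified_image(status: ActionStatus) -> Dict[str, str]:
    """Determine the looks and action to present as feedback for this status."""
    looks, action = FEEDBACK[status]
    return {"looks": looks, "action": action}


# Three successive frames: no hand, hand moving, hand at rest with a matched gesture.
frames = [(False, False, False), (True, True, False), (True, False, True)]
for frame in frames:
    print(personified_image(recognize(*frame)))
```

Because feedback is produced for every frame, including the "started" state, the user does not have to wait until analysis is complete to learn whether the system has noticed the gesture.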
Further in accordance with the present invention, there is also provided a multi-modal interface apparatus, comprising: image input means for inputting a user's face image during the user's operation of a predetermined object on a display; face image processing means for extracting a feature from the user's face image; recognition decision means for storing the feature of the user's face image as a dictionary pattern and the user's operation of the predetermined object as an event, and for recognizing a user's face image newly inputted through said image input means by referring to the dictionary pattern; and object control means for controlling the event of the predetermined object on the display based on a recognition result of said recognition decision means.
Further in accordance with the present invention, there is also provided a multi-modal interface method, comprising the steps of: inputting a user's face image during the user's operation of a predetermined object on a display; extracting a feature from the user's face image; storing the feature of the user's face image as a dictionary pattern and the user's operation of the predetermined object as an event; recognizing a user's face image newly inputted by referring to the dictionary pattern; and controlling the event of the predetermined object on the display based on a recognition result at the recognizing step.
Further in accordance with the present invention, there is also provided a computer readable memory containing computer readable instructions, comprising: instruction means for causing a computer to input a user's face image during the user's operation of a predetermined object on a display; instruction means for causing a computer to extract a feature from the user's face image; instruction means for causing a computer to store the feature of the user's face image as a dictionary pattern and the user's operation of the predetermined object as an event; instruction means for causing a computer to recognize a user's face image newly inputted by referring to the dictionary pattern; and instruction means for causing a computer to control the event of the predetermined object on the display based on a recognition result at the recognizing step.
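The storing and recognizing steps recited above can be sketched in Python as follows. This is only an illustration under stated assumptions: the face feature is represented as a short list of numbers, recognition is done by nearest-neighbor matching with cosine similarity and a threshold, and an event is simply the name of the window to bring into focus. The specification does not prescribe this matching technique; the class `RecognitionDecision`, the function `cosine`, the threshold value 0.9, and the window names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RecognitionDecision:
    """Stores (dictionary pattern, event) pairs and matches new face features."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed minimum similarity for a match
        self.dictionary: List[Tuple[List[float], str]] = []

    def store(self, feature: Sequence[float], event: str) -> None:
        # Registration: the feature observed during the user's operation is stored
        # as a dictionary pattern together with the operation as an event.
        self.dictionary.append((list(feature), event))

    def recognize(self, feature: Sequence[float]) -> Optional[str]:
        # Recognition: return the event of the most similar pattern, or None
        # when no pattern reaches the threshold.
        best_event, best_score = None, self.threshold
        for pattern, event in self.dictionary:
            score = cosine(feature, pattern)
            if score >= best_score:
                best_event, best_score = event, score
        return best_event


decision = RecognitionDecision()
decision.store([0.9, 0.1, 0.0], "window_A")   # face feature while operating window A
decision.store([0.1, 0.9, 0.0], "window_B")   # face feature while operating window B

display = {"focused": None}
event = decision.recognize([0.8, 0.2, 0.0])   # newly inputted face feature
if event is not None:
    display["focused"] = event                # object control: focus the matched window
print(display)
```

A newly inputted feature that resembles none of the stored patterns returns `None`, in which case the sketch leaves the display unchanged.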