Recently, in computer systems such as personal computers, multi-media information such as sound and images can be input and output, in addition to input by a keyboard or a mouse and output of characters and images on a display. In this situation, with the development of natural language analysis, speech recognition, and speech synthesis techniques, a speech interaction system that interacts with the user through speech input/output is required. Furthermore, in addition to speech input/output, visual information about the user can be input by a camera. By using various kinds of input/output devices, such as a touch panel, a pen, a tablet, a data glove, a foot switch, a head-mounted display, and a force display, a multi-modal interface system that interacts with the user is required.
In short, by using a multi-modal interface system including various kinds of input/output devices, the user can interact with the computer system naturally. The multi-modal interface system is an effective means of realizing a natural, useful human interface for the user.
Concretely speaking, in a dialogue between two persons, communication is not carried out through one medium (for example, speech) alone. A person naturally uses non-verbal messages such as gestures and facial expressions during interaction. Therefore, in order to realize a natural, useful human interface, realization of interaction using such non-verbal messages through various kinds of input/output media, such as the touch panel, the pen, the tablet, the data glove, the foot switch, the head-mounted display, and the force display, in addition to speech input/output, is expected.
However, the analysis accuracy of input from each medium is low, and the characteristics of the input/output media are not sufficiently clear. Therefore, a multi-modal interface apparatus that effectively uses a plurality of input/output media and reduces the user's load has not yet been realized. For example, in speech recognition processing, recognition errors are caused by ambient noise accompanying the speech input. In gesture recognition processing, the signal that the user intends as an input message must be detected from among the signals continuously arriving through the input medium, and detection errors occur. As a result, erroneous operations occur and the user's load increases.
The user inputs speech or gestures to the multi-modal interface apparatus to express his intention. However, the user often speaks or gestures to another person near the interface apparatus. In this case, the interface apparatus mistakenly recognizes the user's speech or gesture as an input signal to the apparatus, and an erroneous operation of the interface apparatus occurs. The user must then cancel the erroneous operation. In order to avoid this situation, the user must continually pay attention to the input state of the interface apparatus.
Even in the above case, where recognition of the input signal is not necessary, the multi-modal interface apparatus continuously recognizes the input signal. Therefore, the processing speed of other services and the use efficiency fall because of the processing load of the recognition. In order to solve this problem, the user often performs a special operation to switch the recognition mode, i.e., a button push or a menu selection. However, this special operation is not necessary in conversation between humans and is an unnatural interface operation. In this case, the user's load also increases. For example, in the case of selecting the speech input mode by a button operation, the merit of the speech medium is not given full play. In short, speech input originally allows the user to communicate by his mouth alone. Even if the user is working with his hands, his work is not disturbed while he speaks to the interface apparatus. However, if the user must operate a button to select the speech input mode, this original merit of the speech medium is lost.
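The explicit recognition-mode switch described above can be sketched roughly as follows. This is a minimal illustration, not any particular prior-art implementation; the class and method names are assumptions.

```python
class SpeechRecognizer:
    """Stand-in for a real speech recognition engine."""
    def recognize(self, frame):
        # A real engine would decode the audio frame here.
        return f"recognized:{frame}"

class PushToTalkInterface:
    """Runs recognition only while an explicit mode button is held,
    so signals not intended for the apparatus are ignored."""
    def __init__(self, recognizer):
        self.recognizer = recognizer
        self.mode_on = False  # recognition-mode flag set by the button

    def press_button(self):
        self.mode_on = True

    def release_button(self):
        self.mode_on = False

    def on_audio_frame(self, frame):
        # Frames arriving outside the mode are discarded, which avoids
        # erroneous operations but forces the button operation on the
        # user -- the very load the text criticizes.
        if not self.mode_on:
            return None
        return self.recognizer.recognize(frame)

ui = PushToTalkInterface(SpeechRecognizer())
print(ui.on_audio_frame("hello"))  # None: mode off, frame ignored
ui.press_button()
print(ui.on_audio_frame("hello"))  # processed while the button is held
ui.release_button()
```

The sketch makes the trade-off concrete: recognition load exists only inside the mode, but the user's hands are occupied by the button, defeating the hands-free merit of speech.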
In communication between humans, non-verbal messages such as eye contact, gaze position, gesture, and facial expression are very important for smooth communication. However, in the multi-modal interface apparatus of the prior art, such non-verbal messages are not used at all. For example, in the multi-modal interface apparatus, output information such as a dynamic image or characters extending over a plurality of screens changes continuously on the display. In this case, if the user does not pay attention to the display, he cannot receive all or part of the presented information. When changing information is presented to the user, only a predetermined volume of information, which the user can receive at one time, is presented; when the user performs an update operation, the next volume of information is presented. However, in this case, the user's load also increases because of the confirmation operation. The user is often puzzled by the operation, and the operational efficiency of the system falls.
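The chunked presentation with an explicit update operation described above can be sketched as follows; this is an illustrative assumption of how such a presenter might work, with hypothetical names, not a description of a specific prior-art system.

```python
class PagedPresenter:
    """Presents changing information one fixed-size chunk at a time,
    advancing only when the user performs the update operation."""
    def __init__(self, items, page_size):
        self.items = items
        self.page_size = page_size
        self.pos = 0

    def current_page(self):
        # Only the predetermined volume the user can receive at one time.
        return self.items[self.pos:self.pos + self.page_size]

    def update(self):
        # The user's explicit confirmation operation: advance to the
        # next chunk. This extra step is the added load the text notes.
        if self.pos + self.page_size < len(self.items):
            self.pos += self.page_size
        return self.current_page()

p = PagedPresenter(["line1", "line2", "line3", "line4", "line5"], page_size=2)
print(p.current_page())  # ['line1', 'line2']
print(p.update())        # ['line3', 'line4']
print(p.update())        # ['line5']
```

Every chunk after the first costs the user one confirmation operation, which is why the text counts this scheme against the prior art rather than for it.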
As for another problem of the multi-modal interface apparatus of the prior art, processing of touch sensor input, image input, and distance sensor input is explained. In the case where the user inputs a pointing gesture through the touch sensor, the pointed object is identified from the output information of the touch sensor, i.e., coordinate data, time-series data, input pressure data, or input time interval. In the case of image input, for example, an image of the user's hand is input using one or more cameras. The shape or the action of the hand is analyzed by a method such as that disclosed in "Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface" (R. Cipolla et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Applications, pp. 163-166, 1994). In short, the user can indicate an object in the real world or on the display by the gesture input. In the case of a distance sensor using an infrared ray, the position, the shape, and the action of the user's hand are analyzed and recognized. Therefore, in the same way as with image input, the user can indicate the object by the gesture input.
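Identification of a pointed object from the touch sensor's coordinate output can be sketched as a simple hit test. The rectangle representation and all names below are assumptions for illustration, not the method of any cited system.

```python
from typing import List, Optional, Tuple

# Each selectable object is represented by a hypothetical bounding
# rectangle: (name, x_min, y_min, x_max, y_max) in sensor coordinates.
Rect = Tuple[str, float, float, float, float]

def identify_pointed_object(x: float, y: float,
                            objects: List[Rect]) -> Optional[str]:
    """Return the name of the object whose rectangle contains the
    coordinate reported by the touch sensor, if any."""
    for name, x0, y0, x1, y1 in objects:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None  # the touch fell outside every selectable object

screen = [("icon_a", 0, 0, 50, 50), ("icon_b", 60, 0, 110, 50)]
print(identify_pointed_object(25, 25, screen))    # icon_a
print(identify_pointed_object(70, 10, screen))    # icon_b
print(identify_pointed_object(200, 200, screen))  # None
```

A real touch-sensor pipeline would also use the time-series, pressure, and interval data mentioned above (for example, to separate a tap from a drag), but the coordinate hit test is the core of pointed-object identification.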
However, in the interface method of detecting the shape, position, or movement of the user's hand using the camera, a sufficiently fine degree of detection is not obtained. In short, a gesture that the user intends to input is not correctly recognized, or the shape or movement of the hand that the user does not intend as input is mistakenly extracted as a gesture. As a result, an erroneous activation caused by the erroneous recognition must be corrected. In the case where the gesture that the user intends is not correctly input to the system, the user must input the gesture again, and the user's load greatly increases.
The user's gesture input is first accepted by the system only when the gesture has been completely analyzed. Therefore, while the user is performing the gesture, he cannot know whether the system is correctly recognizing the gesture input. For example, when the start timing of the gesture is not correctly detected by the system, erroneous recognition is executed and the user must input the gesture to the system again. Conversely, even when the user is not intentionally inputting a gesture, the system may mistakenly recognize the user's unconscious action as a gesture and execute an erroneous activation. In this case, the user must correct the effect of the erroneous activation.
Furthermore, in the gesture recognition method using an input device of the touch type, such as the touch sensor, the user indicates a part of the input device itself. Therefore, he cannot input a pointing gesture to indicate an object in the real world other than the input device. On the other hand, in the recognition method of pointing gesture input using a non-touch system, such as the camera or the infrared ray sensor, the user can indicate an object or a place in the real world. However, the system cannot properly inform the user of the indicated object or place while he is inputting the gesture pointing to it.