1. Field of the Invention
The present invention relates to a human-machine interaction apparatus adaptable to a data processing apparatus and having a plurality of input devices and output devices that can be combined with one another for use.
2. Description of the Related Art
In recent years, computers have been provided with a plurality of different types of input devices, such as a keyboard, a mouse, a microphone, a touch panel, an image scanner, a video camera, a pen, a data glove and a human detection sensor, so that various information items can be input in various forms. Output devices, such as a display unit and a speaker unit, have likewise been provided with functions capable of outputting various information items in a variety of forms, such as spoken language, effect sound and music. Thus, there is a need for a human-machine interface which can easily be operated by making effective use of this variety of input and output devices.
In recent years, multimodal interfaces have been actively researched and developed. Such an interface includes an input means which is formed by combining a plurality of input devices and which is capable of accepting a complicated input, that is, input media; for example, a user issues a voice command while pointing at an object on a display with a finger. The interface also includes an output means which uses a plurality of output devices, such as a display unit and a speaker unit, and which is capable of performing a complicated output formed by combining the output devices and the contents and forms of their outputs, that is, output media; for example, a nuance is communicated by the facial expression of a human being shown on the display unit and by effect sound from the speaker unit while a message is announced in spoken language from the speaker unit. Such interfaces are intended to be easy to operate and to improve the quality and efficiency of information transmission.
Hitherto, in order to realize smooth and natural communication of information between a user and application software on a computer, the communication of the information has been regarded as interaction between the user and the application software. An interaction plan has been developed in accordance with a predetermined, stored interaction rule, and multimodal interaction has been realized by means of the combination of input and output methods determined by that interaction rule.
However, a method in which the media allocation, that is, the combination of devices used for input and output in each stage of the interaction and the manner in which those devices are used, is described in advance in the interaction rule sometimes encounters a problem. For example, where voice is determined as the input and output form, voice is unsuitable as an input and output means if the ambient noise level is too high, yet the method cannot cope with this situation. That is, because the media allocation is fixed regardless of the flow of the interaction, a combination of input and output means appropriate to the situation cannot be selected for the interaction with the user.
A case will now be considered in which an interaction apparatus permitting a user to perform voice input is operated. Since voice recognition techniques are not yet satisfactory, recognition may fail, depending upon the characteristics of the user, even if the user repeatedly utters the same word. This frequently occurs if the user has a strong accent, in which case the probability that the interaction apparatus succeeds in recognition is greatly lowered.
The conventional multimodal interaction apparatus, however, repeatedly requires such a user to perform the voice input again, so that time is wasted and the user feels stress.
A tourist guide system employing voice instruction will now be considered, in which a voice-input place name is recognized, the required tourist resort is retrieved from a database, and the result of the retrieval is output to an output means. Suppose that such a system in Scotland requests a user to voice-input a place name, and that an American tourist who does not know the dialect pronunciation "lax ness or lok ness" of Loch Ness, famous for Nessie, pronounces it "lak ness or lok ness", so that the retrieval fails. In this case, the conventional multimodal interaction apparatus, which has no means capable of dynamically allocating input and output media, has difficulty in restoring the communication with the user.
That is, the communication with the user fails because the user does not know that the correct reading of Loch Ness is "lax ness or lok ness". Although the communication would probably be restored if the system were switched to, for example, a character input selection mode, the input and output media allocation cannot be changed to suit the situation, and the system therefore comes to a standstill. As a result, the guidance service cannot be provided.
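The kind of switch described above, from voice input to a character input selection mode after recognition has failed, can be illustrated with a minimal sketch. It is only an illustration under stated assumptions and is not taken from any apparatus described here: the class `InputAllocator`, the medium names `"voice"` and `"character_selection"`, and the failure threshold of two attempts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InputAllocator:
    """Hypothetical helper that reallocates the input medium after
    repeated voice recognition failures (all names are assumptions)."""

    max_voice_failures: int = 2  # assumed threshold, not from the source
    medium: str = "voice"        # current input medium
    failures: int = 0            # consecutive voice recognition failures

    def report(self, recognized: bool) -> str:
        """Record one recognition attempt and return the medium to use next."""
        if recognized:
            # A successful attempt resets the failure count.
            self.failures = 0
        elif self.medium == "voice":
            self.failures += 1
            if self.failures >= self.max_voice_failures:
                # Stop asking for the same utterance; offer selection instead.
                self.medium = "character_selection"
        return self.medium


allocator = InputAllocator()
first = allocator.report(recognized=False)   # still "voice"
second = allocator.report(recognized=False)  # switches to "character_selection"
```

With a fixed media allocation, by contrast, the second failed attempt would simply lead to another request for voice input.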
When the conventional interaction apparatus presents the result of a database retrieval to a user, the output form is fixed as programmed in advance. That is, one fixed output form is employed regardless of the number of retrieval results. Consequently, the output is sometimes difficult for the user to understand and operate; for example, tens to hundreds of retrieval results are read out one by one by voice, or only a few retrieval results are displayed in the form of a table.
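The dependence of a suitable output form on the number of retrieval results can be sketched as follows. This is a hedged illustration only: the function `choose_output_form`, the form names `"speech"` and `"table"`, and the limit of three results are assumptions introduced here, not values given in the text.

```python
def choose_output_form(result_count: int, speech_limit: int = 3) -> str:
    """Pick an output form from the number of retrieval results.

    Hypothetical rule: a few results are read out by voice, while a
    larger set is shown as a table. The limit is an assumed value.
    """
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    return "speech" if result_count <= speech_limit else "table"


few = choose_output_form(2)     # small result set
many = choose_output_form(150)  # tens to hundreds of results
```

A fixed output form corresponds to always returning the same value from such a function, whatever `result_count` is.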
Moreover, since the system cannot conduct interaction with a user by means of the media allocation that the user desires, the user is forced to perform input and output through the media allocated and instructed by the system.
Although a variety of input means are available, no single input means exists that can easily be operated by all users. For example, one user prefers voice input, while another prefers to input a command character string by using a keyboard.
As described above, the optimum input and output means differ among users. However, the conventional interaction apparatus has not been structured to adapt to such differences among individuals. A fixed combination of input and output means is provided, and the user cannot use the input and output means that the user desires. Therefore, the conventional interaction apparatus cannot easily be operated.
With the conventional multimodal interaction apparatus having a plurality of input means, a user cannot easily recognize which input means is to be employed and which input means can be employed at a given point of time. Thus, the user feels stress or is perplexed when attempting to perform an input.
Where voice cannot suitably be used with a conventional interaction apparatus capable of voice input and output, for example because of intense external noise, the conventional interaction apparatus, which cannot perform dynamic media allocation, cannot change the input and output methods so as to adapt to changes in environmental factors.
As described above, with the conventional multimodal interaction apparatus, a user is forced to conduct interaction with the system in accordance with the combination of input and output modes determined in advance by the interaction rule set in the system. Thus, the interface of the conventional interaction apparatus has been difficult for a user to understand and operate, so that the user feels stress when inputting information. Moreover, the user cannot easily understand the output from the apparatus. Thus, the interface sometimes causes a failure in input and output. Furthermore, when input or output fails because of a characteristic of the specific input and output means selected, for example when an input fails or is erroneous because of a recognition failure in the voice word recognition mode, the purpose of the interaction cannot be accomplished.