1. Field of the Invention
The present invention relates to an interactive apparatus and a method for interacting with a user through a plurality of input/output units usable in combination, and to a computer program product for executing the method.
2. Description of the Related Art
Recently, with the popularization of video recording-reproducing apparatuses such as hard disk recorders and multi-media personal computers, and with the increase in their memory capacity, a new television-viewing style has emerged in which many broadcast programs are recorded and a preferred program is viewed after its broadcast has ended, according to the user's preference.
Furthermore, digitalization of television broadcasting increases the number of programs available to viewers, and, as the memory capacity of video recording apparatuses also increases, it can be time-consuming to find just the program to be viewed among the vast number of television programs recorded on the video recording apparatus.
Currently, the human-machine interface generally used for televisions and video recording-reproducing apparatuses is one operated with the ten-key pad and cursor keys of a remote controller. For example, when recording of a television program is reserved or a recorded program is searched for by using the remote controller, items need to be specified and selected one by one from a menu or a character list displayed on a television screen (hereinafter, “TV screen”). Likewise, when a keyword for a program search is to be input, the keyword needs to be input by selecting its characters one by one from the displayed character list, which is a time-consuming operation.
Further, televisions having a function for accessing the Internet have already been commercialized. With such televisions, a user can access and browse websites on the Internet via the television. Generally, this type of television also uses a remote controller as its interface. In this case, if the television is used only for browsing a website by clicking links on the website, the operation is simple and there is no particular problem. However, when a keyword must be input to search for a desired website, the same problem arises as in the search for a television program.
Further, an operation interface that uses a remote controller via the TV screen assumes that a menu or the like is displayed on the TV screen, and therefore the operation cannot be performed from a remote place where the screen cannot be seen directly, or in a situation in which the user is busy with something else.
For example, when a user cooks along with a recorded cooking program while the program is being reproduced, the user frequently needs to rewind to a missed scene. However, the user may be unable to free the hands during cooking, or the hands may not be clean, so that the user cannot operate a remote controller by hand or doing so poses a sanitary problem.
On the other hand, televisions are also commercially available that have a function for recording a currently displayed program for a certain period of time and temporarily stopping the displayed program according to an instruction given by a user with a remote controller, or a function for rewinding to a necessary scene. When a user of this type of television is viewing a weather forecast during a busy time in the morning, the user can miss the scene of, or fail to hear, the forecast for the local area of concern, because the user cannot use the remote controller. Further, in a situation in which the user cannot free the hands, for example while changing clothes, the user may have to stop watching the screen because taking hold of the remote controller to instruct rewinding is time-consuming.
To solve these problems, a human-machine interface based on speech input is desired in place of the conventional interface using a remote controller. Therefore, multimodal interface technology based on speech input has been studied. By providing electronic information equipment such as a television with a speech recognition function, the multimodal interface technology enables the equipment to be operated by speech even when its user cannot use the hands.
In an interface using speech input, the instructions that are input are expected to vary from user to user, unlike the case in which an operation is selected with the remote controller from a fixed menu or the like. Therefore, natural interaction needs to be realized by recognizing the input speech accurately and returning an appropriate response.
Japanese Patent No. 3729918 proposes a technique for switching output media according to the situation of an interaction plan in a multimodal interactive apparatus. For example, in Japanese Patent No. 3729918, when speech recognition has failed because of an erroneous utterance due to the user's misunderstanding or unregistered words, the multimodal interactive apparatus detects the failure, switches to a medium other than speech, such as pen input, and prompts the user for input. Accordingly, interruption of the interaction can be avoided and smooth interaction can be realized.
JP-A 2005-202076 (KOKAI) discloses a technique for smoothing interaction according to the distance between a user and a system. Specifically, in JP-A 2005-202076 (KOKAI), when there is a considerable distance between a robot and a user, the volume of the robot's speech is increased, because there is a high possibility that the user cannot hear the speech of the robot.
However, the methods disclosed in Japanese Patent No. 3729918 and JP-A 2005-202076 (KOKAI) do not take into consideration whether the user is watching the apparatus to be operated, and therefore there are cases in which an appropriate response cannot be returned to the user.
For example, when the user instructs a video recording apparatus to rewind, if the user is watching the TV screen, the user can understand the program content, including images and speech, simply by having the program reproduced after the rewinding is completed. However, because a television program is produced for viewers who watch the TV screen, a user who is in a place where the TV screen cannot be seen may not be able to understand the program content merely from the program being presented as it is.
For example, it is assumed that there is a scene of cutting potatoes in a cooking program, and that the user has missed the scene and asks the system the question “how should I cut the potatoes”. In this scene, there is a possibility that only images are shown without any particular oral explanation, and that a caption “Cut it into dices” is presented. In this case, if the user is watching the TV screen carefully, presenting only the images of the scene of cutting potatoes is sufficient. However, because there is no oral explanation, if the user is not watching the TV screen carefully, merely reproducing the scene is not sufficient as a response to the question.
In particular, when speech input is used as the interface, the above problems are highly likely to occur, because the user is considered to input operation instructions frequently without watching the TV screen carefully.