The present disclosure relates to an information processing apparatus, an information processing method, and a program, and more particularly to an information processing apparatus, an information processing method, and a program that execute various processes on the basis of utterances of a user or the like.
When using a personal computer (PC), a television set, a recording/playback device, or another household electric appliance, a user operates an input unit provided on the apparatus, or a remote control, in order to cause the apparatus to execute a desired process. For example, when a PC is used, a keyboard and a mouse are typically used as input devices. In the case of a television set, a recording/playback device, or the like, a remote control is used to cause the apparatus to execute various processes, such as switching the channel and selecting content to be played back.
Various studies have been conducted on systems in which a user can issue instructions to various apparatuses through utterances or movements. More specifically, such systems include one in which an utterance of a user is recognized by using a speech recognition process and one in which an action or a gesture of a user is recognized by using an image recognition process.
An interface that communicates with a user through various communication modes, such as speech recognition and image recognition, in addition to general input devices such as a remote control, a keyboard, or a mouse, is called a “multimodal interface”. An example of the related art in which a multimodal interface is disclosed is U.S. Pat. No. 6,988,072.
However, a speech recognition apparatus and an image recognition apparatus used with a multimodal interface or the like are limited in their processing capabilities, and can understand only a limited number of types of utterances and movements of a user. Therefore, there are currently many situations in which the system does not accurately understand the intention of a user. In a system adopting speech recognition, in particular, increasing the number of types of commands that a user can speak allows the user to experience more natural interaction, but the user may sometimes have difficulty deciding what to say next, because it is difficult for him/her to know which commands the system can accept.