1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a computer program. More particularly, the present invention relates to an information processing apparatus that receives information from the outside, for example, an image and sound, and analyzes the external environment based on the input information, specifically, analyzes the position, identity, and the like of a person who is uttering words; to an information processing method for executing the analysis processing in the information processing apparatus; and to a computer program for causing the information processing apparatus to execute the analysis processing.
2. Description of the Related Art
A system that performs processing between a person and an information processing apparatus such as a PC or a robot, for example, communication and interactive processing, is called a man-machine interaction system. In the man-machine interaction system, the information processing apparatus such as the PC or the robot receives image information or sound information and performs an analysis based on the input information in order to recognize actions of the person, for example, motions and words of the person.
When a person communicates information, the person uses not only words but also various other channels, such as a look and a facial expression, as information communication channels. If a machine can analyze all such channels, communication between the person and the machine can reach the same level as communication among people. An interface that analyzes input information from such plural channels (also referred to as modalities or modals) is called a multi-modal interface, and such interfaces have been actively developed and researched in recent years.
For example, when image information photographed by a camera and sound information acquired by a microphone are inputted and analyzed, it is effective, for a more detailed analysis, to input a large amount of information from plural cameras and plural microphones set at various points.
As a specific example, the system described below is assumed. It is possible to realize a system in which an information processing apparatus (a television) receives an image and sound of users (a father, a mother, a sister, and a brother) in front of the television via a camera and a microphone, analyzes, for example, the positions of the respective users and which of the users uttered words, and performs processing corresponding to the analysis information, for example, zooming the camera in on the user who spoke or responding accurately to the user who spoke.
Most general man-machine interaction systems in the past perform processing for deterministically integrating information from plural channels (modals) and determining where the respective plural users are present, who the users are, and who uttered a signal. Examples of related art that discloses such a system include JP-A-2005-271137 and JP-A-2002-264051.
However, the method used in past systems, which deterministically integrates uncertain and asynchronous data inputted from a microphone and a camera, lacks robustness and yields only data of low accuracy. In an actual system, the sensor information that can be acquired in an actual environment, i.e., an input image from a camera and sound information inputted from a microphone, is uncertain data that includes various extra information, for example, noise and unnecessary information. When an image analysis and a sound analysis are performed, processing for efficiently integrating the effective information from such sensor information is important.
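The lack of robustness noted above can be illustrated with a minimal Python sketch. It assumes one simple form of deterministic integration, in which an utterance is assigned to whichever detected face lies nearest to the estimated sound direction. The names FaceObservation, SoundObservation, and deterministic_speaker, the nearest-face rule, and all numeric values are hypothetical and are not taken from the cited references.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaceObservation:
    user_id: str      # identity reported by a hypothetical face recognizer
    position: float   # horizontal position of the face, arbitrary units


@dataclass
class SoundObservation:
    direction: float  # estimated sound source position on the same axis


def deterministic_speaker(faces: List[FaceObservation],
                          sound: SoundObservation) -> Optional[str]:
    """Hard decision (assumed rule): the face nearest to the sound is the speaker."""
    if not faces:
        return None
    nearest = min(faces, key=lambda f: abs(f.position - sound.direction))
    return nearest.user_id


faces = [FaceObservation("father", 1.0), FaceObservation("mother", 2.0)]

# The same utterance by the father, observed with small and with larger error.
clean = SoundObservation(direction=1.1)
noisy = SoundObservation(direction=1.6)

print(deterministic_speaker(faces, clean))  # father
print(deterministic_speaker(faces, noisy))  # mother
```

In this sketch, a moderate error in the sound direction estimate flips the result from one user to another, and the hard decision keeps no record of how uncertain either observation was.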