The present disclosure relates to an information processing apparatus, an information processing method, and a program, and more particularly, to an information processing apparatus, an information processing method, and a program which receive input information from the outside world, for example, images and voices, and analyze the external environment based on that input information, specifically, the position of a person who is speaking and the like.
The present disclosure relates to an information processing apparatus, an information processing method, and a program which identify a user who is speaking and analyze each utterance when a plurality of persons are speaking simultaneously.
A system that performs an interactive process, for example, a communication process, between a person and an information processing apparatus such as a PC or a robot is referred to as a man-machine interaction system. In a man-machine interaction system, the information processing apparatus such as a PC, a robot, or the like receives image information or voice information as input and analyzes it in order to recognize human actions such as behavior or words.
When a person transmits information, in addition to words, various channels for gestures, line of sight, facial expressions, and the like are used as information transmission channels. If it is possible to analyze all of these channels in a machine, even communication between people and machines may reach the same level as that of communication between people. An interface capable of analyzing input information from these multiple channels (also referred to as modality or modal) is called a multi-modal interface, and development and research into such an interface have been conducted extensively in recent years.
For example, when analysis is performed on image information captured by a camera and sound information obtained by a microphone, it is effective, in order to perform a more detailed analysis, to input a large amount of information from a plurality of cameras and a plurality of microphones positioned at various points.
As a specific system, for example, the following system is assumed. An information processing apparatus (a television) receives, via a camera and a microphone, images and voices of users (father, mother, sister, and brother) in front of the television, and analyzes the position of each of the users, which user is speaking, and the like. A system may thereby be realized that performs processes according to the analysis information, such as zooming the camera in on a user who has spoken or making an adequate response to the user who has spoken.
Examples of the related art in which an existing man-machine interaction system is disclosed include Japanese Unexamined Patent Application Publication No. 2009-31951 and Japanese Unexamined Patent Application Publication No. 2009-140366. In the related art, information from multiple channels (modalities) is integrated in a probabilistic manner to determine, for each of a plurality of users, the position of the user, the identity of the user, and whether the user is issuing a signal, that is, who is speaking.
For example, when determining who is issuing the signals, virtual targets (tID=1 to m) corresponding to the plurality of users are set, and a probability that each of the targets is an utterance source is calculated from analysis results of image data captured by a camera or sound information obtained by a microphone.
Specifically, for example, the following quantities are calculated:
(a) sound source direction information of a voice event obtainable via the microphone, user position information obtainable from utterer identification (ID) information, and an utterance source probability P(tID) of a target tID obtainable from only the user identification information, and
(b) an area SΔt(tID) of a face attribute score [S(tID)] obtainable by a face recognition process based on images obtainable via a camera,
From (a) and (b), an utterer probability Ps(tID) or Pp(tID) of each of the targets (tID=1 to m) is then calculated by weighted addition or weighted multiplication, where α is a preset allocation weight coefficient.
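The combination of (a) and (b) can be illustrated with a minimal sketch. The exact formulas of the related art are not reproduced here; the sketch assumes that the additive form is a convex combination (1 − α)·P(tID) + α·S(tID), that the multiplicative form is the weighted product P(tID)^(1 − α)·S(tID)^α, and that each result is normalized over all targets. The function names, argument names, and numerical values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def _normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def _check(p_source: Sequence[float], s_face: Sequence[float], alpha: float) -> None:
    if len(p_source) != len(s_face):
        raise ValueError("one value per target is required in both inputs")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")


def utterer_prob_additive(p_source: Sequence[float],
                          s_face: Sequence[float],
                          alpha: float) -> List[float]:
    """Assumed additive form: (1 - alpha) * P(tID) + alpha * S(tID), normalized."""
    _check(p_source, s_face, alpha)
    return _normalize([(1.0 - alpha) * p + alpha * s
                       for p, s in zip(p_source, s_face)])


def utterer_prob_multiplicative(p_source: Sequence[float],
                                s_face: Sequence[float],
                                alpha: float) -> List[float]:
    """Assumed multiplicative form: P(tID)**(1 - alpha) * S(tID)**alpha, normalized."""
    _check(p_source, s_face, alpha)
    return _normalize([(p ** (1.0 - alpha)) * (s ** alpha)
                       for p, s in zip(p_source, s_face)])


# Hypothetical values for three targets (tID = 1 to 3).
p_source = [0.6, 0.3, 0.1]  # (a) utterance source probability P(tID)
s_face = [0.2, 0.7, 0.1]    # (b) face attribute score area, assumed pre-scaled

for alpha in (0.2, 0.8):
    ps = utterer_prob_additive(p_source, s_face, alpha)
    pp = utterer_prob_multiplicative(p_source, s_face, alpha)
    print(alpha, ps, pp)
```

With these hypothetical inputs, the target having the highest additive utterer probability changes from the first target at α = 0.2 to the second target at α = 0.8, so the outcome depends directly on the value chosen for α.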
The details of this process are described in, for example, Japanese Unexamined Patent Application Publication No. 2009-140366.
In the calculation of the utterer probability in the related art described above, the weight coefficient α must be adjusted beforehand. Adjusting the weight coefficient beforehand is cumbersome, and when the weight coefficient is not adjusted to a suitable numerical value, the validity of the calculated utterer probability is greatly impaired.