1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a program, and more particularly, to an information processing apparatus, an information processing method, and a program, which make it possible for a robot or the like to more properly generate information needed to actively change actions of the robot to adapt the actions to stimuli applied to the robot from the outside.
2. Description of the Related Art
A robot expected to communicate with a human user by voice is required to have a phoneme structure similar to that of the user, so that the robot can recognize phonemes uttered by the user and can utter phonemes similar to those uttered by the user. That is, the robot needs to be capable of recognizing speech in the language spoken by the user and of uttering speech (by means of speech synthesis) in that language.
In a conventional speech recognition/synthesis technique, speech in a language used by a user is recognized or synthesized using a dictionary of phonemes or words prepared for that language.
In human societies, different phonemes and languages are used depending on nations or areas. Thus, in techniques in which speech recognition or speech synthesis is performed using dictionaries that have been prepared in advance, it is necessary to prepare different dictionaries depending on nations or areas.
However, preparing dictionaries is very costly. Thus, in the field of robots that communicate with human users by voice, there has been a need in recent years for a technique that acquires phonological structures similar to those of human users through interactions such as dialogues with users, without using dictionaries.
For example, in a paper entitled “A Constructive Model of Mother-Infant Interaction towards Infant's Vowel Articulation” (Y. Yoshikawa, J. Koga, M. Asada, and K. Hosoda, Proc. of the 3rd International Workshop on Epigenetic Robotics, pp. 139-146, 2003 (hereinafter, this paper will be referred to as Non-Patent Document 1)), there is disclosed a robot that has an articulator and an auditory organ and that is capable of organizing itself by acquiring a phonological structure identical to that used in a human society via an interaction with a caregiver.
In the robot disclosed in Non-Patent Document 1, the articulator randomly generates parameters (motor commands) and utters sounds according to the generated parameters.
A user called a caregiver listens to the sounds uttered by the robot. If the caregiver recognizes a sound as being identical to one of the phonemes used in a human society, the caregiver utters the phoneme so that the robot learns that the sound is identical to the phoneme. The learning is performed repeatedly so that the robot acquires many phonemes used in the human society.
The robot has a self-organization map associated with the auditory organ (hereinafter, referred to as the auditory SOM (Self-Organization Map)) and a self-organization map associated with the articulator (hereinafter, referred to as the articulate SOM).
Each self-organization map (SOM) has a plurality of nodes, and each node has a parameter. When input data (a parameter) is given to a self-organization map, the node having the parameter that is most similar to the input data is selected from all nodes (hereinafter, such a selected node will be referred to as a winner), and the parameter of the winner is modified so as to become more similar to the input data. In the self-organization map, parameters associated with nodes neighboring the winner node are also slightly modified toward the input data.
Thus, if a large amount of input data is given to the self-organization map, the nodes in the self-organization map are organized such that nodes with similar parameters are located close to each other and nodes with dissimilar parameters are located far from each other. As a result, a map corresponding to the patterns in the input data is formed in the self-organization map. This arranging of nodes in accordance with input data, such that nodes with similar parameters are located close to each other and a map is formed in accordance with the patterns included in the input data, is called self-organization.
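The winner selection and neighborhood update described above can be illustrated with a minimal sketch. It is not taken from Non-Patent Document 1: the one-dimensional node layout, the Euclidean distance, the Gaussian neighborhood, and the names som_step, lr, and sigma are all assumptions made for illustration.

```python
import numpy as np

def som_step(weights, x, lr=0.5, sigma=1.0):
    """Apply one update to a 1-D SOM (assumed layout) and return the winner index.

    weights: float array of shape (n_nodes, dim), modified in place.
    x:       input data, array of shape (dim,).
    """
    # Winner: the node whose parameter is most similar (closest) to the input.
    dists = np.linalg.norm(weights - x, axis=1)
    winner = int(np.argmin(dists))
    # Neighborhood weight: 1 at the winner, decaying with distance on the map.
    idx = np.arange(len(weights))
    h = np.exp(-((idx - winner) ** 2) / (2.0 * sigma ** 2))
    # The winner moves most toward the input; its neighbors move less.
    weights += lr * h[:, None] * (x - weights)
    return winner

weights = np.array([[0.0], [1.0], [2.0]])
winner = som_step(weights, np.array([0.9]))
```

Repeating this step over many inputs is what gradually places nodes with similar parameters next to each other on the map.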
In the technique disclosed in Non-Patent Document 1, the robot selects one of nodes in the articulate SOM, randomly changes the parameter of the selected node, and utters a sound according to the resultant parameter.
The caregiver listens to the sound uttered by the robot. If the caregiver recognizes the uttered sound as being identical to one of the phonemes used in a human society, the caregiver utters the phoneme so that the robot understands that the sound is identical to the phoneme. If, in response to the sound uttered by the robot, the caregiver utters the same sound, the robot accepts the sound uttered by the caregiver as input data and determines a winner node for this input data in the auditory SOM. Furthermore, the auditory SOM (parameters associated with the node of interest and neighboring nodes) is modified, and the connection strength between the node of interest in the articulate SOM and the winner node in the auditory SOM is increased.
By performing the above-described process repeatedly, the articulate SOM and the auditory SOM are gradually established such that the connection between a node of the articulate SOM and the node of the auditory SOM determined as the winner node for the sound uttered by the caregiver in response to the sound generated in accordance with the parameter of that articulate SOM node becomes stronger than the connections between other nodes. In other words, the strengthened connection is the one between the node of the articulate SOM whose parameter the robot used to generate a sound and the node of the auditory SOM determined as the winner node for the sound that the caregiver uttered as being the same as the sound generated by the robot. This makes it possible for the robot to acquire phonemes actually used in human societies and to output sounds similar to those input from the outside.
More specifically, when a voice is input to the robot from the outside, the robot searches for a node of the articulate SOM having the strongest connection with a node of the auditory SOM determined as a winner node for the input voice, and utters a sound in accordance with a parameter associated with the detected node of the articulate SOM.
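The strengthening of connections and the subsequent lookup can be sketched as follows. This is only an illustrative sketch, not the implementation of Non-Patent Document 1: the connection matrix conn, the additive update with a fixed rate, and the function names are assumptions.

```python
import numpy as np

def winner(som, x):
    """Index of the node whose parameter is closest to the input x."""
    return int(np.argmin(np.linalg.norm(som - x, axis=1)))

def strengthen(conn, art_node, aud_node, rate=0.1):
    """Increase the connection between an articulate node and an auditory node."""
    conn[art_node, aud_node] += rate

def imitate(conn, art_som, aud_som, heard):
    """Return the articulate parameter used to answer a heard sound."""
    aud_node = winner(aud_som, heard)
    # Articulate node with the strongest connection to the auditory winner.
    art_node = int(np.argmax(conn[:, aud_node]))
    return art_som[art_node]

art_som = np.array([[0.1], [0.5], [0.9]])   # hypothetical motor parameters
aud_som = np.array([[10.0], [20.0]])        # hypothetical acoustic parameters
conn = np.zeros((len(art_som), len(aud_som)))

# The robot uttered with articulate node 2; the caregiver echoed a sound
# whose auditory winner was node 1, so that connection is strengthened.
strengthen(conn, art_node=2, aud_node=1)
param = imitate(conn, art_som, aud_som, np.array([19.0]))
```

In this sketch, a later input close to auditory node 1 retrieves the parameter of articulate node 2, which corresponds to the robot uttering a sound similar to the one it heard.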
In the technique disclosed in Non-Patent Document 1, the robot performs supervised learning such that when a sound uttered by the robot is identical to one of the sounds actually used in a human society, the caregiver utters the same sound as that uttered by the robot to indicate that the sound is a right answer. In this technique, the robot cannot acquire phonemes unless the caregiver gives a right answer by uttering the same sound as that uttered by (an articulator of) the robot. In other words, it is impossible to perform unsupervised learning in which no right answers are given.
On the other hand, in a technique disclosed in “From Analogous to Digital Speech Sounds” (Oudeyer, P-Y, Tallerman M., editor, Evolutionary Pre-Requisites for Language. Oxford University Press, 2003) (hereinafter, this will be referred to as Non-Patent Document 2), learning is performed to acquire phonemes so that it becomes possible to generate phonemes from continuous sounds under as few assumptions as possible.
That is, in the learning method disclosed in Non-Patent Document 2, when there are a plurality of agents each having an auditory SOM corresponding to an auditory organ and an articulate SOM corresponding to an articulator, wherein nodes of the auditory SOM and nodes of the articulate SOM are mapped (connected) to each other, initial values of parameters of respective nodes of the articulate SOM are given such that the initial values are distributed uniformly and randomly over a parameter space (articulate space) before learning is started.
Note that before the learning is started, parameters associated with nodes of the articulate SOM are different among the plurality of agents.
In the learning, if a sound other than a sound uttered by a present agent, that is, a sound uttered by one of the other agents is input to the present agent, the present agent determines a winner node of the auditory SOM for the input sound and modifies parameters associated with nodes of the auditory SOM. The present agent then searches for a node of the articulate SOM having the strongest connection with the winner node of the auditory SOM and modifies the articulate SOM using the parameter associated with the detected node of the articulate SOM as a reference such that the parameter of each node of the articulate SOM becomes more similar to the parameter of the node of the articulate SOM having the strongest connection with the winner node of the auditory SOM.
Each agent selects a particular node of the articulate SOM possessed by the agent and utters a sound in accordance with a parameter associated with the selected node. If the same sound as that uttered by an agent is input to the agent, the agent determines a winner node of the auditory SOM for the input sound and increases the connection between the selected node of the articulate SOM and the winner node of the auditory SOM.
Via the repetition of the above process, the same set of sounds remains in each of the plurality of agents, that is, each agent acquires the same set of phonemes and all agents become capable of uttering the same set of phonemes.
Non-Patent Document 2 also discloses that via the above-described learning, phonemes acquired by a plurality of agents converge on some phonemes.
Although the learning according to the technique disclosed in Non-Patent Document 2 is performed in the unsupervised learning mode in which no right answers are given, it is not intended to acquire phonemes actually used in a human society, and thus agents cannot necessarily acquire the same phonemes as those actually used in a human society. This is also true even when sounds uttered by a human user are input to each agent instead of sounds uttered by other agents.
This is because, in the learning according to the technique disclosed in Non-Patent Document 2, the modification of the articulate SOM is performed using parameters of some nodes of the articulate SOM as references (input), and thus parameters of nodes of the articulate SOM can change (can be modified) only within the range in which the initial values of parameters are distributed. To make it possible for each agent to acquire the same phonemes as those actually used in a human society, it is necessary to give, as initial values of parameters of nodes of the articulate SOM, values distributed over the entire range in which all phonemes used in the human society are included. However, it is difficult to give such values.
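The range limitation described above can be checked numerically with a simplified sketch. It is not the algorithm of Non-Patent Document 2: it assumes that every node is moved by the same fraction lr toward the parameter of one existing node, and the names update_toward_reference and run, as well as the initial range of 0.2 to 0.4, are hypothetical. Because each new parameter is a weighted average of two existing parameters, no parameter can leave the range spanned by the initial values.

```python
import numpy as np

def update_toward_reference(params, ref_index, lr=0.3):
    """Move every node parameter a fraction lr toward one existing node's parameter."""
    ref = params[ref_index].copy()
    # Each result is (1 - lr) * old + lr * ref, a weighted average for 0 <= lr <= 1.
    return params + lr * (ref - params)

def run(n_nodes=10, n_steps=100, seed=0):
    """Repeat the update with randomly chosen reference nodes."""
    rng = np.random.default_rng(seed)
    # Initial values cover only part of the articulate space (assumed range).
    params = rng.uniform(0.2, 0.4, size=(n_nodes, 2))
    lo, hi = params.min(axis=0), params.max(axis=0)
    for _ in range(n_steps):
        params = update_toward_reference(params, rng.integers(n_nodes))
    return params, lo, hi

params, lo, hi = run()
# All parameters remain inside the box spanned by the initial values.
inside = bool(np.all(params >= lo - 1e-12) and np.all(params <= hi + 1e-12))
```

Under these assumptions, a phoneme whose parameter lies outside the initial range cannot be reached however many updates are applied.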
It is troublesome for a user to give right answers on purpose to a robot that should acquire the same phoneme structure as that used by the user via dialogs between the user and the robot.
In view of the above, it is desirable that the robot should acquire the same phoneme structure as that used by the user via user-robot dialogs in which the user speaks without intention of giving right answers.
To acquire a phoneme structure in the above-described manner, the robot must be capable of behaving adaptively in response to stimuli applied to the robot, that is, the robot needs to speak adaptively depending on the speech of a user. In other words, the robot needs to actively change the sound it utters as an action and to self-evaluate the uttered sound, that is, to evaluate (judge) whether the sound uttered by the robot is similar to the sound uttered by the user.