1. Field of the Invention
The present invention relates to a data processing device, a data processing method, and a program, and more particularly relates to a data processing device, a data processing method, and a program whereby a robot, for example, can carry out large-scale learning in a practical manner and act autonomously.
2. Description of the Related Art
Forward models and inverse models can be applied to realize robots which autonomously perform tasks, for example.
FIG. 1 illustrates the concept of forward models and inverse models.
Let us say that there is certain input data serving as time-sequence data (data in time-sequence), and there is an object of control, which outputs output data serving as other time-sequence data, that is provided for the input data. Here, while detailed information relating to the object of control is unknown (i.e., while the interior of the object of control is unknown), the input data provided to the object of control, and the output data obtained from the object of control with regard to the input data, can be observed.
The physical value of the input data provided to the object of control and the output data obtained from the object of control with regard to the input data may be large or small, as long as it is observable. Also, any object (thing) will work as long as input data can be provided thereto and further output data can be obtained as to the input data.
Accordingly, various objects can be the object of control, examples of which include a ball, musical instrument, automobile, gas stove, to mention just a few. For example, in the case of a ball, applying (providing) force as input data yields the position and speed of the ball as output data which changes as to the input data. Also, in the case of an automobile, operating the steering wheel, gas pedal, brake, etc., i.e., providing operations thereof, yields the position and speed of the automobile as output data which changes as to the input data. Further, in the case of a gas stove, operating the size of the flame as input data yields room temperature as output data which changes as to the input data.
It should be noted that the term “data” as used in “input data”, “output data”, later-described “control data”, and so forth, throughout the present Specification, and the drawings, claims, and all other documents attached thereto, is not restricted to the concept of structured or formatted information; rather, this term encompasses all forms of energy and force applied to the object or effected thereby, as long as such can be physically observed, measured, and/or quantified. A specific example of the scope of such input would be to say that the action of operating a valve, for example, in the above-described gas stove, to change the size of the flame would constitute such input data, but the intent of the operator to do so would not. More specifically, any physical action of which the physical value is meaningful to, or effectually acts upon, the object is what is meant by this term, and accordingly, verbal instructions given to the gas stove would not be included in this scope if the gas stove is only operable by a twist knob for example, but would be included in this scope if the gas stove were provided with, for example, a microphone, speech recognition functions, command analysis functions, and a mechanism to execute physical action of changing the flame size so as to carry out the verbal command issued by the user. On the other hand, in a rather unlikely case wherein the input data to be applied is to physically throw the gas stove a certain distance, for example, the force applied thereto to that end would be the input data. In this way, the intent, or motive, behind the input data is not unrelated to what constitutes the input data; however, the intent or motive is never part of the input data.
Moreover, even in a case wherein control of the object is realized by electroencephalography, such as technology being developed by MIT Media Lab Europe wherein a device or computer can be controlled wirelessly directly from the human brain, the output from the headset would serve as the input data to the object of control, while the intent or motive of the user would not. The scope of the term “data” as used in the present specification is to be thus understood.
With an arrangement wherein input data is thus provided to an object of control and output data is obtained thereby, a model of the object of control is called a forward model.
With a forward model, upon inputting input data (upon input data being provided), a prediction value of output data obtained from the object of control as to that input data is output. Accordingly, with a forward model, output data which would be obtained from the object of control as to input data can be predicted even without providing the object of control with actual input data.
On the other hand, an inverse model is a model wherein a target value of output data obtained from the object of control is determined, and the input data to be provided to the object of control so as to obtain the output data of that target value is predicted. While a forward model can be viewed as mapping input data to output data, an inverse model is the opposite thereof.
Hereinafter, the input data to be provided to the object of control so as to obtain output data of the target value with an inverse model will be referred to as “control data” as appropriate.
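The relation between a forward model and an inverse model can be sketched as follows. This is a minimal illustration only, assuming a hypothetical object of control whose output happens to depend linearly on a scalar input; the teaching data values and the names fit_line, forward_model, and inverse_model are assumptions made for the example and do not appear in the present Specification.

```python
# Hypothetical teaching data: (input u, observed output y) pairs from an
# object of control whose interior is unknown. The values are made up.
teaching_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def fit_line(pairs):
    """Least-squares fit of b = slope * a + offset over (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var = sum((a - mean_a) ** 2 for a, _ in pairs)
    slope = cov / var
    return slope, mean_b - slope * mean_a


# Forward model: maps input data to a prediction value of the output data.
fwd_slope, fwd_offset = fit_line(teaching_data)


def forward_model(u):
    return fwd_slope * u + fwd_offset


# Inverse model: maps a target value of the output data to control data.
# Here it is fitted on the same teaching data with the roles swapped.
inv_slope, inv_offset = fit_line([(y, u) for u, y in teaching_data])


def inverse_model(y_target):
    return inv_slope * y_target + inv_offset


u_control = inverse_model(9.0)          # control data for a target output of 9.0
y_predicted = forward_model(u_control)  # output predicted for that control data
```

Feeding the control data from the inverse model back through the forward model returns the target value in this toy case, which is the sense in which the two models are opposites of one another.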
Forward models and inverse models such as described above can be applied to robots, more particularly to the configuration of robots.
Let us say that a robot has a microphone and camera so as to be capable of input of audio (sound) data and image data, and also has a speaker and actuator (motor) so as to be capable of outputting audio (audio data) and moving an arm by driving the motor following motor data (motor signals).
With such a robot, a traditional approach for outputting audio data as output data or moving an arm as desired as output data, in response to input data such as audio data or image data, is to use an audio recognition device or image recognition device and to program (design) beforehand what sort of audio data should be output or what sort of motor data should be output in response to recognition results of the audio data or image data input to the robot.
In contrast, using a forward model enables a robot which outputs desired audio data as output data or moves an arm as desired as output data, in response to input data such as audio data or image data, to be envisioned as an object of control, and the actual robot to be configured as a forward model of the robot envisioned as the object of control (hereinafter referred to as “anticipated robot” as suitable), as shown in FIG. 2. That is to say, a robot can be configured as a forward model of the anticipated robot, if an actual robot can be made to learn the relation between input data and output data to and from the anticipated robot.
Specifically, input data such as the audio data and image data to be input to the anticipated robot, and output data such as audio data and motor data to be output in response to the respective input data, are prepared beforehand as a set, and provided to an actual robot. If the actual robot can obtain a forward model of the anticipated robot predicting (i.e., outputting) output data corresponding to the input data, using only the set of input data and output data externally provided thereto (hereinafter referred to as “teaching data” as suitable), then output data such as desired audio data and motor data and the like can be output in response to input data such as audio data and image data and the like which is actually input.
Also, using an inverse model enables arm control equipment to be configured for controlling a robot arm serving as the object of control, as shown in FIG. 3.
That is to say, let us say that there is a robot arm here which is moved by a motor which performs driving according to motor data, which is input data, and that the position of the tip of the arm changes accordingly. Further, let us say that, with the center of gravity of the robot as the point of origin thereof, the position of the tip of the arm can be represented with the coordinates (x, y, z) in a three-dimensional coordinate system, in which the forward (frontal) direction of the robot is the x axis, the sideways direction of the robot as the y axis, and the vertical direction thereof as the z axis. In this case, the motor performs driving in accordance with the motor data so as to further change the position of the tip of the arm, such that the tip of the arm traces a certain path, in accordance with the three-dimensional coordinate system. Note that here, the sequence of coordinates of the path which the tip of the arm traces (tip position path) will be referred to as “tip position path data”.
In order to cause the arm to trace a desired tip position path, i.e., in order to obtain output of desired tip position path data as the output data, motor data whereby the motor performs driving such that the arm traces such a tip position path needs to be provided to the motor as input data.
Now, if an inverse model can be obtained for predicting motor data serving as input data (control data) whereby certain tip position path data can be obtained as target values, using only teaching data made up of the set of motor data serving as input data and tip position path data serving as output data due to the motor data having been supplied to the motor, the inverse model can be used for arm control equipment for determining motor data corresponding to tip position path data which is the target value.
With arm control equipment serving as an inverse model for an arm, inputting tip position path data as input data to the robot allows the robot to use the arm control equipment to determine the corresponding motor data (control data). The robot then drives the motor thereof following the motor data, whereby the arm of the robot moves so as to trace a path corresponding to the tip position path data which is the input data.
Thus, if a forward model or inverse model can be obtained using only the set of input data and output data (i.e., teaching data), a robot which outputs output data corresponding to the respective input data can be readily configured, using forward and inverse models.
As for a method for obtaining such a forward model or inverse model as described above, there is modeling using a linear system.
With modeling using a linear system, as shown in FIG. 4 for example, with input data to the object of control at point-in-time t as u(t) and output data thereat as y(t), the relation between the output data y(t) and input data u(t), i.e., the object of control, is approximated as a linear system obtained from Expression (1) and Expression (2).
x(t+1) = Ax(t) + Bu(t)  (1)
y(t) = Cx(t)  (2)
Here, x(t) is called a state variable of the linear system at the point-in-time t, with A, B, and C being coefficients. To facilitate description here, if we say that the input data u(t) and output data y(t) are one-dimensional vectors (scalars) and the state variable x(t) is an n-dimensional vector (wherein n is an integer value of 2 or higher in this case), then A, B, and C are constant matrices of size n×n, n×1, and 1×n, respectively.
With modeling using a linear system, the matrices A, B, and C are determined such that the relation between the observable input data u(t) and the output data y(t) observed when the input data u(t) is provided to the object of control satisfies the Expression (1) and Expression (2), thereby yielding a forward model of the object of control.
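As a rough illustration of how such a forward model produces predictions once the matrices are known, the recursion of Expression (1) and Expression (2) can be sketched as below for scalar u(t) and y(t) and n = 2. The numerical values of A, B, and C, the initial state, and the function names step and simulate are assumptions chosen only for the example; how the matrices are identified from observed data is not shown.

```python
def step(A, B, C, x, u):
    """One step of x(t+1) = A x(t) + B u(t) and y(t) = C x(t), scalar u and y."""
    y = sum(c * xi for c, xi in zip(C, x))
    x_next = [sum(a * xi for a, xi in zip(row, x)) + b * u
              for row, b in zip(A, B)]
    return x_next, y


def simulate(A, B, C, x0, inputs):
    """Predict the output sequence y(0), y(1), ... for an input sequence."""
    x, outputs = list(x0), []
    for u in inputs:
        x, y = step(A, B, C, x, u)
        outputs.append(y)
    return outputs


# Assumed example coefficients (not taken from the Specification).
A = [[0.5, 0.0], [0.0, 0.25]]  # n x n matrix
B = [1.0, 1.0]                 # n x 1 matrix, stored as a flat list
C = [1.0, 2.0]                 # 1 x n matrix, stored as a flat list

outputs = simulate(A, B, C, x0=[0.0, 0.0], inputs=[1.0, 0.0, 0.0])
```

With these assumed values, a single unit input at t = 0 yields a response that appears in the output one step later and then decays, since y(t) depends on u only through the state x(t).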
However, modeling using a linear system is insufficient for complicated objects of control, i.e., is insufficient for modeling an object of control having non-linear properties, for example.
That is to say, an actual object of control is complicated, and often has non-linear properties, but modeling the object of control by approximating it as a simple linear system results in great prediction error in the output data predicted as to the input data in a forward model or input data (control data) predicted as to the output data in an inverse model, so prediction with high precision is difficult.
Accordingly, as for a method to obtain a forward model or inverse model as to an object of control which has non-linear properties, there is a method for using a neural network to learn teaching data, i.e., a set of input data provided to the object of control and output data observed from the object of control when the input data is provided thereto. A neural network is a network configured by mutually connecting artificial elements imitating neurons, and can learn the relation within externally provided teaching data, i.e., the relation between input data and output data.
However, in order to suitably model the object of control with a neural network, there is the need for the size of the neural network to be great according to the complexity of the object of control. Increasing the size of the neural network markedly increases the time necessary for learning, and also stable learning becomes more difficult. This also holds true in the event that the order of dimension of the input data or output data is great.
On the other hand, in the event of obtaining a forward model or inverse model using only the set of input data and output data (teaching data), there is the need for learning to be performed using the teaching data, and for whether or not the teaching data falls under one of several patterns to be recognized. That is to say, there is the need for patterns of input data and output data serving as teaching data to be learned and recognized.
The technique for learning and recognizing patterns is generally called pattern recognition, and learning under pattern recognition can be classified into learning with a tutor (supervised learning) and learning without a tutor (unsupervised learning).
Supervised learning is a method wherein information is provided regarding to which class learning data of each pattern belongs (called “true label”), and learning data belonging to a pattern is learned for each pattern, with many learning methods using neural networks or the HMM (Hidden Markov Model) having been proposed.
FIG. 5 illustrates an example of supervised learning. With supervised learning, learning data to be used for learning is provided beforehand in anticipated categories (classes), such as categories of phonemes, phonetic categories, word categories, and so forth, for example. For example, in a case of learning audio data of voices “A”, “B”, and “C”, audio data for a great number of each of “A”, “B”, and “C” is prepared.
On the other hand, the models used for learning (models by which learning data of each category is learned) are likewise prepared for each anticipated category. Now, models are defined by parameters. For example, HMMs or the like are used as a model for learning audio data. An HMM is defined by the probability of state transition from one state to another state (including the original state), an output probability density function representing the probability density of observed values output from the HMM, and so forth.
With supervised learning, learning of the models of each category (class) is performed using only the learning data of that category. That is to say, in FIG. 5, learning of the category “A” model is performed using only learning data of the category “A”, learning of the category “B” model is performed using only learning data of the category “B”, and learning of the category “C” model is performed using only learning data of the category “C”.
With supervised learning, there is the above-described need to use learning data of each category to perform learning of a model of that category, so learning data is prepared for each category, learning data of that category is provided to a model for learning the category, and thus a model is obtained for each category. Consequently, according to supervised learning, a template (a model of a class (category) represented by the true label) can be obtained for each class, based on the true label.
At the time of recognition, a template which most closely matches data which is the object of recognition (a template with the greatest likelihood) is obtained, and the true label of that template is output as the recognition result.
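The per-category learning and the recognition step described above can be sketched as follows. For brevity the sketch swaps in a one-dimensional Gaussian as the model of each category rather than an HMM, and the learning data values and the names learn_template, log_likelihood, and recognize are assumptions made for illustration only.

```python
import math

# Hypothetical learning data, already sorted by true label.
learning_data = {
    "A": [1.0, 1.2, 0.8, 1.1],
    "B": [3.0, 3.3, 2.9, 3.1],
    "C": [5.0, 4.8, 5.2, 5.1],
}


def learn_template(samples):
    """Learn one category model (mean and variance) from that category only."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1e-6)  # floor the variance to avoid division by zero


def log_likelihood(x, template):
    """Log of the Gaussian density of x under a (mean, variance) template."""
    mean, var = template
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


# One template per category, each learned using only its own learning data.
templates = {label: learn_template(samples)
             for label, samples in learning_data.items()}


def recognize(x):
    """Return the true label of the template with the greatest likelihood."""
    return max(templates, key=lambda label: log_likelihood(x, templates[label]))
```

For example, recognize(3.2) selects the template learned from the category "B" data, because that template gives the observation the greatest likelihood.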
On the other hand, unsupervised learning is learning performed in a state wherein no true label is provided to learning data of each pattern, and is a learning method which uses a neural network or the like, for example. Unsupervised learning differs greatly from supervised learning in that no true label is provided.
Now, pattern recognition can be viewed as quantization of a signal space where data (signals) to be recognized by the pattern recognition is observed. Particularly, pattern recognition in cases wherein the data to be recognized is vector data may be called vector quantization.
With learning of vector quantization (codebook generation), a representative vector corresponding to a class (referred to as “centroid vector”) is situated in the signal space where the data to be recognized is observed.
A representative technique for unsupervised learning of vector quantization is the K-means clustering method. With the K-means clustering method, in an initial state, centroid vectors are randomly situated, a vector serving as learning data is assigned to a centroid vector at the closest distance, and the centroid vectors are updated by an average vector of the learning data assigned to the centroid vectors, with this process being repeatedly performed. Note that a group of centroid vectors is called a codebook.
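The K-means procedure just described can be sketched as below. This is a minimal illustration under stated assumptions: the initial centroid vectors are drawn at random from the learning data itself, a fixed number of iterations is used in place of a convergence test, and the names k_means and squared_distance and the sample data are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(vectors, k, iterations=10, seed=0):
    """Batch K-means: returns a codebook (list of k centroid vectors)."""
    rng = random.Random(seed)
    # Initial state: centroid vectors situated at randomly chosen data points.
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assign every learning vector to the centroid at the closest distance.
        assigned = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: squared_distance(v, codebook[i]))
            assigned[nearest].append(v)
        # Update each centroid to the average vector of its assigned data.
        for i, members in enumerate(assigned):
            if members:  # keep the old centroid if nothing was assigned
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


learning_data = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
codebook = k_means(learning_data, k=2)
```

With two well-separated groups of learning data as in this example, the two centroid vectors settle on the averages of the respective groups.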
Now, the method for accumulating a great number of learning data and using all to perform learning is called “batch learning”; K-means clustering is classified in batch learning. As opposed to batch learning, learning wherein each time learning data is observed the learning data is used to perform learning, thereby updating parameters (centroid vector components, output probability density functions defining an HMM, etc.) a little at a time is called “on-line learning”.
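In contrast with the batch procedure, an on-line update touches the parameters once per observed sample. The following sketch shows one such update of a codebook, moving only the closest centroid vector a small step toward each sample as it arrives; the learning rate value, the function name online_update, and the sample stream are assumptions for illustration.

```python
def online_update(codebook, sample, learning_rate=0.1):
    """Move the centroid nearest to `sample` a little toward it, in place."""
    nearest = min(
        range(len(codebook)),
        key=lambda i: sum((s - c) ** 2 for s, c in zip(sample, codebook[i])),
    )
    codebook[nearest] = [c + learning_rate * (s - c)
                         for s, c in zip(sample, codebook[nearest])]
    return nearest


codebook = [[0.0, 0.0], [10.0, 10.0]]
stream = [[1.0, 1.0], [9.0, 9.0], [1.0, 1.0]]  # samples observed one at a time
for sample in stream:
    online_update(codebook, sample)  # parameters change after every sample
```

No learning data needs to be accumulated here; each sample can be discarded once its update has been applied.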
A known form of on-line learning is SOM (self-organization map) learning, proposed by Teuvo Kohonen. With SOM learning, the weights joining the input layer and the output layer of a SOM are gradually corrected (updated) by on-line learning.
That is to say, in a SOM, an output layer has multiple nodes, with weight vectors provided to each node of the output layer. In a case wherein the weight vector is a centroid vector, the SOM learning is vector quantization learning.
Specifically, of the nodes in an output layer of a SOM, the node of which the distance between its weight vector and a vector serving as the learning data is the smallest is determined to be the winning node matching that vector best, and the weight vector of the winning node is updated so as to be closer to the vector serving as the learning data. Further, weight vectors of nodes nearby the winning node are also updated so as to be closer to the learning data. Consequently, as learning progresses, nodes with similar weight vectors are placed so as to be closer to one another on the output layer, and dissimilar nodes distant one from another. Accordingly, a map is configured on the output layer, corresponding to a pattern included in the learning data, as it were. This sort of learning wherein similar nodes (nodes of which weight vectors are similar) are grouped close to one another as learning progresses so as to configure a map corresponding to a pattern included in the learning data is referred to as “self-organizing learning”, or “self organization”.
Now, with K-means clustering, only the vector closest to the learning data is updated, so the updating method thereof is called “WTA (winner-take-all)”. On the other hand, learning with a SOM is such that not only the weight vector of the nearest node to the learning data (winning node) but also weight vectors of nodes nearby the winning node are updated, so the updating method thereof is called “SMA (soft-max adaptation)”. It is known that while WTA learning tends to fall into localized learning, this problem can be improved with SMA learning.
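The difference between the two updating methods can be sketched with a single on-line step on a small SOM. The sketch assumes a one-dimensional chain of output-layer nodes and a Gaussian weighting over the distance between node positions, which is one common choice and is not stated in the text; setting the radius to zero reduces the step to a WTA update. The name som_update and all numerical values are hypothetical.

```python
import math


def som_update(weights, sample, learning_rate=0.5, radius=1.0):
    """One on-line SOM step on a 1-D chain of nodes, updating `weights` in place.

    radius > 0: the winning node and its neighbours move toward the sample (SMA).
    radius == 0: only the winning node moves (WTA).
    """
    distances = [sum((s - w) ** 2 for s, w in zip(sample, node))
                 for node in weights]
    winner = distances.index(min(distances))  # node matching the sample best
    for index, node in enumerate(weights):
        if radius > 0:
            # Assumed Gaussian neighbourhood over position on the output layer.
            influence = math.exp(-((index - winner) ** 2) / (2 * radius ** 2))
        else:
            influence = 1.0 if index == winner else 0.0
        weights[index] = [w + learning_rate * influence * (s - w)
                          for s, w in zip(sample, node)]
    return winner


weights = [[0.0], [1.0], [2.0]]  # weight vectors of three output-layer nodes
winner = som_update(weights, [0.0], learning_rate=0.5, radius=1.0)
```

In this call the nodes next to the winning node are also pulled toward the sample, with the pull weakening as the distance from the winning node on the output layer grows.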
For more about SOM, see “Self-Organization Map” by Teuvo Kohonen, published by Springer Japan, for example.
Now, research is being performed on a framework for a robot to acquire a structure for perceptive actions through the actions of the robot itself, so as to make the behavior (actions) of the robot in the real world more natural. Note that “perceptive actions” means that a robot or the like perceives (recognizes) an external state (including the state of the robot itself) and acts according to the results of the perception.
In order to cause a robot to perform perceptive actions, there is the need to obtain appropriate motor data to be supplied to the motor driving the robot, as to sensor data which a sensor detecting the external state outputs, for example.
Generally, sensor data output from a sensor, and motor data supplied to a motor, are both continuous time-sequence data. Also, robots which perform perceptive actions in the real world need to handle data with a great number of dimensions for the sensor data and motor data. Moreover, the behavior of sensor data and motor data handled by the robot is complex, and cannot readily be modeled with a linear system.
Now, the present assignee has already proposed a method for using a time-sequence pattern storage network configured of multiple nodes having a time-sequence pattern model representing a time-sequence pattern, which is a pattern of time-sequence data such as sensor data or motor data, to perform self-organizing learning of time-sequence data such as sensor data and motor data which are time-sequence systems of multi-dimensional vectors, and further to join the nodes of a time-sequence pattern storage network which has learned time-sequence data which is input data with those of a time-sequence pattern storage network which has learned time-sequence data which is output data, so as to perceive an external state, and generate output data, based on input data, corresponding to actions the robot should take based on the results of perception (e.g., see Japanese Unexamined Patent Application Publication No. 2006-162898).
Now, a time-sequence pattern storage network has in common with known SOMs that it is configured of multiple nodes and can perform learning, and accordingly can be said to be a type of SOM. However, a time-sequence pattern storage network differs from known SOMs in that the nodes have time-sequence pattern models and that time-sequence patterns are held in storage structures of the time-sequence pattern models.