1. Field of the Invention
The present invention relates to a learning apparatus and learning method, a generation apparatus and generation method, and a computer program. In particular, the present invention relates to a learning apparatus and learning method, a generation apparatus and generation method, and a computer program for easily acquiring one of a high-precision forward model and a high-precision backward model of an object to be controlled, with an input thereto and an output therefrom being observable, the input and the output being time series data such as a sound.
2. Description of the Related Art
FIG. 1 illustrates a forward model and a backward model of an object to be controlled.
In response to time series data as input data, an object to be controlled provides output data as another time series data. Although details (interior) of the object are unknown, the input data to the object and the output data the object provides in response to the input data can be observed.
The input data input to the object and the output data the object provides in response to the input data can be any physical quantities as long as they are observable. The object can be anything as long as it can receive the input data and output the output data in response to the input data.
For example, a ball, a musical instrument, an automobile, a gas heater, or the like can be an object to be controlled. With a force, as input data, applied onto a ball, a position and speed of the ball are obtained as output data. For example, when one of a steering wheel, an acceleration pedal, and a brake pedal is operated as an input in an automobile, a position and speed of the automobile are obtained as output data. For example, in a gas heater, fire power may be adjusted as input data, and a room temperature is obtained as output data in response to the input data.
If input data is applied to the object to be controlled, and output data is obtained in response, the object is modeled as a forward model.
The forward model outputs an inferred value of output data in response to the input data. In the forward model, the output data that could be obtained from the object can be inferred without applying actual input data to the object.
In a backward model, a target value of the output data to be obtained from the object to be controlled is determined. The input data to be applied to the object is then inferred to obtain the output data having the target value. The forward model can be considered as a mapping from the input data to the output data while the backward model is considered as the inverse mapping from the output data to the input data.
The input data to obtain the output data having the target value on the backward model is referred to as control data as necessary.
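The two mappings can be illustrated with a minimal sketch. It assumes scalar data rather than time series, and it infers values by a simple nearest-neighbor lookup over stored pairs, which is chosen here only for illustration. The names `SUPERVISING_DATA`, `forward_model`, and `backward_model` and all numeric values are hypothetical.

```python
# Hypothetical (fire power, room temperature) pairs observed from a gas heater.
# The values are invented for illustration only.
SUPERVISING_DATA = [(0.0, 10.0), (1.0, 14.0), (2.0, 19.0), (3.0, 25.0)]


def forward_model(u):
    """Infer output data for input u from the stored pair whose input is closest."""
    _, y = min(SUPERVISING_DATA, key=lambda pair: abs(pair[0] - u))
    return y


def backward_model(y_target):
    """Infer control data from the stored pair whose output is closest to the target."""
    u, _ = min(SUPERVISING_DATA, key=lambda pair: abs(pair[1] - y_target))
    return u


if __name__ == "__main__":
    print(forward_model(2.2))    # inferred room temperature for fire power 2.2
    print(backward_model(20.0))  # control data for a target temperature of 20
```

The forward lookup searches over inputs and the backward lookup searches over outputs, so both directions use the same stored pairs.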
The forward model and the backward model can be implemented in the configuration of a robot.
For example, a robot may be equipped with a microphone and a camera to receive audio data and video data and may also be equipped with a loudspeaker to output a sound (audio data) and an actuator (motor). In response to motor data (motor signal), the motor is operated to move an arm.
In response to the input data such as the audio data and the video data, such a robot outputs the audio data as the output data, and the motor data as the output data to move the arm. In the known art, the robot including a voice recognition device and an image recognition device recognizes the voice data and the video data input thereto. The robot is preprogrammed (designed) as to what audio data and what motor data to output in response to the recognition result.
If the forward model is used as shown in FIGS. 2A and 2B, a robot outputting desired audio data and motor data for a desired arm movement in response to the audio data and the video data is assumed as an object to be controlled. The robot assumed to be the object to be controlled (hereinafter referred to as an assumed robot as necessary) forms a forward model. If the robot is caused to learn the relationship between the input data to the robot and the output data from the robot, the robot becomes an assumed robot as a forward model.
More specifically, a set of the input data, such as the audio data and the video data to be input to the robot, and the output data, such as the audio data and the motor data the robot is to output in response to the input data, is prepared, and then actually applied to the robot. If a forward model of an assumed robot providing output data responsive to input data is inferred (output) using an externally supplied set of input data and output data (hereinafter referred to as supervising data), output data such as desired audio data and desired motor data can be output in response to the input data such as the audio data and the video data actually input.
With the backward model, an arm controller controlling an arm as an object to be controlled of a robot is configured as shown in FIGS. 3A and 3B.
More specifically, the robot arm is moved by a motor that runs in response to motor data as input data. As a result, a distal end of the arm moves in position. With the origin placed at the center of gravity of the robot, the forward direction of the robot is aligned with the x-axis, the right direction (viewed from the robot) is aligned with the y-axis, and the upward direction is aligned with the z-axis. The position of the distal end of the arm is determined in the three-dimensional (x, y, z) coordinates. In response to the motor data, the motor runs, and moves the end of the arm in position. As a result, the end of the arm forms a trajectory. A sequence of coordinates of the trajectory the arm end follows is referred to as end position trajectory data.
In order to cause the arm to follow a desired end position trajectory, namely, in order to output desired end position trajectory data as output data, motor data driving the motor to move the arm along the end position trajectory needs to be supplied to the motor as input data.
Using only supervising data as a set of motor data as input data and end position trajectory data as output data responsive to the motor data given to the motor, motor data as input data (control data) achieving end position trajectory data (output data) as a target value is inferred. If a backward model of the arm is determined, the backward model can be used as an arm controller that determines motor data responsive to the end position trajectory data as a target value.
If the end position trajectory data is input as the input data to the robot with the arm controller as the backward model of the arm, the robot determines the corresponding motor data (control data) using the arm controller. With the robot driving the motor in accordance with the motor data, the robot arm moves along the trajectory defined by the end position trajectory data as the input data.
If the forward model and the backward model are determined using only the set of the input data and the output data (supervising data), a robot outputting the output data responsive to the input data is easily configured based on the forward model and the backward model.
A modeling method using a linear system is available as a method of determining a forward model and a backward model of an object to be controlled.
In the modeling based on the linear system, input data u(t) input to the object to be controlled at time t and output data y(t) output from the object at time t are related by equations (1) and (2). In other words, the object to be controlled is approximated as a linear system by equations (1) and (2):
x(t+1) = Ax(t) + Bu(t)  (1)
y(t) = Cx(t)  (2)
wherein x(t) is referred to as a state variable of the linear system at time t, and A, B, and C are coefficients. For simplicity of explanation, let the input data u(t) and the output data y(t) be a one-dimensional vector (scalar quantity), and the state variable x(t) be an n-dimensional vector. A, B, and C are then constants represented by an n×n matrix, an n×1 matrix, and a 1×n matrix, respectively (n is an integer of 2 or larger).
In the modeling based on the linear system, the forward model of the object to be controlled is obtained by determining matrices A, B, and C so that the relationship between the observable input data u(t) and the output data y(t), obtained when the input data u(t) is input to the object to be controlled, satisfies equations (1) and (2).
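As a minimal sketch of how such a forward model infers output data once A, B, and C are known, the following code iterates equations (1) and (2) for n = 2. The matrices, the input sequence, and the function names `mat_vec` and `simulate` are hypothetical values chosen for illustration. The sketch does not show how A, B, and C are determined from observed data.

```python
def mat_vec(matrix, vector):
    """Multiply an n x n matrix (list of rows) by an n-dimensional vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def simulate(A, B, C, inputs, x0):
    """Iterate x(t+1) = A x(t) + B u(t) and y(t) = C x(t) over a scalar input series."""
    x = list(x0)
    outputs = []
    for u in inputs:
        outputs.append(sum(c * xi for c, xi in zip(C, x)))  # equation (2)
        Ax = mat_vec(A, x)
        x = [ax + b * u for ax, b in zip(Ax, B)]            # equation (1)
    return outputs


# Hypothetical coefficients: A is 2x2, B is 2x1 (stored flat), C is 1x2 (stored flat).
A = [[0.5, 0.0], [0.0, 0.25]]
B = [1.0, 1.0]
C = [1.0, 2.0]

# Response to a unit impulse applied at t = 0, starting from a zero state.
outputs = simulate(A, B, C, inputs=[1.0, 0.0, 0.0], x0=[0.0, 0.0])

if __name__ == "__main__":
    print(outputs)
```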
The linear system modeling technique is not sufficient to model a complex object, such as the one having non-linear characteristics.
Actual objects to be controlled are complex, and often have non-linear characteristics. If such an object is modeled by approximating the object as a simple linear system, the output data the forward model infers in response to the input data and the input data (control data) the backward model infers in response to the output data are subject to large inference errors, and high-accuracy inference cannot be performed.
Methods of obtaining a forward model and a backward model of an object having non-linear characteristics are available. For example, supervising data, namely, a set of input data supplied to an object to be controlled and output data observed with the input data supplied to the object, is learned using a neural network. The neural network is produced by mutually linking artificial elements simulating neurons, and can learn a relationship of supervising data supplied from the outside, namely, a relationship between the input data and the output data.
To model an object using a neural network, the neural network needs to be scaled up in size in accordance with the complexity of the object to be controlled. As the scale of the neural network becomes large, time required for learning is substantially increased. Reliable learning is difficult to perform. The same is true if the number of dimensions of input data and output data is large.
When the forward model and the backward model are determined using only the set of the input data and the output data (supervising data), learning is performed using the supervising data in order to recognize which of several patterns the supervising data matches. More specifically, the pattern of the input data and the output data as the supervising data needs to be learned and recognized.
Techniques recognizing a pattern by learning are generally called pattern recognition. The learning techniques by pattern recognition are divided into supervised learning and unsupervised learning.
In the supervised learning, information concerning the class to which the learning data of each pattern belongs is provided. Such information is called a correct-answer label. The learning data belonging to a given pattern is learned on a per pattern basis. Numerous learning methods, including template matching, neural networks, and the hidden Markov model (HMM), have been proposed.
FIG. 5 illustrates a known supervised learning process.
In the supervised learning, learning data for use in learning is prepared according to assumed category (class), such as phoneme category, phonological category, or word category. To learn voice data of pronunciations of “A”, “B”, and “C”, a great deal of voice data of pronunciations of “A”, “B”, and “C” is prepared.
A model used in learning (a model for learning the learning data of each category) is prepared on a per category basis. The model is defined by parameters. For example, to learn voice data, an HMM is used as a model. The HMM is defined by a state transition probability of transitioning from one state to another state (including the original state) and an output probability density function representing the probability density of an observed value output from the HMM.
In the supervised learning, the learning of each category (class) is performed using learning data of that category alone. As shown in FIG. 5, a model of category “A” is learned using learning data of “A” only, and a model of category “B” is learned using learning data of “B” only. Likewise, a model of category “C” is learned using learning data of category “C” only.
In the supervised learning, the learning of the model of a category needs to be performed using the learning data of that category. The learning data is prepared on a category by category basis, and the learning data of that category is provided for learning of the model of that category. The model is thus obtained on a per category basis. More specifically, in the supervised learning, a template (model of a class (category) represented by a correct-answer label) is obtained on a per class basis.
During recognition, a correct-answer label of a template (having the maximum likelihood) most appropriately matching data to be recognized is output.
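The per-category learning and maximum-likelihood recognition described above can be sketched as follows. This is a simplified illustration that assumes scalar features and a single normal distribution per category instead of an HMM. The learning data values and the names `learn_model`, `log_likelihood`, and `recognize` are hypothetical.

```python
import math


def learn_model(samples):
    """Learn one category's model (mean, variance) from that category's data alone."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(variance, 1e-6)  # floor the variance to avoid division by zero


def log_likelihood(x, model):
    """Log density of x under the normal distribution defined by the model."""
    mean, variance = model
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)


def recognize(x, models):
    """Return the correct-answer label of the model with the maximum likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))


# Hypothetical learning data, prepared separately for each category.
LEARNING_DATA = {
    "A": [1.0, 1.2, 0.8],
    "B": [5.0, 5.3, 4.7],
    "C": [9.0, 9.1, 8.9],
}
models = {label: learn_model(data) for label, data in LEARNING_DATA.items()}

if __name__ == "__main__":
    print(recognize(5.1, models))
```

Each model sees only the data carrying its own label, which is why the correct-answer label is indispensable in this scheme.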
The unsupervised learning is performed with no correct-answer label provided to the learning data of each pattern. For example, learning methods using template matching and neural networks are available in the unsupervised learning. The unsupervised learning is thus substantially different from the supervised learning in that no correct-answer label is provided.
The pattern recognition is considered as a quantization of signal space in which data (signal) to be recognized in the pattern recognition is observed. If the data to be recognized is a vector, the pattern recognition is referred to as a vector quantization.
In the vector quantization learning, a representative vector corresponding to class (referred to as a centroid vector) is arranged in the signal space where the data to be recognized is placed.
K-means clustering method is available as one of typical unsupervised learning vector quantization techniques. In the K-means clustering method, the centroid vector is placed appropriately in the initial state of the process. A vector as the learning data is assigned to the centroid vector closest in distance thereto, and the centroid vector is updated with a mean vector of the learning data assigned to the centroid vector. This process is iterated.
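The iteration described above can be sketched as follows. The initial centroid vectors, the fixed iteration count, and the function names are assumptions made for illustration. Keeping the centroid of an empty cluster unchanged is one possible convention, not something the description above specifies.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, centroids, iterations=10):
    """Batch K-means: assign every vector, then replace each centroid by the mean."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for vector in data:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(vector, centroids[i]),
            )
            clusters[nearest].append(vector)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical learning data forming two well-separated groups.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = k_means(data, centroids=[(0.0, 0.0), (10.0, 10.0)])

if __name__ == "__main__":
    print(centroids)
```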
Batch learning is also known. In the batch learning, a large number of learning data units is stored and all learning data units are used. The K-means clustering method is classified as batch learning. In online learning, as opposed to the batch learning, learning is performed using learning data each time the learning data is observed, and parameters are updated bit by bit. The parameters include a component of the centroid vector and an output probability density function defining the HMM.
The self-organization map (SOM), proposed by T. Kohonen, is well known as online learning. In the SOM learning, the weight of the link between an input layer and an output layer is updated (corrected) bit by bit.
In the SOM, the output layer has a plurality of nodes, and each node of the output layer is provided with a link weight representing the degree of link with the input layer. If the link weight serves as a vector, the vector quantization learning can be performed.
More specifically, from among the nodes of the output layer of the SOM, the node whose vector as the link weight has the shortest distance to the vector as the learning data is determined as a winner node. The vector as the link weight of the winner node is updated so that it becomes close to the vector as the learning data. The link weight of each node in the vicinity of the winner node is also updated so that it becomes a bit closer to the learning data. As the learning progresses, nodes are arranged in the output layer so that nodes having similar vectors as link weights are close to each other while nodes not similar to each other are far apart from each other. A map corresponding to the patterns contained in the learning data is thus organized. This process, which produces a map in which similar nodes (namely, nodes having similar vectors as link weights) lie in close vicinity, is referred to as self-organization.
The vector of the link weight obtained as a result of learning is considered as a centroid vector arranged in the signal space. In the K-means clustering technique, only the vector closest in distance to the learning data is updated, and the method of updating in that way is referred to as winner-take-all (WTA). In contrast, in the SOM learning, not only the node closest to the learning data (winner node) but also the nodes in the vicinity of the winner node are updated in link weight. This method of updating is referred to as soft-max adaptation (SMA). The learning results of the WTA learning tend to fall into local solutions, while the SMA learning alleviates this problem.
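The difference between the two update methods can be sketched with a single online update on a one-dimensional chain of nodes. The Gaussian neighborhood function, the learning rate, the radius, and the name `som_update` are assumptions chosen for illustration. Setting the radius to zero reduces the update to WTA.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def som_update(weights, sample, learning_rate=0.5, radius=1.0):
    """Apply one online update in place and return the index of the winner node."""
    winner = min(range(len(weights)), key=lambda i: squared_distance(weights[i], sample))
    for i, weight in enumerate(weights):
        if radius > 0:
            # Assumed Gaussian neighborhood: 1 at the winner, decaying along the chain.
            influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
        else:
            # Radius zero: only the winner node is updated (WTA).
            influence = 1.0 if i == winner else 0.0
        weights[i] = [w + learning_rate * influence * (s - w) for w, s in zip(weight, sample)]
    return winner


# SMA: the winner node and the nodes in its vicinity move toward the sample.
sma_weights = [[0.0], [1.0], [2.0]]
sma_winner = som_update(sma_weights, [2.5])

# WTA: only the winner node moves.
wta_weights = [[0.0], [1.0], [2.0]]
wta_winner = som_update(wta_weights, [2.5], radius=0.0)

if __name__ == "__main__":
    print(sma_winner, sma_weights)
    print(wta_winner, wta_weights)
```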
The SOM learning is described in the paper entitled “Self-Organizing Feature Maps” authored by T. Kohonen, Springer-Verlag Tokyo.
The above SOM and the neural gas algorithm provide unsupervised learning applicable to a vector as a static signal pattern, namely, data having a fixed length. The SOM cannot be directly applied to time series data such as voice data because voice data is variable in length and dynamic in signal pattern.
In one proposed technique, a higher dimensional vector is defined by connecting consecutive vector series (with consecutive vector components handled as one vector component), and time series vectors as time series data are thus handled as a static signal pattern. Such a technique cannot be directly applied to variable-length time series data, such as voice data.
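A minimal sketch of this concatenation is given below, assuming a fixed window of consecutive vectors. The window size and the name `stack_frames` are hypothetical.

```python
def stack_frames(series, window):
    """Concatenate each run of `window` consecutive vectors into one longer vector."""
    return [
        [component for frame in series[start:start + window] for component in frame]
        for start in range(len(series) - window + 1)
    ]


# Hypothetical time series of four 2-dimensional vectors.
series = [[1, 2], [3, 4], [5, 6], [7, 8]]
stacked = stack_frames(series, window=2)

if __name__ == "__main__":
    print(stacked)
```

Every stacked vector has the same dimension (window size times the original dimension), but the number of stacked vectors still depends on the length of the series, so the series as a whole is not reduced to a single fixed-length pattern.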
An HMM technique is available as one of the widely used techniques for pattern recognition of time series data, such as recognizing voice data in voice recognition (as disclosed by Lawrence Rabiner and Biing-Hwang Juang in the book entitled “Fundamentals of Speech Recognition” NTT Advanced Technologies).
The HMM is one of the state transition probability models having state transitions. As previously discussed, the HMM is defined by a state transition probability and an output probability density function at each state. In the HMM technique, statistical characteristics of time series data to be learned are modeled. A mixture of normal distributions is used as the output probability density function defining the HMM. The Baum-Welch algorithm is widely used to infer parameters of the HMM (the parameters are the state transition probability and the output probability density function).
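How the parameters of an HMM assign a likelihood to time series data of any length can be sketched with the forward algorithm. As a simplification, the sketch assumes discrete observation symbols with an output probability table in place of a mixture of normal distributions. The two-state parameters and the name `sequence_likelihood` are hypothetical. Parameter inference by the Baum-Welch algorithm is not shown.

```python
def sequence_likelihood(observations, initial, transition, emission):
    """Forward algorithm: probability that the HMM outputs the observed sequence."""
    n_states = len(initial)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical left-to-right HMM with two states and two output symbols.
initial = [1.0, 0.0]                  # always starts in state 0
transition = [[0.5, 0.5],             # state 0 stays or moves on to state 1
              [0.0, 1.0]]             # state 1 transitions only to itself
emission = [[1.0, 0.0],               # state 0 outputs symbol 0
            [0.0, 1.0]]               # state 1 outputs symbol 1

if __name__ == "__main__":
    print(sequence_likelihood([0, 0, 1], initial, transition, emission))
    print(sequence_likelihood([1, 0], initial, transition, emission))
```

The same parameters score sequences of different lengths, which is what makes the HMM suitable for variable-length time series data.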
The HMM technique finds applications in a wide range from isolated word recognition, already put to practical use, to large vocabulary recognition. The HMM learning is typically supervised learning, and as shown in FIG. 5, learning data with a correct-answer label attached thereto is used in learning. The HMM learning for recognizing a word is performed using learning data corresponding to that word (voice data obtained as a result of pronunciation of that word).
The HMM learning is supervised learning, and performing the HMM learning on learning data having no correct-answer label attached thereto is difficult, i.e., unsupervised HMM learning is difficult.