1. Field of the Invention
The present invention relates to a data processing apparatus, a data processing method, and a computer program, and, more particularly, to a data processing apparatus, a data processing method, and a computer program that allow an agent such as a robot to set a target within a generalizable range and quickly take an action for attaining the target.
2. Description of the Related Art
For example, as a method of causing an agent capable of acting, such as a virtual character displayed on a display or a robot in the real world, to adapt to an environment and act, there is a method in which a designer of the agent programs the actions of the agent on the assumption of a particular environment.
However, with the method of programming actions of the agent, the agent may be unable to take actions not programmed by the designer.
Therefore, there is proposed a method of providing, in an agent, a learning device that learns actions of the agent according to the environment around the agent and acquiring actions adapted to the environment using the learning device (see, for example, JP-A-2006-162898).
As a method of causing the learning device to acquire actions, for example, there is a method of explicitly and directly teaching actions to the learning device offline. “Explicitly and directly teaching actions” means, for example, an action of a user placing a ball in front of a robot as an agent and directly moving arms of the robot to roll the ball to the left and right.
However, with the method of explicitly and directly teaching actions, it is difficult to adapt the robot to a dynamic environment and the like (including an unknown (unlearned) environment) that change every moment.
To allow the agent to take actions adapted to the dynamic environment and the like, (the learning device of) the agent needs to appropriately collect new learning data for learning actions and perform learning using the learning data to acquire a new action.
In other words, the agent itself needs to repeatedly autonomously take a new action and learn a result of the action to thereby acquire a model for taking actions adapted to an environment.
As a method with which an agent autonomously acquires a new action, there is a method of searching for an action such that sensor data representing a state of an environment, which is obtained by sensing the environment with a sensor for sensing a physical quantity, reaches a target value and learning a result of the action (see, for example, Toshitada Doi, Masahiro Fujita, and Hideki Shimomura, “Intelligence Dynamics 2, Intelligence having a Body, Coevolution of Brain Science and Robotics”, Springer Japan KK, 2006) (Non-Patent Document 1).
With the method described in Non-Patent Document 1, a target value of sensor data is determined and the agent repeatedly searches for an action in which the sensor data reaches the target value. Consequently, the agent autonomously develops (a model learned by) a learning device and acquires a new action in which a target value is obtained.
To acquire (a model for taking) a new action, (the learning device of) the agent needs to have a function of learning new learning data for taking the new action (a function for performing so-called additional learning).
Further, to acquire a new action, the agent needs to have a first function of searching for an action for attaining an unknown target, i.e., an action in which sensor data reaches a target value, and a second function of determining a target necessary for the agent to expand its actions, i.e., determining a target value of sensor data.
As a representative method of realizing the first function, there is reinforcement learning. In reinforcement learning, an ε-greedy method is generally used to search for an action. In the ε-greedy method, a random action is selected with a certain probability ε, and the best action taken in the past (e.g., an action in which sensor data closest to the target value is obtained) is selected with probability 1-ε.
Therefore, in the ε-greedy method, the search for a new action is performed by so-called random search.
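The ε-greedy selection described above can be sketched as follows. This is a minimal illustration, not code from the present invention; the action set, the target sensor value of 5.0, and the closeness-based score are illustrative assumptions.

```python
import random

def epsilon_greedy_select(actions, value_of, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon; otherwise pick
    the action whose past result was best (highest value_of score)."""
    if rng.random() < epsilon:
        return rng.choice(actions)      # exploratory random action
    return max(actions, key=value_of)   # best action found so far

# Toy usage (assumed setup): three past actions, each associated with
# the sensor reading it produced; "value" is closeness to a target
# sensor value of 5.0.
past_sensor_reading = {0: 1.0, 1: 4.5, 2: 9.0}
actions = list(past_sensor_reading)
value = lambda a: -abs(past_sensor_reading[a] - 5.0)

chosen = epsilon_greedy_select(actions, value, epsilon=0.0)
print(chosen)  # with epsilon=0.0 the greedy action 1 is chosen
```

With ε=0 the choice is purely greedy; with ε=1 it is purely random, which is the regime the text refers to as random search.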
In the search for an action, action data for causing the agent to act is generated and the agent takes an action according to the action data. The action data is, for example, data for driving an actuator that moves parts of a body such as arms and legs of a robot. The action data includes data used for causing the agent to take various actions such as data for causing a light source equivalent to an eye to emit light and data for generating a composite tone as sound.
Concerning an agent that performs only a simple action, action data for causing the agent to act is discrete values that take a small number of values.
For example, concerning an agent that moves to a destination while selecting one of two paths selectable at a branch point of the paths, action data for the agent to take an action of selecting a path at the branch point is, for example, discrete values that take two values, 0 and 1, representing the two paths.
Besides, action data for causing an agent that performs only a simple action to act is, for example, low-dimensional data (a vector) and time series data of small length.
On the other hand, action data for causing an agent that can take a complicated action to act consists of discrete values that take a large number of values or of continuous values. Further, such action data is high-dimensional and is time series data of large length.
When the action data consists of continuous values (including discrete values that take a large number of values), is high-dimensional, or is time series data of large length, the random search performed by the ε-greedy method requires extremely many trials (searches for actions) before the agent becomes capable of taking an action for attaining a target.
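To see why the number of trials explodes, consider a toy calculation (not from the original text): if action data is a d-dimensional vector drawn uniformly from [0, 1]^d and a trial succeeds only when every component lands within a small tolerance of a target in the interior of the cube, the hit probability shrinks exponentially with the dimension d.

```python
# Toy illustration of the curse of dimensionality in random search.
# With tolerance ±0.05 per component, a single component hits with
# probability 0.1, so a d-dimensional action hits with probability
# (0.1)^d and the expected number of random trials grows as 10^d.
def expected_random_trials(d, tolerance=0.05):
    hit_probability = (2 * tolerance) ** d
    return 1.0 / hit_probability

print(expected_random_trials(1))   # 10 trials for a scalar action
print(expected_random_trials(10))  # ~10 billion trials in 10 dimensions
```

The same blow-up applies along the time axis: a long time series of action vectors multiplies the number of components that must all be right at once.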
In Non-Patent Document 1, a target value of sensor data is determined at random, and a search for an action (action data) in which sensor data of the target value is obtained is performed according to a search algorithm called A*.
In Non-Patent Document 1, a method of setting a target at random is adopted as a method of realizing the second function, i.e., determining a target value of sensor data necessary for the agent to expand its actions.
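The text does not detail how Non-Patent Document 1 applies A*, but the general algorithm can be sketched as follows. The grid state space, unit step costs, and Manhattan-distance heuristic below are illustrative assumptions, not details from that document.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Generic A*: returns a lowest-cost path from start to goal.
    neighbors(n) yields (next_node, step_cost); the path is optimal
    when heuristic(n) never overestimates the remaining cost."""
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]))
    return None  # goal unreachable

# Toy usage: 4-connected moves on an obstacle-free 5x5 grid.
def grid_neighbors(p):
    x, y = p
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 5 and 0 <= ny < 5:
            yield (nx, ny), 1

manhattan = lambda p: abs(p[0] - 4) + abs(p[1] - 4)
path = a_star((0, 0), (4, 4), grid_neighbors, manhattan)
print(len(path))  # 9 nodes: the shortest path takes 8 moves
```

Unlike the purely random ε-greedy search, A* uses the heuristic to direct the search toward the target, which is why it was used for the first function.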
However, when a target is set at random, in some cases a target that the agent cannot attain is set. In an attempt to attain such an unattainable target, virtually useless search (search for an action) may be performed.
In other words, when a target is set at random, a target exceeding the range over which the model of the agent can generalize may be set.
Further, there is proposed a method of predicting an error of a prediction value of sensor data and realizing a curiosity motive for searching for a new action by using a prediction value of the error as a part of the input to a function approximator serving as a model for learning an action, with the target of setting the prediction value of the error to a maximum value of 1.0 (see, for example, J. Tani, "On the Dynamics of Robot Exploration Learning," Cognitive Systems Research, Vol. 3, No. 3, pp. 459-470, (2002)).
However, when the target is to set the prediction value of the error to its maximum value of 1.0, i.e., when the target value is set to a certain fixed value, the target value may exceed the generalization performance of the function approximator.
For example, when learning of the function approximator is performed using time series data of two patterns as learning data, the memories of the two patterns interfere with each other in the function approximator and are shared. As a result, the function approximator can generate time series data not learned in the past, for example, time series data of a pattern intermediate between the two patterns used as the learning data.
This ability of the function approximator to share memories of plural patterns is generalization. With the generalization, the function approximator can generate time series data of a pattern similar to a pattern used as the learning data.
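The interpolation described above can be illustrated with a toy numerical example. The two sinusoidal patterns and the per-step linear model below are assumptions for illustration; a real function approximator (e.g., a recurrent network) shares memories and interpolates only approximately, but the idea is the same.

```python
import math

# Two learned patterns (illustrative): time series of length 8,
# labeled with a pattern parameter p = 0 and p = 1.
pattern_a = [math.sin(0.5 * t) for t in range(8)]        # pattern p = 0
pattern_b = [math.sin(0.5 * t + 1.0) for t in range(8)]  # pattern p = 1

# A minimal stand-in for a function approximator: at each time step
# the output is the exact linear fit through the two training points
# (p=0, a_t) and (p=1, b_t).
def generate(p):
    return [(1 - p) * a + p * b for a, b in zip(pattern_a, pattern_b)]

# A pattern never given as learning data: midway between the two.
intermediate = generate(0.5)
print(intermediate[0])  # equals (pattern_a[0] + pattern_b[0]) / 2
```

Querying p = 0.5 yields a time series that was never learned, which is the sense in which generalization lets the approximator generate patterns similar to, but distinct from, the learning data.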
However, even the function approximator having the generalization ability may not be able to generate time series data of a pattern exceeding performance of the generalization, i.e., for example, time series data of a pattern completely different from the pattern used as the learning data.
Therefore, when the target value exceeds the generalization performance of the function approximator, it is difficult to perform a search that makes use of the generalization of the function approximator. As a result, it may be difficult to search for an action of moving closer to the target value. The same holds true when a target is set at random as described in Non-Patent Document 1.
There is also proposed an agent that, in order to expand its action range, acts to increase an error from a memory of a past action while reducing an error from a memory of an action for returning to a home position so that the agent does not enter an unknown area too deeply (see, for example, JP-A-2002-239952).
The agent described in JP-A-2002-239952 predicts sensor data and action data that would be obtained if the agent acts according to action plan data and calculates a reward with respect to the sensor data and the action data. Action plan data for taking an action is selected out of action plan data stored in the agent in advance. The agent acts according to the selected action plan data.
Therefore, it is difficult for the agent described in JP-A-2002-239952 to take an action other than actions conforming to the action plan data stored in advance.