1. Field of the Invention
The present invention relates to a data processing apparatus and, in particular, to a problem solver that receives an initial state and a goal state in a state space defined by a particular problem and generates an action sequence to be executed from the initial state to the goal state.
2. Description of the Related Art
A general problem solver (GPS) is well known as a problem solver that receives an initial state and a goal state in a state space defined by a particular problem and generates an action sequence to be executed from the initial state to the goal state.
As shown in FIG. 1, the GPS has a model for applying an action (t) to a particular state (t) (namely, performing the action (t) in the state (t)) and predicting the resultant state (t+1). This model is referred to as a world model or a forward model.
When the GPS generates an action sequence, the GPS obtains the difference between an initial state and a goal state from a state space defined by a problem. In addition, the GPS selects an action (operator) so that the difference becomes small.
Thereafter, as the next sub goal, the GPS works to satisfy the application conditions of the selected action so that the action can be applied. The GPS then repeatedly detects the difference between the current state and the goal state and selects an appropriate action. When no difference remains, the GPS has obtained an action sequence (namely, a plan) to be executed from the initial state to the goal state. Finally, the GPS executes the plan.
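The difference-reduction loop described above can be sketched as follows. This is a minimal illustration, not the GPS implementation: the two-dimensional grid world, the names forward_model, difference, and plan, and the use of Manhattan distance as the difference measure are all assumptions introduced here. Sub goal handling for application conditions is omitted because every action is always applicable in this toy problem.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]

# Hypothetical operators for a 2-D grid world (not taken from the source).
ACTIONS: Dict[str, State] = {
    "right": (1, 0),
    "left": (-1, 0),
    "up": (0, 1),
    "down": (0, -1),
}


def forward_model(state: State, action: str) -> State:
    """Predict state(t+1) from state(t) and action(t)."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def difference(state: State, goal: State) -> int:
    """Difference between a state and the goal (Manhattan distance, an assumption)."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


def plan(initial: State, goal: State, max_steps: int = 100) -> List[str]:
    """Repeatedly select the action whose predicted result is closest to the goal."""
    state = initial
    actions: List[str] = []
    while difference(state, goal) > 0:
        if len(actions) >= max_steps:
            raise RuntimeError("no plan found within max_steps")
        # Evaluate every action through the forward model and keep the best one.
        best = min(ACTIONS, key=lambda a: difference(forward_model(state, a), goal))
        actions.append(best)
        state = forward_model(state, best)
    return actions
```

For example, plan((0, 0), (2, 1)) returns ['right', 'right', 'up']. Every step evaluates all actions through the forward model, which suggests why the calculation cost grows as the state space and the set of operators grow.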
When an action sequence is generated using such a model (this process is also referred to as classical planning), the number of calculations becomes large (in other words, the calculation cost becomes high).
In addition, an immediate action deciding method (referred to as the reactive planning method) is known. As shown in FIG. 2, the reactive planning method can be applied to a system that requires real-time operation, because an action (t) (abbreviated as a(t)) is obtained directly from a particular state (t) (abbreviated as s(t)). In this method, although the calculation cost for executing an action is low, the action cannot be changed when the goal state is changed. In other words, this method does not have flexibility.
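The direct mapping from s(t) to a(t) can be pictured as a table lookup. The sketch below is only illustrative: the five-state corridor, the POLICY table, and the function name react are hypothetical and do not come from the source.

```python
from typing import Dict

# Hypothetical fixed policy for a corridor with states 0..4, built for goal state 4.
POLICY: Dict[int, str] = {
    0: "right",
    1: "right",
    2: "right",
    3: "right",
    4: "stay",
}


def react(state: int) -> str:
    """Return a(t) directly from s(t) with one lookup; there is no search."""
    return POLICY[state]
```

Note that react takes no goal argument. The goal is baked into the table, so a different goal state would require a different table, which is the inflexibility described above.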
As a technology for solving such a problem, the reinforcement learning method is known. In this method, a goal state is represented as a reward (or an effect), the middle of an action sequence is predictively evaluated through learning, and the action that is executed is changed corresponding to the learnt result.
As a typical algorithm of the reinforcement learning method, the Q-learning method is known. In the Q-learning method, the mapping from a state s(t) to an action a(t+1) is changed corresponding to a reward obtained from the outside. As shown in FIG. 3, in the Q-learning method, a Q value Q(s(t)+a(t)) as a predictive reward corresponding to an action a(t+1) in a state s(t) is predicted by a Q module. An action a(t+1) is selected so that the Q value (predictive reward) becomes large. As a result, an action can be reasonably selected.
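A minimal sketch of tabular Q-learning may make the role of the Q value concrete. It uses the standard textbook update rule rather than anything specified in the source; the five-state corridor, the constants ALPHA and GAMMA, and the function names step, train, and greedy_action are assumptions. For a deterministic result, the sketch sweeps over every state and action instead of sampling episodes.

```python
from typing import Dict, Tuple

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, reward on reaching state 4
ACTIONS = (-1, +1)         # move left or move right
ALPHA, GAMMA = 0.5, 0.9    # assumed learning rate and discount factor

QTable = Dict[Tuple[int, int], float]


def step(s: int, a: int) -> Tuple[int, float]:
    """Environment: return the next state and the reward obtained from the outside."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def train(sweeps: int = 50) -> QTable:
    """Learn Q values by repeatedly applying the Q-learning update to every pair."""
    q: QTable = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s == GOAL:
                continue  # the goal is treated as terminal, so its Q values stay 0
            for a in ACTIONS:
                s_next, reward = step(s, a)
                target = reward + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
                q[(s, a)] += ALPHA * (target - q[(s, a)])
    return q


def greedy_action(q: QTable, s: int) -> int:
    """Select the action with the largest Q value (predictive reward)."""
    return max(ACTIONS, key=lambda a: q[(s, a)])
```

After training, greedy_action selects +1 (move right) in every non-goal state. Because the reward is tied to state 4, the learnt table encodes that particular goal; moving the reward elsewhere would require learning again.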
The actor-critic model is another well-known model of the reinforcement learning method. In the actor-critic model, a V value, value(s(t)), is predicted by a critic module as a predictive reward corresponding to a state s(t). Corresponding to the error of the obtained predictive reward, the selection probability of the action that is executed is changed by an actor module.
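The interaction between the critic and the actor can be sketched as a single update step. This follows a common textbook formulation (a temporal-difference error driving both modules, with softmax action selection), which the source does not specify; the constants and the names action_probabilities and update are assumptions.

```python
import math
from typing import Dict, List

GAMMA = 0.9          # assumed discount factor
ALPHA_CRITIC = 0.5   # assumed critic learning rate
ALPHA_ACTOR = 0.5    # assumed actor learning rate


def action_probabilities(prefs: List[float]) -> List[float]:
    """Actor: a softmax over action preferences gives the selection probabilities."""
    largest = max(prefs)
    exps = [math.exp(p - largest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def update(values: Dict[int, float], prefs: Dict[int, List[float]],
           s: int, a: int, reward: float, s_next: int, terminal: bool) -> float:
    """Apply one actor-critic step in place and return the prediction error."""
    v_next = 0.0 if terminal else values[s_next]
    error = reward + GAMMA * v_next - values[s]  # error of the critic's predictive reward
    values[s] += ALPHA_CRITIC * error            # critic: correct the V value of s
    prefs[s][a] += ALPHA_ACTOR * error           # actor: shift the preference for action a
    return error
```

As a usage example, starting from all-zero tables, calling update for a transition from state 3 to a terminal goal with reward 1.0 yields an error of 1.0, raises the V value of state 3 to 0.5, and makes the executed action more probable than the alternative.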
Thus, in any reinforcement learning model, an action that is executed can be quickly decided.
For details about the planning method, refer to “Artificial Intelligence: A Modern Approach”, Russell, S. J. and Norvig, P., Prentice-Hall International, Inc. For details about the reinforcement learning method, refer to “Reinforcement Learning: A Survey”, Kaelbling, L. P., Littman, M. L., and Moore, A. W., J. Artificial Intelligence Research, Vol. 4, pp. 237-285 (1996).
As described above, the classical planning method has the problem of a high calculation cost for executing an action.
Although the reactive planning method reduces the calculation cost of executing an action compared with the classical planning method, the action cannot be changed when the goal state is changed. Thus, the reactive planning method does not have flexibility.
In addition, the reinforcement learning method allows an action to be changed when a goal state is changed. However, when the goal state is changed, the learnt results basically cannot be reused, so the learning process must be repeated. Since the learning amount (cost) necessary for a given goal state is large, changes of the goal state are restricted in practice. Thus, the flexibility of the reinforcement learning method is low.
The present invention is made from the above-described point of view. An object of the present invention is to provide a problem solver that allows the calculation cost in executing an action to be reduced and the flexibility against a change of a goal state to be improved.
The present invention is a problem solver for generating an action sequence executed from an initial state to a goal state in a state space defined by a particular problem.
The problem solver of a first aspect of the present invention comprises: a cognitive distance learning unit learning a cognitive distance that represents a cost acted on the environment of the state space, the cost being spent in an action sequence executed from a first state in the state space to a second state that is different from the first state; and a next action deciding unit deciding, based on the cognitive distance learnt by the cognitive distance learning unit, a next action contained in the action sequence that has to be executed in a particular state to attain the goal state in the state space.
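One way to picture the first aspect is a table of costs learnt from experienced action sequences, consulted by a simple comparison. The sketch below is an assumption-laden illustration, not the claimed implementation: it takes the cost to be the number of steps, keys the table by (state, action, target state), and uses the hypothetical names learn_distances and next_action.

```python
from typing import Dict, Hashable, Optional, Sequence, Tuple

Key = Tuple[Hashable, Hashable, Hashable]  # (state, action, target state)
INF = float("inf")


def learn_distances(states: Sequence[Hashable], actions: Sequence[Hashable],
                    table: Optional[Dict[Key, float]] = None) -> Dict[Key, float]:
    """Update the distance table from one experienced sequence.

    states must hold len(actions) + 1 entries; actions[i] leads from
    states[i] to states[i + 1]. The cost is assumed to be the step count.
    """
    table = {} if table is None else table
    for i, action in enumerate(actions):
        for j in range(i + 1, len(states)):
            if states[j] == states[i]:
                continue  # the second state must differ from the first
            key = (states[i], action, states[j])
            cost = j - i
            if cost < table.get(key, INF):
                table[key] = cost  # keep the smallest cost seen so far
    return table


def next_action(table: Dict[Key, float], state: Hashable, goal: Hashable,
                actions: Sequence[Hashable]) -> Hashable:
    """Decide the next action by comparing learnt distances to the goal."""
    return min(actions, key=lambda a: table.get((state, a, goal), INF))
```

For instance, after learning from a walk through states 0, 1, 2, 3, 4, 3, 2, 1, 0 (four "right" actions followed by four "left" actions), next_action returns "right" in state 1 for goal 4 and "left" in state 3 for goal 0. The same table serves both goals, so no relearning is needed when the goal changes.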
The problem solver of a second aspect of the present invention comprises: a cognitive distance learning unit learning a cognitive distance that represents a cost acted on the environment of the state space, the cost being spent in an action sequence executed from a first state in the state space to a second state that is different from the first state; and a next state deciding unit deciding, based on the cognitive distance learnt by the cognitive distance learning unit, a next state reachable in the execution of a particular action contained in the action sequence, the particular action having to be executed in a particular state to attain the goal state in the state space.
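The second aspect can be pictured in a similar way, with the decision made over reachable next states rather than actions. The following sketch rests on assumptions not stated in the source: the cost is the number of steps, the distance table is keyed by (state, target state), directly reachable states are those observed as immediate successors, and the names learn and next_state are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

INF = float("inf")
Distances = Dict[Tuple[int, int], int]
Successors = Dict[int, Set[int]]


def learn(states: Sequence[int]) -> Tuple[Distances, Successors]:
    """Learn state-to-state distances and observed successors from one sequence."""
    dist: Distances = {}
    succ: Successors = {}
    for i, s in enumerate(states):
        if i + 1 < len(states):
            succ.setdefault(s, set()).add(states[i + 1])
        for j in range(i + 1, len(states)):
            if states[j] != s:  # the second state must differ from the first
                key = (s, states[j])
                dist[key] = min(dist.get(key, j - i), j - i)
    return dist, succ


def next_state(dist: Distances, succ: Successors,
               state: int, goal: int) -> Optional[int]:
    """Decide the reachable next state with the smallest learnt distance to the goal."""
    candidates = sorted(succ.get(state, ()))  # sorted so that ties break predictably
    if not candidates:
        return None  # nothing is known to be reachable from this state

    def cost(candidate: int) -> float:
        return 0 if candidate == goal else dist.get((candidate, goal), INF)

    return min(candidates, key=cost)
```

After learning from the state sequence 0, 1, 2, 3, 4, 3, 2, 1, 0, next_state returns 2 from state 1 when the goal is 4, and 2 from state 3 when the goal is 0. The decision is a comparison of a few table entries, which keeps the cost of deciding low.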
Since the present invention has the structures of the first and second aspects, the cognitive distance learning unit learns a cognitive distance from any state to a goal state in a state space. Based on the learnt result of the cognitive distance, the cognitive distance learning unit generates an action sequence. Unlike the predictive evaluation of the reinforcement learning method, when a goal state is changed while the cognitive distance is being learnt, the changed goal state is immediately reflected in the learnt result. The next action deciding unit can decide the next action by simply comparing cognitive distances as the learnt result of the cognitive distance learning unit. The next state deciding unit can decide the next state by simply comparing cognitive distances as the learnt result of the cognitive distance learning unit. Thus, the calculation cost for executing an action can be suppressed. In addition, the flexibility against a changed goal state can be secured.