The present invention relates to an agent learning machine. More particularly, the invention relates to a novel, highly adaptable agent learning machine which, in the control of a physical system such as a robot, an automobile or an airplane, or in a program that carries out information search, user response, resource allocation, market dealing and the like in place of a human being, can deal with nonlinearity or nonstationarity of an environment such as a control object or a system, and can switch or combine, without using prior knowledge, behaviors optimum to various states or modes of the environment, and thus can achieve flexible behavior learning.
Most conventional learning systems address the problem of “supervised learning”, that is, how to realize a desired output, and a time pattern thereof, specified by a human being. However, in many real-world problems the correct output is unknown, and thus the framework of supervised learning is not applicable.
A system which learns a desired output and a time series thereof by interacting with an environment such as a control object through trial and error, without being taught specifically what the correct output is, has been researched and developed under the framework of “reinforcement learning” (refer to R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998). In general, as exemplified in FIG. 1, a reinforcement learning system comprises a state evaluator (B) for learning a state evaluation x(t) based on a reward r(t) provided from an environment (A), and an action generator (C) for learning an action output u(t) pertinent to the environment (A) based on the state evaluation x(t).
Heretofore, reinforcement learning algorithms have been applied to the control of a mobile robot or an elevator, the allocation of communication channels, and game-playing programs (refer to R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998). However, when the behavior of the environment (A) is nonlinear, there arises a problem in that the state evaluator (B) or the action generator (C) must approximate extremely complicated nonlinear functions, and a very long period of time is required for learning. Further, when the behavior of the environment (A) is nonstationary, learning of optimum behavior cannot be achieved.
Meanwhile, in the field of adaptive control, for a system having nonlinearity or nonstationarity such as a robot or an airplane, a method has frequently been used in which multiple control circuits tuned for different operational points or operational modes of the environment are prepared and switched as occasion demands. However, according to this conventional method, it is necessary to determine, based on prior knowledge, what control circuits are to be prepared and under what conditions they are to be switched. Thus, the conventional method lacks adaptability.
Various technologies have been proposed in order to resolve such problems.
For example, there is proposed, in R. A. Jacobs, et al., “Adaptive mixtures of local experts”, Neural Computation, 3, 79-87, 1990, a method of realizing approximation of complicated nonlinear functions by weighting and switching the outputs of a plurality of multiple layer neural networks according to the output of a further multiple layer neural network referred to as a gating circuit. Further, it has been proposed, in T. W. Cacciatore and S. J. Nowlan, “Mixtures of controllers for jump linear and non-linear plants”, Neural Information Processing Systems, 6, Morgan Kaufmann, 1994, in H. Gomi and M. Kawato, “Recognition of manipulated objects by motor learning with modular architecture networks”, Neural Networks, 6, 485-497, 1993, and also in Japanese Patent Laid-Open No. 06-19508 and Japanese Patent Laid-Open No. 05-297904, that a multiple layer neural circuit having a gating circuit (D) be applied to adaptive control as exemplified in FIG. 2. However, in reality, it is very difficult to make the learning of the respective modules and the learning of the gating circuit proceed cooperatively.
Furthermore, there has been proposed a nonlinear control by a pair of predicting circuits and a control circuit in K. Narendra, et al., “Adaptation and learning using multiple models, switching, and tuning”, IEEE Control Systems Magazine, June, 37-51, 1995. However, the control is carried out by the single module providing the least prediction error, and no consideration is given to a flexible combination of modules. Further, all of these proposals assume only a framework of supervised learning, and accordingly their applicable range is limited.
In K. Pawelzik, et al., “Annealed competition of experts for a segmentation and classification of switching dynamics”, Neural Computation, 8, 340-356, 1996, there has been proposed combination and switching of prediction modules based on a posterior probability of a signal source. However, no consideration is given to a combination with a control circuit.
The present invention has been made in view of the foregoing circumstances, and it is an object of the invention to resolve the problems of the conventional technologies and to provide a novel, highly adaptable agent learning machine which, in an environment having nonlinearity or nonstationarity such as a control object or a system, can switch or combine behaviors optimum to the states or operational modes of various environments without being provided with any specific teacher signal, and can perform behavior learning flexibly without using prior knowledge.
In order to solve the foregoing problems, the present invention provides an agent learning machine comprising a plurality of learning modules, each comprising a set of a reinforcement learning system for working on an environment and determining an action output for maximizing a reward provided as a result thereof, and an environment predicting system for predicting a change in the environment, wherein responsibility signals are calculated, each having a value such that the smaller the prediction error of the environment predicting system of the respective learning module, the larger the value, and the output of each reinforcement learning system is weighted in proportion to its responsibility signal to thereby provide the action to the environment.
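The weighting described above can be illustrated with a minimal sketch. The text only requires that a smaller prediction error give a larger responsibility signal; the Gaussian-style score, the normalization to a sum of one, the scalar errors and actions, and the names `responsibility_signals`, `combined_action` and `sigma` are all assumptions made for illustration, not the formula of the invention.

```python
import math


def responsibility_signals(prediction_errors, sigma=1.0):
    """Map each module's prediction error to a weight that grows as the error shrinks.

    Assumed form (not taken from the text): exp(-e^2 / (2 * sigma^2)),
    normalized so that the weights sum to one.
    """
    scaled = [-(e * e) / (2.0 * sigma * sigma) for e in prediction_errors]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(scaled)
    scores = [math.exp(s - largest) for s in scaled]
    total = sum(scores)
    return [s / total for s in scores]


def combined_action(module_actions, responsibilities):
    """Weight each module's action output by its responsibility signal and sum."""
    return sum(w * u for w, u in zip(responsibilities, module_actions))


# Hypothetical example: three modules, the second predicts the environment best.
errors = [0.9, 0.1, 1.5]
actions = [1.0, -0.5, 2.0]
weights = responsibility_signals(errors)
action = combined_action(actions, weights)
```

With these assumed inputs, the second module receives the largest weight, so the combined action is pulled toward its output while the other modules still contribute.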
According to a second aspect of the invention, there is provided the agent learning machine wherein learning of either or both of the reinforcement learning system and the environment predicting system of the learning module is carried out in proportion to the responsibility signal.
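As a rough sketch of learning carried out in proportion to the responsibility signal, the following scales a gradient step for a linear state predictor by that signal. The linear model, the squared-error gradient rule, and the names `weighted_predictor_update` and `learning_rate` are assumptions chosen for illustration; the text does not fix a particular update rule.

```python
def weighted_predictor_update(weights, features, target, responsibility,
                              learning_rate=0.1):
    """Return updated weights for one module's linear predictor.

    The usual gradient step on squared prediction error is multiplied by the
    module's responsibility signal, so a module that explains the current
    situation poorly (responsibility near zero) is left almost unchanged.
    """
    prediction = sum(w * f for w, f in zip(weights, features))
    error = target - prediction
    step = learning_rate * responsibility * error
    return [w + step * f for w, f in zip(weights, features)]


# Hypothetical example: the same observation updates two modules differently.
features = [1.0, 2.0]
target = 1.0
responsible_module = weighted_predictor_update([0.0, 0.0], features, target, 0.9)
idle_module = weighted_predictor_update([0.0, 0.0], features, target, 0.0)
```

Scaling the update in this way lets each module specialize in the situations it already predicts well, which is one plausible reading of the second aspect.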
According to a third aspect of the invention, there is provided the agent learning machine wherein a reinforcement learning algorithm or a dynamic programming algorithm is used in learning the reinforcement learning system.
According to a fourth aspect of the invention, there is provided the agent learning machine wherein a supervised learning algorithm is used in learning the environment predicting system.
According to a fifth aspect of the invention, there is provided the agent learning machine wherein the reinforcement learning system includes a state evaluator and an action generator.
According to a sixth aspect of the invention, there is provided the agent learning machine, wherein at least one of a linear model, a polynomial model and a multiple layer neural network is used as means for approximating a function of the state evaluator.
According to a seventh aspect of the invention, there is provided the agent learning machine wherein at least one of a linear model, a polynomial model and a multiple layer neural network is used as means for approximating a function of the action generator.
According to an eighth aspect of the invention, there is provided the agent learning machine wherein the environment predicting system includes either or both of a state predictor and a responsibility signal predictor.
According to a ninth aspect of the invention, there is provided the agent learning machine wherein at least one of a linear model, a polynomial model and a multiple layer neural network is used as means for approximating a function of the state predictor.
According to a tenth aspect of the invention, there is provided the agent learning machine wherein at least one of a linear model, a polynomial model and a multiple layer neural network is used as means for approximating a function of the responsibility signal predictor.