1. Field of the Invention
The present invention relates to a neural network element configured to learn contents of an output signal output in response to an input signal, and to a learning scheme using the neural network element.
2. Description of the Related Art
Conventional methods for learning the input-output relation of a neural network include back-propagation, the self-organizing map, Hebbian learning, and TD (temporal difference) learning. These learning methods enable a neural network element (neuron), which is a component of the neural network, to learn the correspondence between an output and its relevant input pattern. In many cases, however, the learning process, such as updating a coupling coefficient between elements, depends only on whether the input and the output fire at the same time at the present moment, and past history is not reflected in the learning. In other words, most conventional methods have no theoretical basis for learning temporal information. Among the aforesaid methods, only TD learning addresses the learning of temporal information (with regard to TD learning, refer to, for example, A. G. Barto et al., “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, pp. 834-846, 1983).
However, TD learning has the problem that learning takes a long time. In TD learning, the concept of eligibility is introduced to acquire temporal information. Since the eligibility, which is similar to a history of input signals, depends on the simultaneous occurrence of the input signal and the output signal, no new history is stored unless an input is present at the time the neural network element fires. In other words, the eligibility is incomplete as temporal information of the input signal; for example, a long-term firing process is not stored. Moreover, since the update formula of the coupling coefficient depends on both a reinforcement signal and the eligibility, the coupling coefficient is reinforced by the reinforcement signal alone, as long as there was firing synchronous with the input in the past, even if the element does not actually fire at that moment. For these reasons, many learning steps are needed at the beginning of the learning.
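The dependence of the eligibility on coincident input and output may be illustrated by the following minimal sketch. The update forms, the function name run_eligibility, and the parameters decay and lr are assumptions chosen for illustration in the style of Barto et al.; they are not taken from the present text.

```python
def run_eligibility(inputs, outputs, rewards, decay=0.9, lr=0.5):
    """Track an eligibility trace e and a coupling coefficient w over time.

    Assumed (illustrative) update rules:
        w <- w + lr * r * e
        e <- decay * e + (1 - decay) * x * y
    """
    e, w = 0.0, 0.0
    trace = []
    for x, y, r in zip(inputs, outputs, rewards):
        # Reinforcement acts on whatever eligibility remains from the past,
        # whether or not the element fires at this step.
        w += lr * r * e
        # The eligibility grows only when input x and output y coincide.
        e = decay * e + (1.0 - decay) * x * y
        trace.append((e, w))
    return trace


# Input is present at steps 0 and 1, the element fires only at step 1,
# and the reinforcement signal arrives at step 3.
trace = run_eligibility([1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1])
for step, (e, w) in enumerate(trace):
    print(step, round(e, 4), round(w, 4))
```

In this sketch the input at step 0 leaves no eligibility because the element did not fire, while the coupling coefficient is still reinforced at step 3, where there is neither input nor firing.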
Shigematsu et al. propose a neural network element capable of learning temporal information (see Japanese Patent Application Publication (JP-B) No. 7-109609). In this model, temporal information of the input is stored in a history irrespective of the firing of the output. Since the coupling weight is updated only at the moment of firing, this model is characterized by its learning efficiency.
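The two properties described above, namely a history stored irrespective of firing and a weight update performed only at the moment of firing, may be sketched as follows. The concrete update forms, the function name run_input_history, and the parameters decay and lr are hypothetical and are not taken from JP-B No. 7-109609.

```python
def run_input_history(inputs, fires, decay=0.9, lr=0.5):
    """Hypothetical element: the history h accumulates the input at every
    step, and the coupling weight w changes only when the element fires."""
    h, w = 0.0, 0.0
    out = []
    for x, fired in zip(inputs, fires):
        # The history is stored irrespective of the output.
        h = decay * h + x
        # The coupling weight is updated only at the moment of firing.
        if fired:
            w += lr * h
        out.append((h, w))
    return out


# Input is present at steps 0 and 1; the element fires only at step 3.
for step, (h, w) in enumerate(run_input_history([1, 1, 0, 0], [0, 0, 0, 1])):
    print(step, round(h, 4), round(w, 4))
```

Under these assumptions the inputs at steps 0 and 1 remain in the history although no firing accompanied them, and the single firing at step 3 is the only step at which the weight changes.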
However, since the learning of temporal information in the neural network model of Shigematsu et al. depends only on the history of the input and the output, the direction of the learning cannot be controlled when the model is used in a system. In addition, since the temporal information history depends only on the input and an attenuation coefficient of the history, the learning of information along the temporal direction progresses ambiguously. A further problem is that the learning method of the model of Shigematsu et al. cannot be applied to reinforcement learning, which is the mainstream learning control scheme for dealing with temporal information.
Meanwhile, learning for generating a desired output signal from an input signal is roughly classified into two categories: “supervised learning” and “unsupervised learning”. “Supervised learning” is a method of advancing the learning of the system by giving a desirable output signal for the input signal as a teacher signal, and includes the back-propagation method and the like. “Unsupervised learning” is a method of performing the learning by using only the input signal. The system learns so as to generate a similar output signal when a signal used in the learning, or a signal similar thereto, is input.
When a learning control system is operated in an actual environment, it is difficult to give an appropriate teacher signal for the input signal in advance, because the complexity and nonlinearity of the input-output relationship make such a prior definition complicated. Therefore, unsupervised learning is suitable for use of the learning control system in the actual environment. Reinforcement learning, which advances the learning based on the reward acquired as a result of the system's own actions, has been widely used because the learning direction can be controlled by the manner in which the reward is given.
In reinforcement learning, the system itself can act in an exploratory manner to advance the learning in the direction of acquiring more reward. Conversely, however, since the learning depends on the reward acquired as a result of exploration, reinforcement learning tends to fall into a local minimum. That is, reinforcement learning is effective in a relatively simple domain, but as the variety of inputs and outputs increases, it becomes highly likely that the learning will not converge in the best direction.
Doya et al. propose an architecture in which a state predictor is added to a plurality of Actor-Critics, as an improvement of the Actor-Critic architecture, which is a reinforcement learning method (see Japanese Patent Application Laid-Open (JP-A) No. 2000-35956). Learning is performed so that the state predictor predicts possible circumstances and the optimal action is chosen according to the predicted state. Further, by calculating levels of responsibility of the plurality of Actor-Critics and using them in the learning, the range of action is expanded to address the local minimum problem.
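The present text does not state how the levels of responsibility are calculated. As one possible illustration only, the following sketch assumes that each Actor-Critic has an associated state prediction error and that the responsibility is a normalized exponential of the negative squared error. The function names responsibilities and scaled_updates and the parameter sigma are hypothetical.

```python
import math


def responsibilities(prediction_errors, sigma=1.0):
    """Assumed form: softmax over -(error^2) / (2 * sigma^2), one value per module."""
    scores = [-(err ** 2) / (2.0 * sigma ** 2) for err in prediction_errors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def scaled_updates(learning_signals, weights):
    """Scale each module's learning signal by its level of responsibility."""
    return [w * s for s, w in zip(learning_signals, weights)]


# Three hypothetical modules; the first predicts the next state best.
resp = responsibilities([0.1, 1.0, 2.0])
print([round(r, 3) for r in resp])
print([round(u, 3) for u in scaled_updates([1.0, 1.0, 1.0], resp)])
```

Under this assumption, a module whose state prediction is more accurate receives a larger share of the learning, and the shares over all modules sum to one.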