1. Field of the Invention
The present invention relates to intelligent controllers, and in particular, to intelligent controllers employing reinforcement learning techniques.
2. Description of the Related Art
Intelligent controllers are finding increasing use today to control complex systems. For example, such a controller used for assisting servomechanism controllers is disclosed in commonly assigned U.S. patent application Ser. No. 07/585,494, entitled "Intelligent Servomechanism Controller Using a Neural Network," the specification of which is incorporated herein by reference. Further applications include control systems developed for assisting robotic mechanisms.
Refinement of intelligent controllers has been proceeding apace. One example of refinement has been the use of "approximate reasoning" based controllers, e.g. controllers using "fuzzy" logic. A second example of refinement has been the use of reinforcement learning, while a third example is the use of both approximate reasoning and reinforcement learning in combination. However, heretofore developed reinforcement learning schemes have had problems in learning and generating proper control signals, particularly for complex plants such as those typically used in robotic mechanisms.
As is known, approximate reasoning based control is a technique which applies the knowledge of human experts in developing controller rules. Such controllers generally use "fuzzy set" theory. Reinforcement learning, which has a significant role in theories of learning by animals, has been developed in the theory of learning automata and has been applied to various control applications. These two schemes have been combined under the acronym "ARIC" to develop approximate reasoning based controllers which can learn from experience and exhibit improved performance. However, reinforcement learning has several problems, particularly in learning and generating proper control signals.
Reinforcement learning is used in applications where the target, or desired output, state is not exactly known. For example, if properly controlled, a robot arm designed to follow a desired trajectory path will do so while requiring a minimum amount of energy. However, the necessary action, e.g. control signal, to ensure that the energy will be minimized over time is unknown, although the desired trajectory itself is known. Optimal control methods might be available if a model of the plant (i.e. the robot arm) and an accurate performance measure exist. However, for a complex plant, a true model is very difficult, if not impossible, to obtain. Thus, for the minimization of energy, a learning system such as a reinforcement learning based neural network is useful.
The biggest challenge of such networks is how to link present action, e.g. generation of control signals, with future consequences or effects. Either reinforcement or supervised learning can be used in designing the controller for the robot arm to cause it to follow the proper trajectory when a target or desired output is known. In that case, however, supervised learning is more efficient than reinforcement learning, since learning takes a great deal of time in networks using the latter.
A reinforcement learning based controller is capable of improving performance of its plant, as evaluated by a measure, or parameter, whose states can be supplied to the controller. Although desired control signals which lead to optimal plant performance exist, the controller cannot be programmed to produce these exact signals since the true, desired outputs are not known. Further, the primary problem is to determine these optimal control signals, not simply to remember them and generalize therefrom.
Accordingly, the task of designing a reinforcement learning based controller can be divided into two parts. First, a critic network must be constructed which is capable of evaluating the performance of the subject plant in a way which is both appropriate to the actual control objective, and informative enough to allow learning. Second, it must be determined how to alter the outputs of the controller to improve the performance of the subject plant, as measured by the critic network.
Referring to FIG. 1, a conventional reinforcement learning based control system includes a plant (e.g. robotic mechanism), an action network and a critic network, connected substantially as shown. A plant performance signal output by the plant and indicative of the performance of the plant is received by both the critic and action networks. The action network provides a plant control signal for controlling the plant. The critic network receives both the plant performance and plant control signals, and provides a reinforcement signal to the action network indicating how well the plant is performing as compared to the desired plant performance. Further discussions of reinforcement learning based systems can be found in Neural Networks for Control, pp. 36-47 (W. T. Miller III, R. S. Sutton & P. J. Werbos, eds., MIT Press, 1990), and H. R. Berenji, Refinement of Approximate Reasoning-based Controllers by Reinforcement Learning, Machine Learning: Proceedings of the Eighth International Workshop, Evanston, Ill., Jun. 27-29, 1991.
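By way of illustration only, the signal flow described above can be sketched as follows. The first-order plant, the proportional action rule, the quadratic critic rule, and all names and constants are hypothetical assumptions chosen for brevity; they are not taken from FIG. 1 or from the cited references, and no learning is performed in this sketch.

```python
# Illustrative signal flow only. The plant dynamics, the action rule,
# and the critic rule below are hypothetical stand-ins.

def plant_step(x, u):
    """Toy first-order plant: returns the next plant performance signal."""
    return 0.9 * x + 0.5 * u

def action_network(x, gain):
    """Maps the plant performance signal x to a plant control signal."""
    return -gain * x

def critic_network(x, u, desired=0.0):
    """Returns a reinforcement signal from the performance and control signals.

    Values closer to zero indicate performance closer to the desired one.
    """
    return -((x - desired) ** 2) - 0.01 * u ** 2

def run_loop(x0, gain, steps):
    """Runs the closed loop and records (performance, control, reinforcement)."""
    x = x0
    history = []
    for _ in range(steps):
        u = action_network(x, gain)   # plant control signal
        r = critic_network(x, u)      # reinforcement signal; a learning action
                                      # network would use it (not done here)
        history.append((x, u, r))
        x = plant_step(x, u)          # new plant performance signal
    return history

if __name__ == "__main__":
    for x, u, r in run_loop(1.0, 1.0, 5):
        print(f"x={x:.4f} u={u:.4f} r={r:.6f}")
```

In this sketch the reinforcement signal is only recorded; in the systems discussed below it would drive adaptation of the action network, the critic network, or both.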
Conventional reinforcement learning based systems of the type shown in FIG. 1 can be broadly classified into two groups. The first group uses a critic network which is capable of providing an immediate evaluation of the plant's performance which is appropriate to the actual control objective. The gradient of the reinforcement signal as a function of the plant control signal is determined. The intent is to learn a model of the process by which the plant control signals lead to reinforcement signals. The controllers which fall within this group use heuristic dynamic programming ("HDP"), a back-propagated adaptive critic ("BAC"), dual heuristic dynamic programming ("DHP"), or globalized DHP.
In the second group, the space of plant control signal outputs is explored by modifying the plant control signals and observing how the reinforcement signal changes as a result. This is basically a trial and error way of learning, such as that studied by psychologists, in which behavior is selected according to its consequences in producing reinforcement. The theory of learning automata also falls within this category. Other types of controllers which fall within this group include an associative search network, or associative reward-penalty network, and adaptive heuristic critic ("AHC"). (Yet a third group includes a controller using back propagation through time ["BTT"].)
Referring to FIG. 2, a controller using heuristic dynamic programming uses adaptation in the evaluation, or critic, network only. The output signal J(t) of the critic network is a function of the input signal X(t) and its internal weights W, wherein the input signal X(t) comprises the plant performance signal. Intuitively, it can be seen that it would be desirable to make the output signal J(t) of the critic network an accurate representation of "how good" the plant performance is, as indicated by the plant performance signal X(t). To do this, the critic network must be trained by adjusting its weights W either after each pattern, or after passing the whole pattern set (i.e. batch learning). Some sort of supervised learning, such as back propagation, is needed for such learning. However, a problem exists in that the target is not known. Thus, before we begin adaptation of the weights W during a particular pass, we must plug in the next value of the input signal X(t), i.e. X(t+1), into the critic network using the old weights in order to calculate the target for each time period t, or pattern.
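The target calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the critic is taken to be linear in X(t), and the target is assumed to have the commonly used form U(t) + gamma * J(X(t+1)), where the utility U(t) and the discount factor gamma are not named in the foregoing description and are introduced here for illustration only. All function names are hypothetical.

```python
# Sketch of heuristic-dynamic-programming style target generation.
# ASSUMPTIONS: linear critic J = W . X; target = U(t) + gamma * J(X(t+1)).

def critic(weights, x):
    """Critic output J for input vector x and weights W (linear stand-in)."""
    return sum(w * xi for w, xi in zip(weights, x))

def hdp_targets(weights_old, states, utilities, gamma=0.9):
    """Computes one target per time period t, using the OLD weights on X(t+1)."""
    return [
        utilities[t] + gamma * critic(weights_old, states[t + 1])
        for t in range(len(states) - 1)
    ]

def pattern_update(weights, states, targets, lr=0.1):
    """Adjusts W after each pattern so that J(X(t)) moves toward its fixed target."""
    new_w = list(weights)
    for x, target in zip(states, targets):
        err = target - critic(new_w, x)
        new_w = [w + lr * err * xi for w, xi in zip(new_w, x)]
    return new_w

if __name__ == "__main__":
    w_old = [1.0, 0.0]
    xs = [[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]]
    us = [1.0, 2.0, 0.0]
    tg = hdp_targets(w_old, xs, us, gamma=0.5)
    print(tg, pattern_update(w_old, xs, tg))
```

The targets are frozen before any weight is changed during the pass, which mirrors the requirement that X(t+1) be evaluated with the old weights.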
Referring to FIG. 3, a back propagated adaptive critic adapts the weights of the action network. To do this, back propagation is used to compute the derivatives of the reinforcement signal J(t) with respect to the weights of the action network. The weights are then changed using standard back propagation, i.e. back propagating through the critic network to the plant model and to the action network, as indicated by the dashed lines. This type of controller is unsatisfactory because of the need for a plant model. For a complex system, it is difficult, if not impossible, to get a realistic model. If there is a sudden, unexpected event, the system must wait for a change in its model before responding.
To overcome the limitations of the back propagated adaptive critic network, a controller using dual heuristic dynamic programming was developed. It is very similar to a controller using heuristic dynamic programming, except for the targets. This type of critic network has multiple outputs and has a basic block diagram similar to that shown in FIG. 3. Further, this type of critic network is a "derivative type" in that it calculates the values of the targets by using the derivatives of the reinforcement signal J(t) with respect to the plant performance signal R. Here, back propagation is used as a way to get the targets, not as a way to adapt the network to match the target. However, this type of controller works only with a linear plant model, and learning takes place only within the critic network. Moreover, this type of controller uses only the value of a single output signal J(t) to evaluate all aspects of performance, thereby causing learning to be slow and not robust. Further discussion of the foregoing controllers can be found in Neural Networks for Control, pp. 67-87 (W. T. Miller III, R. S. Sutton & P. J. Werbos, eds., MIT Press, 1990).
A controller which operates in accordance with the theory of learning automata probabilistically selects and outputs control signals from among a finite set of possible control signals, and updates its probabilities on the basis of evaluative feedback. This approach is effective when the evaluation process is still stochastic, e.g. for nonassociative learning, and the task is to maximize the expected value of the reinforcement signal. This approach can be used for associative learning by using a lookup table stored with data representing the requisite mappings. However, limitations of these types of controllers include the fact that such learning is only an approximation and may not be stable.
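One well-known update rule of this kind is the linear reward-inaction scheme, sketched below for illustration. The foregoing description does not specify which update rule is used, so the choice of this scheme, the learning rate, and all function names are assumptions.

```python
import random

# Sketch of a learning automaton using the linear reward-inaction rule.
# ASSUMPTION: this particular rule is chosen only as a representative example.

def select_action(probs, rng):
    """Samples a control-signal index according to the current probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def reward_inaction_update(probs, chosen, rewarded, rate=0.1):
    """Updates the action probabilities from evaluative feedback.

    On reward, probability mass shifts toward the chosen action.
    On penalty, the probabilities are left unchanged ("inaction").
    """
    if not rewarded:
        return list(probs)
    return [
        p + rate * (1.0 - p) if i == chosen else p * (1.0 - rate)
        for i, p in enumerate(probs)
    ]

if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.5]
    # Hypothetical stochastic environment: action 1 is rewarded more often.
    reward_chance = [0.2, 0.9]
    for _ in range(200):
        a = select_action(probs, rng)
        probs = reward_inaction_update(probs, a, rng.random() < reward_chance[a])
    print(probs)
```

Because the probabilities are only nudged toward rewarded actions, the automaton can settle on a suboptimal action, which is consistent with the stability limitation noted above.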
Referring to FIG. 4, a controller using an associative reward-penalty network uses learning which is based upon the expected value of the output. However, a controller of this type requires much learning time, and such learning is not robust.
Referring to FIG. 5, a controller using an adaptive heuristic critic network relies on both its critic network and action network. This type of controller develops an evaluation function whose value V for a given state is a prediction of future discounted failure signals. Changes in V due to problem state transitions are combined with a failure signal R to form R'. For all states except those corresponding to failure, R=0 and R' is just the difference between the successive values of V, i.e. a prediction of failure. This scheme is used to teach the critic network. This is basically supervised learning.
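The formation of R' described above can be sketched as follows. The discount factor gamma, the convention that no future value is counted after a failure, the tabular critic, and the failure value R = -1 used in the usage example are assumptions made for illustration; with gamma equal to 1 and R equal to 0, R' reduces to the difference between successive values of V, as stated above.

```python
# Sketch of the internal reinforcement signal R' of an adaptive heuristic critic.
# ASSUMPTIONS: discount factor gamma; V treated as 0 after a failure state;
# a lookup-table critic instead of a network.

def internal_reinforcement(r, v_now, v_next, failed, gamma=0.95):
    """Combines the failure signal R with the change in the evaluation V."""
    if failed:
        # After failure there is no successor state to evaluate.
        return r - v_now
    return r + gamma * v_next - v_now

def critic_update(values, state, r_prime, lr=0.5):
    """Moves the evaluation of `state` in the direction indicated by R'."""
    updated = dict(values)
    updated[state] = updated.get(state, 0.0) + lr * r_prime
    return updated

if __name__ == "__main__":
    # Non-failure transition: R = 0, so R' depends only on successive V values.
    print(internal_reinforcement(0.0, -0.2, -0.5, failed=False, gamma=1.0))
    # Failure transition with a hypothetical failure signal R = -1.
    rp = internal_reinforcement(-1.0, -0.4, 0.0, failed=True)
    print(rp, critic_update({"s": 0.0}, "s", rp))
```

In this sketch R' plays the role of the error signal for the critic, which is why the text above characterizes the critic's training as basically supervised learning.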
Such supervised learning cannot be used for the action net, however, because correct actions are not known. Hence, the output layer is modified in a way that increases the probability of an action that is followed by a positive value of R, and decreases the probability of an action followed by a negative R. The change in probability is proportional to the magnitude of R' and to the difference between the action and the expected value of the action. Thus, the results of unusual actions have more impact on the weight adjustments than those of other actions. Back propagation is used to change the weights of the first layer. Further discussion of this technique can be found in C. W. Anderson, Strategy Learning With Multilayer Connectionist Representations, Proceedings of the Fourth International Workshop On Machine Learning (corrected version of report), Irvine, Calif., 1987, pp. 1-12. However, a problem with this type of controller lies in the use of supervised learning, since the actual targets for the action network output units are not known. Further, the learning in the critic network takes place mainly during failure, making learning very slow.
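The output layer adjustment described above can be sketched for a single binary output unit as follows. The logistic output unit, the learning rate, and all names are assumptions introduced for illustration; the sketch covers only the output layer and omits the back propagation used for the first layer.

```python
import math

# Sketch of the action-network output-layer update.
# ASSUMPTIONS: one stochastic binary output unit with a logistic probability;
# weight change = rate * R' * (action - expected action) * input.

def action_probability(weights, x):
    """Probability that the unit emits action 1 (its expected action value)."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-s))

def update_action_weights(weights, x, action, r_prime, rate=0.5):
    """Raises the probability of `action` if R' > 0 and lowers it if R' < 0."""
    p = action_probability(weights, x)
    return [w + rate * r_prime * (action - p) * xi for w, xi in zip(weights, x)]

if __name__ == "__main__":
    x = [1.0, 0.0]
    w = [2.0, 0.0]  # the unit strongly prefers action 1 for this input
    usual = update_action_weights(w, x, action=1, r_prime=1.0)
    unusual = update_action_weights(w, x, action=0, r_prime=1.0)
    print(usual[0] - w[0], unusual[0] - w[0])
```

In the usage example the rarely chosen action 0 produces the larger weight change, illustrating why unusual actions have more impact on the weight adjustments.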
Referring to FIG. 6, a controller using back propagation through time provides another approach to linking present action with future consequences, or to linking the present output with previous inputs. This type of controller has memory of previous time periods, and uses time derivatives of the weights, in addition to the regular derivatives used in the back propagated term in a back propagated neural network. However, this type of controller does not have provisions for handling noise in the model of the plant, and is not suitable for real-time learning.
Accordingly, it would be desirable to have an intelligent controller with an improved reinforcement learning scheme, i.e. with critic and action networks having improved learning capabilities.