In general, the present invention relates to techniques for training neural networks employed in control systems for improved controller performance. More particularly, the invention relates to a new feedback control system and associated method employing reinforcement learning with robust constraints for on-line training of at least one feedback controller connected in parallel with a novel reinforcement learning agent (sometimes referred to herein as "RL agent"). Unlike any prior attempt to apply reinforcement learning techniques to on-line control problems, the invention utilizes robust constraints along with reinforcement learning components, allowing for on-line training thereof, to augment the output of a feedback controller in operation. This allows for continually improved operation, moving toward optimal performance while effectively avoiding system instability. The system of the invention carries out at least one sequence of a stability phase followed by a learning phase. The stability phase includes the determination of a multi-dimensional boundary of values, or stability range, within which learning can take place while maintaining system stability. The learning phase comprises generating a plurality of updated weight values in connection with the on-line training; if and until one of the updated weight values reaches the boundary, a next sequence is carried out comprising determining a next multi-dimensional boundary of values followed by a next learning phase. A multitude of sequences may take place during on-line training, each sequence marked by the calculation of a new boundary of values within which RL agent training, by way of an updating of neural network parameter values, is permitted to take place.
Use of conventional reinforcement learning alone (whether or not comprising a neural network) to optimize performance of a controller nearly guarantees system instability at some point, dictating that off-line training of sufficient duration must be done, initially, with either simulated or real data sets. Furthermore, while the use of robust control theory, without more, provides a very high level of confidence in system stability, this level of stability is gained at a cost: system control is much less aggressive. Such conservative operation of a feedback control system will rarely reach optimal performance.
Two key research trends led to the early development of reinforcement learning (RL): trial-and-error learning from psychology disciplines and traditional "dynamic programming" methods from mathematics. RL began as a means for approximating the latter. Conventional RL networks interact with an environment by observing states, s, and selecting actions, a. After each moment of interaction (observing s and choosing an a), the network receives a feedback signal, or reinforcement signal, R, from the environment. This is much like the trial-and-error approach from animal learning and psychology. The goal of reinforcement learning is to devise a control algorithm, often referred to as a policy, that selects optimal actions for each observed state. Here, according to the instant invention, optimal actions include those which produce the highest reinforcements not only for the immediate action, but also for future states and actions not yet selected, the goal being improved overall performance. It is important to note that reinforcement learning is not limited to neural networks; the function and goal(s) of RL can be carried out by any function approximator, such as a polynomial or a table, rather than a neural network.
In earlier work of the applicants, Anderson, C. W., et al, "Synthesis of Reinforcement Learning, Neural Networks, and PI Control Applied to a Simulated Heating Coil." Journal of Artificial Intelligence in Engineering, Vol. 11, #4 pp. 423-431 (1997) and Anderson, C. W., et al, "Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil." Solving Engineering Problems with Neural Networks: Proceedings of the International Conference on Engineering Applications of Neural Networks (EANN-96), ed. by Bulsari, A. B. et al. Systems Engineering Association, Turku, Finland, pp. 135-142 (1996), experimentation was performed on the system as configured in FIG. 8 of the 1997 reference above. In this prior work, applicants trained the reinforcement learning agent off-line for many repetitions, called trials, of a selected number of time-step interactions between a simulated heating coil and the combination of a reinforcement learning tool and the PI controller, to gather data set(s) for augmenting (by direct addition, at C) the output of the PI controller during periods of actual use to control the heating coil. In this 1997 prior work, applicants defined and applied a simple Q-learning type algorithm to implement the reinforcement learning.
In continuing to analyze and characterize on-line training of a neural network connected to a feedback controller, it was not until later that the applicants identified and applied the unique two-phase technique of the instant invention, thus allowing for successful on-the-fly, real-time training of a reinforcement learning agent in connection with a feedback controller, while ensuring stability of the system during the period of training. Conventionally, reinforcement learning had been applied to find solutions to control problems by learning good approximations to the optimal value function, J*, given by the solution to the Bellman optimality equation, which can take the form identified as Eqn. (1) in Singh, S., et al, "Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems." (undated). And as mentioned earlier, when conventional RL is placed within a feedback control framework, it must be trained off-line in a manner that exposes the system to a wide variety of commands and disturbance signals, in order to become 'experienced'. This takes a great deal of time and extra expense.
The conventional techniques used to train neural networks off-line can become quite costly: not only are resources spent in connection with off-line training time, but additional resources are spent when employing feedback controllers operating under conservative, less-aggressive control parameters. For instance, U.S. Pat. No. 5,448,681 issued Sep. 5, 1995 to E. E. R. Khan refers to what it identifies as a conventional reinforcement learning based system shown in Khan's FIG. 1. A closer look at Khan '681 reveals that no suggestion of stability is made. Khan does not attempt to control an interconnected controller on-line with its reinforcement learning subsystem (FIG. 1). Further, Khan simply doesn't recognize or suggest any need for a stability analysis. Here, the conventional Khan system has to learn everything from scratch, off-line.
While there have been other earlier attempts at applying conventional notions of reinforcement learning to particular control problems, until applicants devised the instant invention, the stability of a feedback control system into which conventional reinforcement learning was incorporated for on-line learning simply could not be guaranteed. Rather, one could expect that this type of conventional feedback control system, training itself on-the-fly, will pass through a state of instability in moving toward optimal system performance (see FIG. 4 hereof, particularly the path of weight trajectory 44 without application of constraints according to the invention). While academic study of conventional systems is interesting to note, in practice these systems are of little interest to an operator: such a system will crash before reaching an optimal state. A control system employing the robust constraints of the two-phase technique of the instant invention, by contrast, will not, as one will better appreciate by tracing the lower weight trajectory 46 plotted in FIG. 4, representing that of a system operating according to the instant invention.
It is a primary object of the invention to provide a feedback control system for automatic on-line training of a controller for a plant to reach a generally optimal performance while maintaining stability of the control system. The system has a reinforcement learning agent connected in parallel with the controller. As can be appreciated, the innovative system and method employ a learning agent comprising an actor network and a critic network operatively arranged to carry out at least one sequence of a stability phase followed by a learning phase, as contemplated and described herein. The system and method can accommodate a wide variety of feedback controllers controlling a wide variety of plant features, structures and architectures, all within the spirit and scope of design goals contemplated hereby. Advantages of providing the new system and associated method include, without limitation:
(a) System versatility;
(b) Simplicity of operation: automatic, unmanned long-term operation;
(c) Speed with which an optimal state of system control may be reached; and
(d) System design flexibility.
Briefly described, once again, the invention includes a feedback control system for automatic on-line training of a controller for a plant. The system has a reinforcement learning agent connected in parallel with the controller. The learning agent comprises an actor network and a critic network operatively arranged to carry out at least one sequence of a stability phase followed by a learning phase. During the stability phase, a multi-dimensional boundary of values is determined. During the learning phase, a plurality of updated weight values is generated in connection with the on-line training, if and until one of the updated weight values reaches the boundary, at which time a next sequence is carried out to determine a next multi-dimensional boundary of values followed by a next learning phase.
In a second characterization, the invention includes a method for automatic on-line training of a feedback controller within a system comprising the controller and a plant by employing a reinforcement learning agent comprising a neural network to carry out at least one sequence comprising a stability phase followed by a learning phase. The stability phase comprises the step of determining a multi-dimensional boundary of neural network weight values for which the system's stability can be maintained. The learning phase comprises the step of generating a plurality of updated weight values in connection with the on-line training; and if, during the learning phase, one of the updated weight values reaches the boundary, carrying out a next sequence comprising the step of determining a next multi-dimensional boundary of weight values followed by a next learning phase.
In a third characterization, the invention includes a computer executable program code on a computer readable storage medium, for on-line training of a feedback controller within a system comprising the controller and a plant. The program code comprises: a first program sub-code for initializing input and output weight values, respectively, Wt and Vt, of a neural network; a second program sub-code for instructing a reinforcement agent, comprising the neural network and a critic network, operatively arranged in parallel with the controller, to carry out a stability phase comprising determining a multi-dimensional boundary of neural network weight values for which the system's stability can be maintained; and a third program sub-code for instructing the reinforcement agent to carry out a learning phase comprising generating a plurality of updated weight values in connection with the on-line training if and until any one of the updated weight values reaches the boundary, then instructing the reinforcement agent to carry out a next sequence comprising determining a next multi-dimensional boundary of weight values followed by a next learning phase. The first program sub-code can further comprise instructions for setting a plurality of table look-up entries of the critic network to zero; and the third program sub-code can further comprise instructions for reading into a memory associated with the neural network a state variable, s, to produce a control signal output, a, and reading into a memory associated with the critic network a state and action pair to produce a value function, Q(s, a). The program code can further comprise instructions for exiting any of the learning phases for which a total number of the updated weight values generated reaches a preselected value.
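By way of illustration only, the table look-up critic mentioned above might be sketched as follows. The class name `TableCritic`, the uniform binning of a scalar state and a scalar action, and the value ranges are all assumptions made for this sketch; the text specifies only that the table entries are set to zero and that a state and action pair yields Q(s, a).

```python
class TableCritic:
    """Hypothetical table look-up critic: a discretized (s, a) pair maps to Q(s, a)."""

    def __init__(self, s_range, a_range, bins=10):
        self.s_lo, self.s_hi = s_range
        self.a_lo, self.a_hi = a_range
        self.bins = bins
        # Every table look-up entry is set to zero at initialization.
        self.table = [[0.0] * bins for _ in range(bins)]

    def _index(self, x, lo, hi):
        # Clamp x into [lo, hi], then map it to a bin index in 0..bins-1.
        frac = (min(max(x, lo), hi) - lo) / (hi - lo)
        return min(int(frac * self.bins), self.bins - 1)

    def cell(self, s, a):
        # Table coordinates of the (state, action) pair.
        return (self._index(s, self.s_lo, self.s_hi),
                self._index(a, self.a_lo, self.a_hi))

    def q(self, s, a):
        # Read the value function Q(s, a) from the table.
        i, j = self.cell(s, a)
        return self.table[i][j]


# Assumed ranges: state and action both normalized to [-1, 1].
critic = TableCritic(s_range=(-1.0, 1.0), a_range=(-1.0, 1.0), bins=10)
```

A uniform grid is only one possible discretization; any other suitable function approximator could stand in for the table.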
There are many further distinguishing features of the control system and method of the invention. The actor network preferably includes a neural network such as a feed-forward, two-layer network parameterized by input and output weight values, respectively, W and V. Input into the neural network is at least one state variable, s, such as a tracking error, e, along with one or more other state variables of the controller. The critic network can include a table look-up mechanism, or other suitable function approximator, into which a state and action pair/vector are input to produce a value function therefor. The critic network is preferably not interconnected as a direct part of the control system feedback loop. The state and action pair can include any such state, s, and a control signal output from the actor network, a, to produce, accordingly, the value function, Q(s, a). The multi-dimensional boundary of values is preferably a stability range which can be defined by perturbation weight matrices, dW and dV, in the two-dimensional case, and up to any number of perturbation matrices, thus creating a higher-dimensional stability space, depending on neural network parameterization characteristics.
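A feed-forward, two-layer actor network of the kind described above might be sketched as follows. The tanh hidden activation, the linear output layer, and the constant bias input are assumptions of this sketch rather than details given in the text, and the function name `actor_output` is hypothetical.

```python
import math


def actor_output(s, W, V):
    """Two-layer feed-forward actor: state s -> tanh hidden layer -> linear output a.

    W holds one row of input weights per hidden unit (last entry is a bias weight);
    V holds one row of output weights per action component.
    """
    # Append a constant 1.0 so the last weight in each row of W acts as a bias
    # (an assumption of this sketch).
    x = list(s) + [1.0]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    # Linear output layer producing the control signal a.
    return [sum(v * h for v, h in zip(row, hidden)) for row in V]


# Example: one state variable (tracking error e = 0.5), one hidden unit, one output.
a = actor_output([0.5], W=[[1.0, 0.0]], V=[[2.0]])
```

The state vector here holds only the tracking error; further controller state variables would simply lengthen `s` and each row of `W`.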
Input and output weight values, respectively, W and V, of the neural network can be initialized by randomly selecting small numbers such as, for example, selecting numbers from a Gaussian distribution having a mean equal to zero and some small variance such as 0.1. Input and output weight values for any current step t can be designated, respectively, Wt and Vt. The control signal output from the actor network preferably contributes, along with an output from the controller, to an input of the plant. In order to determine the next multi-dimensional boundary of values, an initial guess, P, of said stability range can be made; this initial guess, P, being proportional to a vector N, according to the expressions below:
\[
N = (W_t, V_t) = (n_1, n_2, \ldots), \qquad P = \frac{N}{\sum_i n_i}
\]
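The initialization and the initial guess P can be sketched as below, under stated assumptions. The function names `init_weights` and `initial_guess` are hypothetical, the weight matrices are held as nested lists, and the guard against a zero sum is an addition of this sketch that the expressions above do not address.

```python
import math
import random


def init_weights(rows, cols, variance=0.1, rng=random):
    """Draw small random weights from a zero-mean Gaussian with the given variance."""
    std = math.sqrt(variance)  # gauss() takes a standard deviation, not a variance
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def initial_guess(W_t, V_t):
    """Return P = N / sum_i(n_i), where N = (W_t, V_t) flattened into one vector."""
    N = [w for row in W_t for w in row] + [v for row in V_t for v in row]
    total = sum(N)
    if total == 0.0:
        # Assumption of this sketch: the normalization is undefined here.
        raise ValueError("sum of weights is zero; P is undefined")
    return [n / total for n in N]


# Worked example with fixed weights: N = (1, 2, 3, 4, 5, 5), sum = 20.
P = initial_guess([[1.0, 2.0], [3.0, 4.0]], [[5.0], [5.0]])
```

In the worked example, each entry of P is the corresponding weight divided by 20, so the entries of P sum to one.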
In the event one of the updated weight values reaches the first boundary, a next sequence is carried out to determine a next multi-dimensional boundary of values and to generate a plurality of next updated weight values. In the event one of the next updated weight values reaches this next boundary, a third sequence is carried out to determine a third multi-dimensional boundary of values and to generate a plurality of third updated weight values; and so on, targeting a generally optimal state of system control, until a system disturbance occurs, thus launching another series of sequences, each including a stability phase and learning phase, allowing for on-line training of the RL agent in a manner that maintains system stability while targeting a state of optimal system control. For example, the method may be carried out such that one of the next updated weight values reaches the next boundary so that a third sequence is carried out to determine a third multi-dimensional boundary of values comprising a third stability range and to generate a plurality of third updated weight values; thereafter, one of these third updated weight values reaches its third boundary so that a fourth sequence is carried out to determine a fourth multi-dimensional boundary of values comprising a fourth stability range and to generate a plurality of fourth updated weight values.
It is possible that only a couple of sequences may need to be carried out, or that a large number of sequences may be needed, to reach acceptably optimal system control. During each learning phase, preferably to refrain from engaging the learning phase for an indefinite time with little or no improvement to control performance, on-line training is performed either until a current boundary is reached or until a total number of updated weight values reaches a preselected value, at which time the current learning phase is exited. And, if optimal performance has been reached during that current learning phase such that no further on-line training of the reinforcement learning agent is necessary, no new sequence need be carried out. If, on the other hand, the total number of updated weight values generated equals the preselected value and optimal performance has not been reached, then a next boundary is determined, providing a new stability range within which a subsequent learning phase can be carried out.
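The alternation of stability and learning phases, together with the two exit conditions just described, might be sketched as follows. This is a minimal illustration, not the applicants' implementation: the callbacks `stability_phase`, `update_step` and `is_optimal` are hypothetical placeholders, the boundary is modeled as a per-weight radius around the weights held at the start of each sequence, and clipping a weight onto the boundary when it is reached is an assumption of this sketch.

```python
def run_sequences(weights, stability_phase, update_step, is_optimal,
                  max_updates, max_sequences):
    """Alternate stability and learning phases; return final weights and exit log."""
    log = []
    for _ in range(max_sequences):
        # Stability phase: determine the boundary (per-weight radius) around
        # the current weights. How it is computed is left to the callback.
        center = list(weights)
        radius = stability_phase(center)

        # Learning phase: update until the boundary is reached or until the
        # preselected number of updates has been generated.
        reason = "max_updates"
        for _ in range(max_updates):
            weights = list(update_step(weights))
            hit = False
            for i, (w, c, r) in enumerate(zip(weights, center, radius)):
                if abs(w - c) >= r:
                    # Assumption: hold the weight on the boundary, not past it.
                    weights[i] = c + r if w > c else c - r
                    hit = True
            if hit:
                reason = "boundary"
                break
        log.append(reason)

        # No new sequence is needed once performance is judged optimal.
        if is_optimal(weights):
            break
    return weights, log


# Toy example: one weight, fixed radius 1.0, each update adds 0.5.
final, log = run_sequences(
    [0.0],
    stability_phase=lambda c: [1.0],
    update_step=lambda w: [x + 0.5 for x in w],
    is_optimal=lambda w: w[0] >= 2.0,
    max_updates=10,
    max_sequences=5,
)
```

In the toy example each learning phase ends on the boundary after two updates, and the second sequence ends with the weight at 2.0, where the hypothetical optimality test stops further sequences. The `max_sequences` cap is a safeguard added for the sketch and has no counterpart in the text.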