1. Field of the Invention
The present invention is directed to a method of providing stable control of a system (e.g., an electrical power grid, a factory, and a financial prediction system) using a neural network-based design, and more particularly, to a neural network-based control system and method using critics.
2. Discussion of the Background
The field of intelligent control is an expansive area that has attempted to solve many complex problems. Significant research has previously been performed in the areas of classical control theory [1-6] and biologically-inspired intelligent control [7-15]. As an outgrowth of that and other research, researchers have attempted to solve the Generalized Moving Target (GMT) problem [16,17], which is defined as follows: for a function E(v, w), find two vectors v and w such that the following two conditions are met: (1) E is minimized with respect to v for w fixed; (2) w = v.
Other research includes two additional classes of control design. The first class is "adaptive control," in the tradition of Narendra, which includes both linear (multiple-input multiple-output, MIMO) designs [1] and nonlinear or neural extensions [18,19]. The second class is learning-based Approximate Dynamic Programming (ADP) [7-15], which has sometimes been presented as a form of "reinforcement learning" [20-22], sometimes called "adaptive critics" [23] and sometimes called "neuro-dynamic programming" [24].
Previous forms of adaptive control discussed by Narendra had difficulty in ensuring stability even in the linear-quadratic case. Those designs appear very similar to a particular ADP design, HDP+BAC [8,22,25], that was developed back in 1971 and reported in Werbos' Ph.D. proposal [26]. In fact, when HDP+BAC is applied to the linear-quadratic case, it reduces to a standard indirect adaptive control (IAC) design, except that there is a provision for adapting one extra matrix which, in principle, should permit much stronger stability guarantees. Roughly speaking, ordinary IAC designs are based on the minimization of tracking error, which is sometimes just:
$$\sum_i \left( x_i^{(ref)}(t+1) - x_i(t+1) \right)^2 = \underline{e}(t+1)^T \, \underline{e}(t+1), \tag{1}$$
where the vector x^(ref)(t) represents the desired state of the plant or external environment at time t, where x(t) represents the actual state, and e(t) represents the gap or error between the two. In the same situation, HDP+BAC reduces to the same overall design, except that:
$$\hat{J}(t+1) = e(t+1)^T C \, e(t+1), \tag{2}$$
is minimized, where C is a matrix of "Critic weights" to be adapted by HDP. In the nonlinear case, equation (2) is replaced by an artificial neural network (ANN), in order to allow Ĵ to approximate any nonlinear function. (In HDP and GDHP, it is possible to supplement the training of the Action network, to include some stochastic search (particularly in the offline mode) to help keep the system out of local minima [46]. Barto, Thrun and others have discussed other approaches to adding noise to the Action network or controller [8,21].)
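As a small illustration of equation (2), the following sketch evaluates the quadratic form e^T C e in plain Python. The function name and the numerical values are hypothetical and chosen only for illustration; with C set to the identity matrix the result is the ordinary squared tracking error.

```python
def quadratic_critic(e, C):
    """Return J_hat = e^T C e for an error vector e and a square weight matrix C."""
    n = len(e)
    return sum(e[i] * C[i][j] * e[j] for i in range(n) for j in range(n))


# Hypothetical error vector and Critic weights, for illustration only.
e = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
C = [[2.0, 0.5], [0.5, 1.0]]

print(quadratic_critic(e, identity))  # plain squared error: 1 + 4 = 5.0
print(quadratic_critic(e, C))         # weighted by C: 2 - 1 - 1 + 4 = 4.0
```

The only difference between the two calls is the matrix C, which is the extra quantity that HDP is said to adapt.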
In theory, if the critic weights C or the Critic network converged to the right values (the values which satisfy the Hamilton-Jacobi-Bellman equation [4,5,24,28]), then the function Ĵ would serve as a Liapunov function guaranteed to stabilize the overall system, if the system is controllable. However, none of the established methods for adapting a Critic possesses quadratic unconditional stability, even in the linear deterministic case. The variations of these methods based on the Galerkin approach to solving differential equations do possess unconditional stability in that case, but they almost always converge to the wrong weights in the linear stochastic case.
The notation herein generally tracks the sources upon which it draws. However, two particular choices have been made. "C", "b" and "W" are used for the estimated values of certain sets of parameters or weights, and "C*" or "b*" are used for their true values. By contrast, for the functions J and λ, the usual three-fold convention is used: "J" and "λ" for the true values, "J*" and "λ*" for target values used in adaptation, and "Ĵ" and "λ̂" for the estimated values. The time derivative of the state vector is written as "∂_t x" in linear adaptive control, rather than the usual x-dot. The notation "∂_t" has long been part of the standard notation in physics.
In principle, the ideal feedback control system should try to optimize some mix of stability and performance, in the face of three general types of uncertainty:
(1) High-bandwidth random disturbances, which are usually represented as a stochastic process, based on random noise [4,5,26,29], but are sometimes represented as bounded disturbances of unknown structure [1];
(2) Drifting values (and occasional abrupt changes) in familiar process parameters such as friction, viscosity and mass, often due to the aging of a plant or changes in general environmental conditions;
(3) Uncertainty about the fundamental structure of the plant or environment (sometimes due to catastrophic events, like a wing being shot off of an airplane), and shifts of parameters in ways that could not be anticipated or simulated even as possibilities at the time when the controller is developed.
This three-fold distinction is difficult to formalize in mathematical terms, but it has great practical importance. Roughly speaking, the ability to respond to the first type of disturbance or uncertainty may be called "stochastic feedback control." The ability to respond to the second type may be called "adaptation," and the ability to respond to the third type is called "learning." The practical tradeoffs here are discussed at some length in the introductory review in [7], and in other papers cited in [7].
"Adaptive control" [1-3] has often been viewed as a tool for addressing the second type of uncertainty, namely uncertainty about drifting plant parameters. The most classical designs in adaptive control are intended to control plants x(t) governed by the equations:
$$\partial_t x = A x + B u, \tag{3}$$
or:
$$x(t+1) = A x(t) + B u(t), \tag{4}$$
where u is a vector of controls, where A and B are unknown matrices representing the parameters of the plant, and where ∂_t represents differentiation with respect to time. In the simplest case, the state vector x is directly observable. The key idea is to develop control designs which estimate the matrices A and B, explicitly or implicitly, as the process unrolls in real time, and converge "on the fly" to a good control strategy:
$$u(t) = K(t) x(t), \tag{5}$$
despite the ignorance about A and B. In the general case, it is assumed that x(t) is not directly observable. Instead, a vector v governed by:
$$v(t) = H x(t) \tag{6}$$
is observed. Roughly speaking, this requires the use of a more complex control strategy like:
$$u(t) = K_{1,1} v(t-1) + K_{1,2} v(t-2) + \dots + K_{1,k} v(t-k) + K_{2,1} u(t-1) + \dots + K_{2,k} u(t-k), \tag{7}$$
for some integer k. (See [1, p.411].) There exists a huge body of stability theorems for adaptive control, both in the original linear versions [1-3] and in a variety of nonlinear and neural extensions (e.g., [18,19]).
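The control law of equation (7) can be sketched as a weighted sum over the last k observations and the last k controls. The function names, the list-based matrix layout and the numbers below are assumptions made for illustration; they are not taken from the cited references.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def control_from_history(K1, K2, v_hist, u_hist):
    """Evaluate equation (7).

    K1[j] and K2[j] stand for the gain matrices K_{1,j+1} and K_{2,j+1};
    v_hist[j] and u_hist[j] hold v(t-1-j) and u(t-1-j) for j = 0..k-1.
    """
    u = [0.0] * len(K1[0])  # one entry per control component
    for K, past in list(zip(K1, v_hist)) + list(zip(K2, u_hist)):
        u = [acc + term for acc, term in zip(u, matvec(K, past))]
    return u


# Hypothetical example with k = 2, a two-dimensional v and a scalar u.
K1 = [[[0.5, 0.0]], [[0.0, 0.25]]]
K2 = [[[0.1]], [[0.0]]]
v_hist = [[2.0, 4.0], [1.0, 8.0]]   # v(t-1), v(t-2)
u_hist = [[10.0], [3.0]]            # u(t-1), u(t-2)
print(control_from_history(K1, K2, v_hist, u_hist))  # approximately [4.0]
```

In an adaptive design the gain matrices would be adjusted over time; here they are fixed so that only the structure of the control law is shown.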
In practice, however, all of these stability theorems require very strong assumptions about the plant or environment to be controlled. Ordinary adaptive control and neural adaptive control have often exhibited problems with stability and with slow response to transient disturbances, particularly in real-world plants containing delays and deadzones and reversal phenomena, etc.
Because of these problems, the most powerful approaches available today to cope with type-two uncertainty in real engineering applications are:
(1) Adaptive-predictive control [2], in the form which explicitly minimizes tracking error over multiple time periods ahead into the future. This involves a substantial increase in computational complexity, for computations that must be performed in real time.
(2) A two-step design process, performed offline before the controller is actually used on the real commercial plant. In the first step, one designs a controller containing very strong feedback loops and/or observers, able to estimate (explicitly or implicitly) the specific plant parameters that are expected to drift. These loops take over the role which the estimates of "A" and "B" would play in an adaptive control approach. In the second step, one exploits prior knowledge about the specific plant, including knowledge about its uncertainties, in order to tune or train this controller. The controller is tuned or trained for optimal performance over multiple time periods, in some kind of offline training process.
Optimization may be done by using (1) an n-period lookahead method, analogous to model-predictive control [2,8], or (2) dynamic programming. The second approach often has very good response to transient disturbances, because of how it exploits prior knowledge about the plant. The first approach, however, appears to have had very limited success. Astrom's work in general [2] is well-respected for the many industrial applications it has led to.
Athans and Baras [6] have been very strong spokesmen for the second approach, within the nonlinear robust control community. Athans, in particular, has worked with many aerospace applications, where freezing the dynamics of a controller prior to actual flight can simplify the process of obtaining government approval. Robust control theorists have shown that type-one disturbances convert the problem of controller design into a problem of stochastic optimization. (This is true no matter how the disturbances are represented, either as stochastic noise or as "worst case noise" in the spirit of Tsypkin.) Nonlinear stochastic optimization problems require the use of dynamic programming, rather than a simple n-step lookahead. In order to find numerical solutions to such stochastic optimization problems, one can simply use learning-based ADP, applied in an offline learning mode.
In the neural network field, the second approach was first proposed in 1990 [30] and called "learning offline to be adaptive online." In one embodiment, a Time-Lagged Recurrent Network (TLRN) is used as a controller [8, ch. 10; 26, ch. 8] with feedback loops in the control network. Additional inputs can be fed to the controller, by including the intermediate outputs of TLRNs trained to perform system identification of the plant. Such intermediate outputs serve as neural observers. Learning offline to be adaptive online is the foundation of Ford's "multistreaming approach" [7,31], which underlies the most impressive success of neural networks to date in real-world control applications. Most of the Ford work has used n-step lookahead optimization, using backpropagation through time (BTT) to minimize the computational costs. With BTT and special chips, Ford expects to be able to extend this work to include adaptation on the fly, so as to upgrade the initial controller developed offline. BTT [26] can be used to reduce the cost of computing exact derivatives through any feedforward differentiable system, not just neural networks. (For systems which are not feedforward, see the review in [32]. See www.nd.com for some tools to implement BTT and various special cases of TLRN.)
Many researchers have been overwhelmed by the sheer complexity of biological brains. Particularly in the cerebral cortex of mammals, there is a bewildering array of cells that perform a wide variety of tasks. Many people have despaired of finding any universal principles of information processing at work there, which could be understood in mathematical terms. However, studies of mass action in the cerebral cortex have demonstrated that cells in one part of the cortex can learn to take over the functions of other cells, when there is a need to do so (and when the required data inputs are not cut off). (See the work of Lashley, Pribram and Freeman and [9-11,26].) The problem therefore lies in developing the right kind of mathematics. Since the overall function of a biological brain is to compute actions or decisions, learning-based intelligent control may someday provide the required mathematics. (See [9-11] for more specific links between ADP designs and the brain.)
Nevertheless, the role of real-time learning should not be overstated, even in biology. Since the time of Freud at least, it has been well known that organisms remember past experiences, and adapt their general models or expectations about their environment based on some combination of present experience and memories. Back in 1977 [25], a simple learning scheme was proposed based on the more general idea of "syncretism," an interpretation of Freud's vision of the interplay between memory and generalization. Syncretism promises substantial practical benefits for neural network adaptation [34; 8, ch. 3]. Early successes of "memory-based learning" [35] support this general approach. More recently, McClelland and others have promulgated theories of the interplay between memory and generalization that also embody a form of this idea [36], although the inventor has previously predicted that this interplay mainly occurs within the cerebral cortex and the hippocampus, rather than between them, as suggested by McClelland. Recent arguments by Pribram, Alkon and others tend to support the former view [36]. But in any case, the psychological evidence for the phenomenon as such appears very convincing. Within the world of artificial intelligence, researchers such as Rosalind Picard of MIT have developed learning methods that also embody this general approach.
In an ideal world, in the linear deterministic case, a universal adaptive controller would exist with the following properties. First, the design and the adaptation rule would produce the output vector u(t) at each time t, based only on knowledge about u(τ) and v(τ) at previous times τ<t. It would not require knowledge about A, B or H, except for the dimensionality of u and v (and perhaps x). It should be guaranteed to send tracking error to zero, asymptotically, at an exponential rate, for all combinations of A, B and H and reference model for which this would be possible with a fixed controller. (The "reference model" is the process, generally considered linear here, which outputs the desired observed state v*(t).) In other words, the ideal adaptive controller should possess the same kinds of strong stability guarantees that exist for more familiar kinds of control, such as fixed-structure LQG optimal control [4,5].
Simplistic adaptive control designs available today fall far short of this ideal. (Certain hybrid designs of Astrom [2] are more complex and are discussed later.) The bulk of the existing theory for linear adaptive control focuses on the Single-Input Single-Output (SISO) special case, that is, the case where v and u are scalars. Yet even in that case, stability is normally guaranteed only under three restrictions (in addition to restrictions on the reference model). Those restrictions are: (1) information must be available to choose the integer "k" in equation 7; (2) the plant must be "minimum phase"; and (3) the sign of the high-frequency gain must be known. Narendra and Annaswamy go on to state [1, p. 359]: ". . . it was realized that these assumptions were too restrictive, since in practice they are violated by most plants, even under fairly benign conditions."
The problem with the sign of the gain may be illustrated by a simplified, commonsense example. A fisherman has complete control over the big lake in which he fishes. The "reference model" is simply a certain quota (in weight) of fish that must be caught every year. One month the quota is not met. The lake is usually capable of producing enough fish to meet the quota, in the long term, but an unknown disturbance has caused the fish population to go through a minor dieback. Following the usual policy of minimizing tracking error at time t+1, the fisherman increases his catch to arrive back at the quota. However, by fishing at a higher level, the fish population is reduced still further. This dips into the smaller fish that could have provided more growth in the future. Thus to meet the quota in the following month, even more fish are caught. In the end, the entire fish population dies off. That is, things go nonlinear, resulting in a catastrophic instability. This example leads to a host of further implications. But the key point is that policies that reduce tracking error (or maximize utility) in the short term may have undesirable or even catastrophic effects in the more distant future. Knowing the signs of the gains is crucial to avoiding those kinds of problems in adaptive control designs that do not look beyond time t+1. (The "minimum phase" assumption has similar, related pervasive implications, discussed at length by many researchers, including Widrow [37].)
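The fisherman example can be reproduced with a toy simulation. The sketch below assumes a logistic growth model for the fish population and a policy that always tries to meet the quota in the current period; the growth law, the parameter values and the function name are all hypothetical and are used only to make the instability visible.

```python
def simulate_myopic_fishing(population, quota=90.0, growth=0.4,
                            capacity=1000.0, periods=50):
    """Simulate a policy that minimizes only the current quota shortfall.

    Assumed (hypothetical) dynamics: logistic regrowth followed by the catch.
    """
    history = [population]
    for _ in range(periods):
        regrowth = growth * population * (1.0 - population / capacity)
        available = population + regrowth
        catch = min(quota, available)   # meet the quota whenever possible
        population = max(0.0, available - catch)
        history.append(population)
    return history


healthy = simulate_myopic_fishing(600.0)    # no dieback
depleted = simulate_myopic_fishing(300.0)   # after a dieback
print(round(healthy[-1], 1))   # settles well above 600
print(depleted[-1])            # 0.0, the population has collapsed
```

With these assumed numbers the quota is sustainable from a population of 600, but the same quota-tracking policy drives the population to zero when it starts from 300, which is the qualitative behavior described in the example.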
Narendra and Annaswamy discuss methods that avoid the need to know the sign of the high-frequency gain, for the SISO case. However, those methods use a very special-purpose trick, the Nussbaum gain, which does not appear to carry over to the general, MIMO case in a serious way. The requirements for prior knowledge in the MIMO case [1, ch. 10] are far more complex and demanding than in the SISO case. After all, in the SISO case, there are only two possibilities for the sign of the high-frequency gain, which is a scalar: plus or minus. In the MIMO case, there are an infinite number of possible directions or modes of instability, and the requirements are very complex. Narendra has recently found another way to avoid having to know the sign of the gain, by using multiple models [38,39] in the SISO case. However, the minimum phase assumption is still required, and a universal MIMO controller developed in this way would be extremely complicated, if possible.
Difficulties with these restrictions probably explain why Lyle Ungar, of the University of Pennsylvania, found unstable results when he tested the most advanced direct, indirect and hybrid direct-indirect adaptive control designs on the bioreactor benchmark test problem given in [22]. There may even be an analogy between the harvesting of cells from a bioreactor and the harvesting of fish from a lake. This same problem was later solved directly and efficiently, both by neural model-predictive-control (based on BTT) and by ADP methods, in papers from Texas Tech [14] and from Ford Motor Company [34].
Narendra and Annaswamy [1] discuss both direct and indirect adaptive control designs for the linear case. However, in 1990, Narendra [22, p. 135] stated that until "direct control methods are developed, adaptive control of nonlinear dynamical systems has to be carried out using indirect control methods." In 1992, he stated [8, p. 168] that "unless further assumptions concerning the input-output characteristics of the plant are made, direct adaptive control is not possible."
FIG. 1 is essentially identical to the nonlinear adaptive control design discussed by Narendra in [22, p. 135], [8, p. 168], [41, p. 166] and elsewhere. In the neural network context, this application uses the capital letter X, rather than v, to indicate the vector of observables. R is used to indicate the estimated state vector (or "representation of reality," usually based on recurrent neurons). Moreover, the phrase "Model network" is used to describe what Narendra calls the identification network, N_i. The phrase "Action network" is used to describe what he calls the controller network, N_c. Both Narendra and this application assume that the Model network may be adapted in real time, by an adaptation rule independent from what is used to adapt the Action network. However, in his flowcharts, Narendra adds a few arrows (labeled "e_i") to give a hint of how the Model network might be adapted.
Numerous implementations of the design in FIG. 1 have been reported in journal articles all over the world. Two points must, however, be known:
(1) How to adapt the Model network. Numerous ways [8,32] of doing this have been suggested with varying degrees of robustness. For linear adaptive control, simple least mean squares (LMS) learning may be adequate. (For example, see [1,p.402], [42].)
(2) How to adapt the Action network. Narendra proposes that the weights W_ij in the Action network be adapted in proportion to the derivatives of tracking error with respect to the weights. This may be written:
$$W_{ij}(t+1) = W_{ij}(t) - \alpha \frac{\partial}{\partial W_{ij}} \left( E(t+1) \right), \tag{8}$$
where:
$$E(t+1) = (e(t+1))^2 = (X^*(t+1) - X(t+1))^2 \tag{9}$$
is the tracking error at time t+1 and α is some arbitrary (small, positive) learning rate. Narendra describes calculating the derivatives in equation 8 by "backpropagating through the Model network." How to do this is described at length both in Narendra's work and, for a broader class of possible network structures (neural or nonneural), in other references [8,26]. The broken arrows in FIGS. 1 and 2 represent the backwards calculations used to obtain the required derivatives at minimum computational cost. Additional literature discusses (1) alternative ways to choose the learning rate and (2) alternative gradient-based learning rules.
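For point (1) above, adapting a linear Model network can be sketched with a least-mean-squares style update on a scalar plant x(t+1) = a x(t) + b u(t). The sketch below uses a normalized LMS step, which is a common variant rather than necessarily the rule used in the cited references; the plant parameters, the sinusoidal test input and the function name are hypothetical.

```python
import math


def identify_plant(a_true=0.8, b_true=0.5, steps=500, step_size=1.0, eps=1e-9):
    """Estimate (a, b) of a scalar plant with a normalized LMS update.

    The plant values and the test input are assumptions for illustration.
    """
    a_hat, b_hat = 0.0, 0.0
    x = 0.0
    for t in range(steps):
        u = math.sin(0.7 * t)                 # exciting test input
        x_next = a_true * x + b_true * u      # observed plant response
        error = x_next - (a_hat * x + b_hat * u)
        norm = eps + x * x + u * u            # normalization term
        a_hat += step_size * error * x / norm
        b_hat += step_size * error * u / norm
        x = x_next
    return a_hat, b_hat


a_hat, b_hat = identify_plant()
print(round(a_hat, 4), round(b_hat, 4))  # close to 0.8 and 0.5
```

The estimates converge here because the data are noise-free and the input keeps the regressor (x, u) changing direction; with noise or a poorly exciting input the behavior would differ.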
FIG. 1 has many limitations, as discussed herein and in other references [41]. Applied to the case of a fully observable linear plant, equation 8 reduces to:
$$K(t+1) = K(t) - \alpha B^T e(t+1) x^T(t), \tag{10}$$
where the weights in the Action network are now just the matrix K of equation 5. "Backpropagating through a network" (like the Model network, which reduces to equation 4 in the linear deterministic case) is simply a low-cost way to multiply a vector by the transpose of the Jacobian of that network; it reduces costs by exploiting the internal structure of the network, and working backwards.
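The statement that backpropagation multiplies a vector by the transpose of the Jacobian can be checked on the linear model of equation 4, whose Jacobian with respect to u is simply B. The sketch below compares the backpropagated derivative of a squared tracking error with a finite-difference estimate; the matrices, vectors and function names are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def model(A, B, x, u):
    """Linear Model network: x(t+1) = A x(t) + B u(t)."""
    return [ax + bu for ax, bu in zip(matvec(A, x), matvec(B, u))]


def squared_error(x_ref, x_next):
    return sum((r - x) ** 2 for r, x in zip(x_ref, x_next))


def backprop_to_controls(B, x_ref, x_next):
    """dE/du = B^T (dE/dx(t+1)), with dE/dx(t+1) = -2 (x_ref - x_next)."""
    grad_x = [-2.0 * (r - x) for r, x in zip(x_ref, x_next)]
    return matvec(transpose(B), grad_x)


# Hypothetical two-state plant with a single control.
A = [[0.9, 0.1], [0.0, 0.8]]
B = [[1.0], [0.5]]
x, u, x_ref = [1.0, -1.0], [0.2], [0.0, 0.0]

grad = backprop_to_controls(B, x_ref, model(A, B, x, u))
h = 1e-6
numeric = (squared_error(x_ref, model(A, B, x, [u[0] + h]))
           - squared_error(x_ref, model(A, B, x, [u[0] - h]))) / (2 * h)
print(grad[0], numeric)  # both close to 1.3
```

For a multi-layer network the same backward pass is applied layer by layer, so the full Jacobian never has to be formed explicitly.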
FIG. 2 is very similar to FIG. 1, in a mechanical sense, but it has vastly different properties. In FIG. 2, the terms "HDP+BAC" refer to Heuristic Dynamic Programming (HDP) and the Backpropagated Adaptive Critic (BAC). HDP is a technique for adapting the Critic network, developed in 1968-1972. BAC is the method used to adapt the Action network. HDP+BAC is an architecture for optimal control, which is structurally almost identical to IAC. HDP+BAC can be used in a variety of configurations, involving a combination of real-time learning, offline learning, prior information, etc. In order to implement a full, real-time learning version of HDP+BAC, three networks must be adapted (i.e., the Critic, Model and Action networks must be adapted).
Note that the design here is identical to the IAC design in FIG. 1, except that the tracking error (E) has been replaced by the Critic network. The Model network can be adapted by using any of the various methods previously proposed and used for that task. The Action network can be adapted by replacing equation 8 by:
$$W_{ij}(t+1) = W_{ij}(t) - \alpha \frac{\partial}{\partial W_{ij}} \left( \hat{J}(t+1) \right), \tag{11}$$
where the derivatives are again calculated by backpropagation, in exactly the same way. (To initialize the backpropagation, however, the derivatives of Ĵ are first calculated with respect to its inputs, R(t+1). Those derivatives can be calculated quite easily by backpropagating through the Critic network. In the deterministic case, the distinction between R and R̂ is not so important as it is in the stochastic case [8].) Strictly speaking, FIG. 2 represents a special case of HDP+BAC applicable to the tracking problem. In other words, to implement FIG. 2, the same computer code used to implement FIG. 1 can be used, except that "E" is replaced by "Ĵ."
HDP attempts to adapt the Critic network in such a way that its output, Ĵ, converges to a good approximate solution to the Bellman equation of dynamic programming. This equation may be written fairly generally [8] as:
$$J(\underline{R}(t)) = \max_{\underline{u}(t)} \left\{ U(\underline{R}(t), \underline{u}(t)) + \frac{1}{1+r} \left\langle J(\underline{R}(t+1)) \right\rangle \right\} - U_0, \tag{12}$$
where U is the utility or cost function to be maximized or minimized in the long term, where r is an interest rate parameter used to discount the value of future utility, where the angle brackets ⟨ ⟩ indicate expectation value, and where U0 is a parameter introduced by Howard [28] to extend dynamic programming to the case of an infinite time horizon with r=0. (Of course, for a minimization task, "Max" is replaced by "Min" in equation 12.) When this method is applied to pure tracking problems, as in classical adaptive control, U is chosen to be the tracking error E, and the reference model is treated as a fixed augmentation of the Model network.
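To make equation 12 concrete, the sketch below solves it by repeated substitution (value iteration) on a tiny two-state problem with a finite set of actions. This is a textbook tabular procedure, not HDP itself, and the states, utilities, transition probabilities and the interest rate r = 1 are hypothetical values chosen so that the answer can be checked by hand.

```python
def value_iteration(transitions, utility, r=1.0, u0=0.0, sweeps=200):
    """Iterate J(s) = max_a { U(s,a) + <J(s')> / (1 + r) } - U0.

    transitions[s][a] is a list of (probability, next_state) pairs and
    utility[s][a] is U(s, a); both tables are illustrative assumptions.
    """
    J = [0.0] * len(transitions)
    for _ in range(sweeps):
        J = [
            max(
                utility[s][a]
                + sum(p * J[s2] for p, s2 in transitions[s][a]) / (1.0 + r)
                for a in range(len(transitions[s]))
            ) - u0
            for s in range(len(transitions))
        ]
    return J


# State 0: action 0 stays (U = 0.5); action 1 reaches state 1 half the time (U = 0).
# State 1: action 0 stays (U = 2.0); action 1 returns to state 0 (U = 0).
transitions = [
    [[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
    [[(1.0, 1)], [(1.0, 0)]],
]
utility = [[0.5, 0.0], [2.0, 0.0]]
print(value_iteration(transitions, utility))  # approximately [1.3333, 4.0]
```

In state 0 the best action gives up immediate utility in order to reach the more valuable state 1, which is the kind of long-term tradeoff that a one-step tracking-error criterion cannot express.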
Karl Astrom has perhaps been the world's number one leader in stimulating real-world applications of adaptive control. He has done this in part by developing sophisticated hybrid designs, in order to meet the demanding requirements of these applications. For example, he played a pioneering role in building a link between linear/quadratic optimization and adaptive control. In [8,ch.2], he and McAvoy give a broad overview of efforts by themselves and others to address the general subject of intelligent control. The latest edition of his classic text on adaptive control [45] contains many references to new work in these directions.
The original, classic text by Astrom and Wittenmark [2] discusses three types of adaptive control which overcome many of the restrictive assumptions discussed by Narendra: (1) multiperiod adaptive-predictive control; (2) adaptive pole placement; and (3) linear-quadratic Self-Tuning Regulators (STRs). (Astrom has also previously provided insight into the minimum phase problem. He pointed towards the adaptive pole placement and LQG/STR approaches as the most important ways to overcome this problem.)
The future plans by Ford Motor Company to consider using multiperiod optimization in real time, along with real-time system identification, could be viewed as a possible future application of multiperiod adaptive-predictive control. There are many other links to the most successful applications of artificial neural networks [7]. Methods of this family are of great practical utility, but they (and adaptive pole placement methods) are not “true” adaptive control designs as discussed herein. They are also not plausible as a theory of how the brain actually implements intelligent control. Also, it is hard to imagine a nonlinear stochastic generalization of that design. The discussion in [2] describes the method as an SISO method.
On the other hand, the discussion of linear-quadratic STRs [2] has many close links to the present invention. One version, based on spectral factorization, does not appear relevant to the goals described herein. But the other version, Indirect STR based on Riccati equation, is extremely close. This version is called Algorithm 5.7 in [2], and Algorithm 4.4 in [45]. The new edition of the text is almost identical to the old version in this section. The matrix S(t) in Astrom's equation 5.47 resembles the matrix of Critic weights C discussed herein. The methods used by Astrom and others to update the estimate of S(t) could be viewed as specialized Critic adaptation methods, for use in the linear/quadratic case. Unfortunately, the most reliable update methods, which are based on solving a Riccati equation or performing a spectral factorization, are not true adaptive control methods as defined above. Astrom hints that there have been serious stability problems when other update methods have been attempted. However, he also hints that D. W. Clarke of Oxford and M. Karny of Prague have been more central to this work in recent years. Karny has certainly made significant contributions to the larger area of intelligent systems [46].
Landelius' method of implementing HDP was motivated in part by the most recent work in this literature [47], which he also cites and discusses.
As another approach, Liapunov stability theory in general has influenced huge sections of control theory, physics, and many other disciplines. More narrowly, within the disciplines of control theory and robotics, many researchers have tried to stabilize complex systems by first deriving Liapunov functions for those systems. In some cases, the Liapunov functions have been derived analytically by solving the multi-period optimization problem in an analytic fashion.
Having derived an application-specific Liapunov function, one can then use the design of FIGS. 1 and 2, in which the Liapunov function replaces the square tracking error or J. Theoretically, by replacing the square tracking error with some other pre-specified error measure, new stability properties and new restrictions on the plant are created.
The Liapunov function approach has been particularly useful in the field of robotics [48], where many robot arms obey nonlinear but rigid dynamics. Robert Sanner of the University of Maryland and John Doyle of CalTech are often mentioned in discussions about the successes of this approach. Nevertheless, none of these analytically derived fixed Liapunov functions can be regarded as a universal adaptive controller, as discussed above, any more than IAC itself can.
From a practical point of view, it becomes increasingly difficult to derive such Liapunov functions analytically, as the complexity of the nonlinear systems (e.g., elastic, light-weight and flexible robot arms) increases. (See [7] for an explanation of some of Hirzinger's success in working with such robots. Some of the unpublished work of Fukuda has been even more useful.) The difficulties here are analogous to the difficulty of trying to solve simple algebraic equations analytically. As the order of equations increases, eventually a point is reached where closed-form analytic methods simply cannot provide the solution.
Hundreds of papers have been published by now on various forms of approximate dynamic programming (ADP), adaptive critics and/or reinforcement learning. However, only three of the established, standard methods for adapting Critic networks are directly relevant to the goals herein: (1) Heuristic Dynamic Programming (HDP); (2) Dual Heuristic Programming (DHP); and (3) Globalized DHP (GDHP). The classic “Temporal Difference” (TD) method is essentially a special case of HDP. The recent “two-sample” method [24,50,51] is also relevant, but is discussed later.
As discussed above, the present invention focuses on achieving stability for methods of adapting Critic networks, not for entire adaptive critic control systems. Accordingly, a complete review of the larger question of how to adapt a complete ADP control system is not provided herein. For the concurrent adaptation of Model networks, Action networks and Critic networks, and for practical experience, see [8,14,15,33,52-56].
The only existing, working control designs which meet the criteria/definition for “Model-Based Adaptive Critics” (MBAC) or “brain-like intelligent control” [33] described herein are those based on FIG. 2, or the equivalent of FIG. 2 for DHP, GDHP and their variations. MBAC has been successfully implemented by at least three companies (Ford Motor Co., Accurate Automation Corp., Scientific Cybernetics Inc.), four professors working with graduate students (Wunsch, S. Balakrishnan, Lendaris, W. Tang), and three other individuals (Jameson, Landelius and Otwell). Since the first early implementations in 1993, MBAC has outperformed other modern control and neurocontrol methods in a variety of difficult simulated problems, ranging from missile interception (Balakrishnan) to preventing cars from skidding when driving over unexpected patches of ice (Lendaris). The one physical implementation published to date was also highly successful (Wunsch's solution of Zadeh's “fuzzy ball and beam” challenge). Balakrishnan has performed work on a larger physical implementation (a cantilever plate). The one alternative method which still seems competitive with MBAC, in terms of performance, in these tests, is neural model-predictive control based on BTT.
Roughly speaking, the adaptive critic field of research is a single new research field (albeit still divided into factions) which emerged around 1988-1990 through the unification of several previously separate strands of research. The origins of the field up to that time are summarized in FIG. 3. Within the field itself, the terms “reinforcement learning,” “adaptive critics,” “approximate dynamic programming” and “neurodynamic programming” are normally viewed as approximate synonyms. Nevertheless, the choice of terms also reflects different goals for research within the field and different patterns of interest in related research outside of the field. FIG. 3 only represents the flow of key ideas in this topic prior to 1988-1990: it does not represent patterns of personal association, ideas in other areas, or important recent ideas that will be mentioned later.
The psychologist B. F. Skinner is well-known for the idea that rewards and punishments (“primary reinforcement signals”) determine the behavior of animals including humans. Many of the ADP designs used today in engineering or computer science are also being used as empirical models of animal behavior. Harry Klopf was a major pioneer of that tradition. Klopf, in turn, was responsible for recruiting and funding Barto to explore this strand of research. Barto has continued this tradition through ongoing collaborations with psychologists and neuroscientists.
Widrow never pursued these kinds of connections, but his seminal 1973 paper on adaptive critics [23] was clearly influenced by Skinner's notion of reward and punishment. Skinner's notion of a “secondary reinforcement system” can be viewed as one way of talking about a Critic network.
Von Neumann and Morgenstern [57] invented the concept of the cardinal utility function, which underlies the ADP approach. This work also made possible many new directions in economic theory, ranging from game theory to decision analysis and Bayesian utilitarianism. Working with Richard Bellman, Von Neumann also helped to inspire the development of dynamic programming. In criticizing the neuron models of McCulloch and Pitts, Von Neumann [58,p.451] made two major points: (1) that neuronal signals may be more accurately modeled as continuous variables rather than binary signals in many cases; (2) that the study of mechanisms like learning and memory is more important than efforts to understand how particular static, learned functions can be represented in fixed neural networks. The successful revival of the neural network field in the past decade was based, in large part, on researchers finally embracing these two crucial insights.
On a philosophical level, Skinner and Von Neumann were extremely far apart. Skinner's words about molding and controlling human beings certainly gave encouragement (even if unintentionally) to a variety of ideologies in the spirit of fascism and Communism. The revival of the neural network field in the 1980's was motivated in part by a rejection of Skinner's approach and the resulting search for new paradigms in psychology [59]. Von Neumann, on the other hand, encouraged a higher level of respect for human intelligence, autonomous human decision-making and human potential, from the very beginning. Despite these differences, the most basic views and insights of both men have been fully incorporated into the ongoing research in this field. Different researchers have different attitudes about the deeper philosophical implications [12], even as they use the same mathematics.
The concept of reinforcement learning as a pathway to the construction of intelligent systems has often been credited to the great pioneers of artificial intelligence (AI), Newell, Shaw and Simon, and to Marvin Minsky [60]. They proposed the development of machines that learn over time to maximize some measure of reward or reinforcement. They proposed that such a machine, simply by learning through experience, could gradually develop higher-order intelligence as a kind of emergent phenomenon. The earliest attempts to implement this idea were based more on brute-force stochastic search than on optimization theory, and the results were somewhat disappointing. Samuel's classic checkers-playing program [60] has been interpreted, in retrospect, as a kind of adaptive critic system. His adaptive “static position evaluator” served as a kind of Critic. More recently, Tesauro's master-class backgammon program [24,55] has clearly demonstrated that reinforcement learning systems can generate intelligent behavior.
In 1968, it was argued [20] that reinforcement learning could be used as a foundation for understanding intelligence in the brain. That work described the concept of reinforcement learning in its modern form (illustrated in FIG. 4). In this concept, a system learns a strategy of action that tries to maximize the long-term future expected value of utility, following the concepts of Von Neumann. For the first time, it was pointed out that a machine could be built to perform reinforcement learning simply by trying to approximate dynamic programming as formulated by Howard [28].
In 1972, the Werbos Harvard Ph.D. thesis proposal included a flowchart virtually identical to FIG. 2, with a detailed discussion of how to adapt the components of the system. However, the actual thesis focuses on a rigorous, generalized formulation of backpropagation as such, along with a demonstration of its effectiveness in system identification and political forecasting and a minimal discussion of neural networks [26,61]. (The thesis is reprinted in its entirety in [26].) Originally there were concerns about the learning speed of HDP, as one scales up to larger problems [8,22]. From 1977 to 1981, the ideas of HDP and of two more sophisticated methods, DHP and GDHP, were published addressing these scaling problems [25,62,63]. In particular, [63] discussed the idea of using a neural network trained by backpropagation in order to approximate the J function of dynamic programming.
In the meantime, on a totally independent basis, Bernie Widrow [23] published the first working, successful implementation of an adaptive critic, using a neural network as a Critic network. His 1973 paper is the original source of the term “Critic.” However, the Critic adaptation method that he used was not a true real-time learning method and has not been used elsewhere. In 1983, Barto, Sutton and Anderson [21] published a classic paper which became extremely famous in the 1990's. They implemented an adaptive critic system consisting of two adaptive elements or neurons. One element was a Critic, trained by a method which Sutton calls a Temporal Difference (TD) method. This method was developed on the basis of intuitive arguments about how to improve upon Widrow's algorithm. The method used in that paper was a very narrow special case of HDP. Later, in a famous 1988 paper [64], Sutton expanded the concept of temporal difference methods to include “TD(λ)” for λ in the interval [0,1]. TD(1) essentially reproduces the old Widrow method, and TD(0) is a generalization of the method in [21]. Very few papers actually use TD(λ) for λ other than zero (see [15,24,55]). The other element of the system in [21] was a kind of Action network, trained by an associative reward/punishment algorithm called Arp, which is not a method for adapting Critics as such.
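For illustration, the role of the parameter λ can be seen in a lookup-table Critic adapted with a fixed controller. The following Python sketch uses accumulating eligibility traces, which is one common textbook formulation of TD(λ) and is not taken from [21] or [64]. The three-step plant, the learning rate, the use of the discount factor 1/(1+r) from equation (12), and all names are assumptions made for this example.

```python
# Illustrative sketch only: lookup-table TD(lambda) with accumulating traces.
# The plant, learning rate and all names here are hypothetical.

def td_lambda(episodes, n_states, lam, r=0.0, lr=0.1):
    """Adapt a lookup-table Critic J from recorded transitions.

    Each episode is a list of (state, utility, next_state) tuples, with
    next_state set to None at the end of the episode.
    """
    discount = 1.0 / (1.0 + r)
    J = [0.0] * n_states
    for episode in episodes:
        trace = [0.0] * n_states  # eligibility of each state for credit
        for state, utility, next_state in episode:
            next_value = 0.0 if next_state is None else J[next_state]
            # Temporal difference error for this transition.
            delta = utility + discount * next_value - J[state]
            # Decay old eligibilities, then mark the state just visited.
            trace = [discount * lam * e for e in trace]
            trace[state] += 1.0
            for i in range(n_states):
                J[i] += lr * delta * trace[i]
    return J


# Hypothetical plant: state 0 -> state 1 -> end, utility 1 on each step.
ONE_EPISODE = [(0, 1.0, 1), (1, 1.0, None)]
EPISODES = [ONE_EPISODE] * 2000

J_TD0 = td_lambda(EPISODES, n_states=2, lam=0.0)
J_TD1 = td_lambda(EPISODES, n_states=2, lam=1.0)
print(J_TD0, J_TD1)
```

With λ=0 only the state just visited is corrected, whereas with λ=1 every earlier state in the episode shares in each correction. In this deterministic toy problem both settings approach the same values, J(0)=2 and J(1)=1; the differences between them matter mainly for learning speed and for noisy problems.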
In January 1987 [65], a paper on ADP, GDHP, potential engineering applications and links to neuroscience was published, once again emphasizing the connection between reinforcement learning and dynamic programming. This paper led to major efforts by Barto and Sutton to follow up on this new connection to dynamic programming. Barto encouraged the early efforts in this area by Bertsekas [24], who, along with Tsitsiklis, has substantially enlarged the concept of TD. The theory behind TD has been expanded considerably by these and other efforts; however, the method itself remains a special case or subset (proper or otherwise) of HDP.
Offline simulation or “dreaming” has also been discussed as a way of improving the performance of ADP systems (as previously discussed in [65]). (Current evidence from psychology is consistent with this as a partial interpretation of dreams in human beings as well [66].) This was the basis of the “Dyna” architecture first presented by Sutton in [22]. This kind of offline learning is extremely important; however, the present invention is more suited to real-time systems.
The discussions of 1987 also contributed to the arrangement of an NSF workshop on neurocontrol, held in New Hampshire in 1988, which resulted in [22]. That workshop and book helped to bring together neurocontrol and adaptive critics as organized fields of research. FIG. 3.1 of that book illustrated the modern concept of reinforcement learning in a more picturesque way, equivalent to FIG. 4 above. Earlier forms of “reinforcement learning” that did not address the mathematical problem of maximizing utility (or minimizing cost) over time [67] did not help to carry through the control strategy given above with reference to equation (12).
More recently, “Action-Dependent Adaptive Critics” (ADAC) have been discussed. This generally includes Q-learning, ADHDP, ADDHP and ADGDHP [8,14], all of which are closely related. In fact, many of the “new” designs for “extended” or “modified” or “policy” Q-learning are actually implementations of ADHDP, which was reviewed at length in 1992 [14]. That book also reported a successful implementation at McDonnell-Douglas in the manufacturing of composite parts. It reported a successful simulated application to the control of damaged aircraft, which later led to a large related program at NASA Ames. All of those methods use Critic networks which input not only R(t) but also u(t). Instead of approximating the J function of dynamic programming, they approximate a related function. When adapting a Critic where the controller or Action network is held fixed, the action-dependent methods for Critic adaptation reduce to their older, action-independent equivalents.
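As a minimal sketch of what it means for a Critic to take both R(t) and u(t) as inputs, the following Python fragment adapts a lookup table indexed by state and action on a small deterministic plant. The update shown is a simple tabular rule in the spirit of Q-learning and is offered only as an assumption-laden example; it is not the ADHDP procedure of [14]. The symbol Q for the action-dependent function, the toy plant, the learning rate and all other names are hypothetical.

```python
# Illustrative sketch only: a lookup-table action-dependent Critic.
# The toy plant, the symbol Q and all names here are hypothetical.

STATES = [0, 1]
ACTIONS = ["stay", "move"]
NEXT_STATE = {(0, "stay"): 0, (0, "move"): 1,
              (1, "stay"): 1, (1, "move"): 0}


def toy_utility(R, u):
    """State 1 pays utility 1 per step; state 0 pays nothing."""
    return 1.0 if R == 1 else 0.0


def adapt_action_dependent_critic(r=0.25, lr=0.5, sweeps=300):
    """Sweep over all (state, action) pairs, moving Q toward its target."""
    discount = 1.0 / (1.0 + r)
    Q = {(R, u): 0.0 for R in STATES for u in ACTIONS}
    for _ in range(sweeps):
        for (R, u) in Q:
            R_next = NEXT_STATE[(R, u)]
            # Value of acting as well as possible from the next state on.
            best_next = max(Q[(R_next, u2)] for u2 in ACTIONS)
            target = toy_utility(R, u) + discount * best_next
            Q[(R, u)] += lr * (target - Q[(R, u)])
    return Q


Q_TABLE = adapt_action_dependent_critic()
# The action-independent J is recovered by maximizing over the action input.
J_FROM_Q = {R: max(Q_TABLE[(R, u)] for u in ACTIONS) for R in STATES}
print(Q_TABLE, J_FROM_Q)
```

In this example the table settles near Q(1,stay)=5, Q(1,move)=4.2, Q(0,move)=4 and Q(0,stay)=3.2, and maximizing over the action recovers J(0)=4 and J(1)=5, the solution of equation (12) for the same plant with r=0.25 and U_0=0.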
Another family of reinforcement learning methods called Alopex has also received attention. The most effective version of Alopex appears to have been developed by Tzanakou of Rutgers. In that version, the Critic network is an action-dependent system, adapted by a procedure quite similar to ADHDP or Q-learning. Most of Klopf's final designs, in collaboration with Leemon Baird, have the same property. Although Q-learning and ADHDP appear less relevant, the two-sample method [50] discussed later is a significant exception.
In certain special cases, the same mathematics that underlie FIG. 2 can be implemented without a Model network as such. For example, in a plant with a single state variable, it is enough to know the sign of (∂R(t+1)/∂u(t)) in order to compute the derivatives of Ĵ(t+1) with respect to the weights, to within a scalar factor which effectively just modifies the learning rate α. This kind of effect has permitted the development of architectures similar to FIG. 2, but simpler, with comparable performance in certain special cases. (For example, see the work by Lewis [68] and by Berenji [69].)
More recently, Error Critics and a multiple-model hierarchical decision system, both motivated by recent findings from neuroscience [8-10,13,32,46], have been discussed. The hierarchical decision system is a higher-level learning design, which essentially requires ordinary Critic networks as subsystems.
Issues of stability and convergence have been a major concern of the adaptive critic community for a long time. Nevertheless, formal results about MBAC systems have been relatively sparse. There have been the Ph.D. theses of Prokhorov [15] and of Landelius [43], and unpublished results from Richard Saeks of Accurate Automation. Saeks' work focuses on whole-system stability issues, rather than Critic adaptation as such. Whole-system stability is the major concern in [15] and [43], but they also consider Critic adaptation as such. Landelius mainly proves stability and convergence for his own Critic implementations. Prokhorov [15] proves a whole-system stability result for an offline learning version of HDP+BAC in the deterministic case. He also reports considerable empirical testing of many approaches, as will be discussed below.
Tsitsiklis and Van Roy [73] have reviewed a large number of papers from the literature on lookup-table Critics, showing many specific examples where TD, a special case of HDP, becomes unstable. However, they prove that TD with lookup tables is stable, under certain restrictive assumptions. The key assumption is that the training examples x(t) must be taken from the probability distribution implied by the Markov chain being studied. Stability against arbitrary training sequences is not guaranteed, even in the case of lookup table Critics.
In addition to these important but sparse theoretical results, considerable practical energy has gone into the effort to make these methods converge, for a variety of different test problems. This has been a very important practical issue, even in the context of offline learning. Some of this empirical work focused on the interplay between Action networks, Model networks and Critics. Narendra in [41] addresses related issues. But this paper focuses on Critic networks as such.
Among the tricks that are relevant to Critics as such, which have proven to be very useful in obtaining convergence, are:
(1) “Shaping” [7,8,453]. In shaping, one first adapts a network to solve one class of control problems, and then uses the resulting weights as the initial values of the weights of a network trained to solve similar but more difficult problems. When this method is applied over and over again, it provides a way of doing “step-by-step learning” (not the same as real-time or incremental learning!) analogous to the way that humans really learn to perform difficult tasks.
(2) Interest rate management. In this approach, one starts out with very large values of r, and then one gradually decreases r to zero. (Usually r=0 represents our true values, if the utility function is properly articulated [12,74].) Strictly speaking, this is just a special case of shaping, because different values of r represent different optimization problems. Large values of r represent shorter-term optimization problems, which are usually easier to solve than the long-term problems.
(3) Utility function management. This can also be viewed as a form of shaping, in a way. In some cases, the choice of utility functions has even been used as a way of cheating, of injecting prior knowledge about the desired controller into the adaptive system. However, for complex reasons [10,11,674], it does make sense to think of a hybrid man-machine control system, in which the human (as the upper controller) passes on something like his learned J function to the lower controller (the machine), which treats those inputs as its fundamental values (U). As a practical example, when using gradient-based MBAC methods to search for optimal energy-saving “chaos controllers,” it makes sense to use a relatively flat cost function U, representing energy consumption, in some large acceptable region of state space, and then to add a gradually rising penalty function (e.g., quadratic) for points in some “buffer zone” between that region and regions which are truly dangerous.
These three tricks have been successful in a very wide range of studies. Also, many researchers in ADP for discrete state space reported similar results, in discussions organized by Sridhar Mahadevan of the University of South Florida, in connection with an NSF workshop in April 1996 on reinforcement learning oriented towards AI.
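The tricks listed above can be sketched in a few lines of Python. The fragment below is a hedged example under stated assumptions, not a prescribed implementation: the linear schedule for r, the scalar state x, the symmetric safe region, and the names interest_rate_schedule, train_with_shaping, adapt_critic and buffered_cost are all hypothetical choices made for this example.

```python
# Illustrative sketch only: interest rate management and a buffered cost.
# The schedule shape, the scalar state and all names are hypothetical.

def interest_rate_schedule(r_start, n_stages):
    """Return n_stages interest rates falling linearly from r_start to 0."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    last = n_stages - 1
    return [r_start * (last - i) / last for i in range(n_stages)]


def train_with_shaping(adapt_critic, initial_weights, schedule):
    """Adapt the Critic once per stage, reusing weights between stages.

    adapt_critic(weights, r) is a user-supplied (hypothetical) routine that
    adapts the Critic for interest rate r and returns the new weights.
    """
    weights = initial_weights
    for r in schedule:
        weights = adapt_critic(weights, r)
    return weights


def buffered_cost(x, energy, safe_radius, penalty_weight):
    """Flat energy cost inside |x| <= safe_radius, quadratic rise outside."""
    excess = max(0.0, abs(x) - safe_radius)
    return energy + penalty_weight * excess ** 2


SCHEDULE = interest_rate_schedule(r_start=1.0, n_stages=5)
print(SCHEDULE)
print(buffered_cost(0.5, energy=1.0, safe_radius=1.0, penalty_weight=10.0))
print(buffered_cost(1.5, energy=1.0, safe_radius=1.0, penalty_weight=10.0))
```

Each stage of train_with_shaping starts from the weights produced by the previous, easier stage, which is what makes interest rate management a special case of shaping. The cost function stays flat throughout the acceptable region and rises only in the buffer zone, so that a gradient-based method receives a usable signal before the state reaches a truly dangerous region.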
It is an object of the present invention to provide a more stable control system design than previously available in the prior art.
This and other objects of the present invention are achieved by providing a new adaptive control system, implemented in either hardware or software (or as a hybrid of both). The control system (and its corresponding method) is designed to enhance the stability of the response of a controlled application (e.g., a factory, an airplane, a missile interception system, or a financial system).