1. Field of the Invention
The present invention generally relates to intelligent control systems, and more particularly, in one exemplary aspect, to computer apparatus and methods for implementing an adaptive critic within, e.g., an adaptive critic design framework.
2. Description of Related Art
So-called “intelligent control” is a class of control techniques that utilize various computing approaches from artificial intelligence, including neural networks, Bayesian probability, fuzzy logic, machine learning, evolutionary computation and genetic algorithms (see White, D. and Sofge, D. (Eds.) (1992) Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold, N.Y., incorporated herein by reference in its entirety). Intelligent controllers are finding increasing use today in complex systems control applications, such as for example autonomous robotic apparatus for navigation, perception, reaching, grasping, object manipulation, etc. (see Samad T. (Ed.) (2001) Perspectives in Control: New Concepts and Applications, IEEE Press, N.J., incorporated herein by reference in its entirety).
Typically, intelligent controllers need to infer a relationship between the control signals (generated by the controller) and the operational consequences upon a controlled apparatus (also referred to as “the plant”), which are described by changes of the plant state. Various learning methods are often used by intelligent controllers in order to approximate such relationships (see White and Sofge discussed supra; and Samad, T. (Ed.) (2001) Perspectives in Control Engineering. New York, each incorporated herein by reference in its entirety). By way of example, controllers used in tracking applications (such as, for example, robotic arms welding or painting car pieces along a predefined trajectory, mobile robots following predefined paths, etc.) aim to ensure that the plant state follows a desired trajectory (the target state trajectory) as closely as possible. In order to achieve trajectory tracking, the controller modifies control parameters (such as, e.g., control gains) so as to minimize the error between the target plant state (such as, for example, a desired robot configuration) and the actual (observed) plant state (such as, for example, an actual robot configuration) at every time instant. Performance of such controllers is typically quantified either by the magnitude of the tracking error, or by certain monotonic functions of the tracking error that are minimized when the error between the target and the actual state is minimized. Such functions are commonly referred to as performance measures (see Goodwin G. (2001). Control System Design. Prentice Hall, incorporated herein by reference in its entirety).
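A quadratic function of the tracking error is one common choice of such a performance measure. The following sketch (illustrative only; the function name and the quadratic form are assumptions, not part of any referenced design) computes this measure for a single time instant:

```python
def tracking_error_measure(x_target, x_actual):
    """Quadratic performance measure J = ||x_d - x||^2 at one time instant.

    A monotonic function of the tracking error: it is minimized exactly
    when the actual plant state matches the target state.
    """
    # Sum of squared per-coordinate errors between target and actual state.
    return sum((d - a) ** 2 for d, a in zip(x_target, x_actual))

# A perfectly tracked state yields zero cost; any deviation increases it.
print(tracking_error_measure([1.0, 2.0], [1.0, 2.0]))  # 0.0
print(tracking_error_measure([1.0, 2.0], [0.5, 2.0]))  # 0.25
```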
Adaptive critic design (ACD) is a class of adaptive algorithms for intelligent control applications that is suitable for learning in noisy, nonlinear, and non-stationary dynamic systems. A family of ACD algorithms was proposed by Werbos P. J. (1992) in “Approximate dynamic programming for real-time control and neural modeling”. Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, White D. A. and Sofge D. A. discussed supra, incorporated herein by reference in its entirety, as an optimization technique.
The ACD methods utilize two parametric blocks, commonly referred to as the controller (or the “actor”) and the “critic”, as described below with respect to FIG. 1. The actor implements various parameterized control rules and generates a control signal. In a typical setup, the critic estimates a function that captures the effect that the control law will have on the control performance over multiple time periods into the future. Based on this estimate, the critic sends a reinforcement signal to the actor. The actor uses the reinforcement signal to modify its control parameters in order to improve the control rules so as to minimize the performance measure, which is equivalent to minimizing the control errors. More details on various adaptive critic designs can be found in: Werbos P. J. (1992), or Prokhorov D. V. and Wunsch D. C. (1997) Adaptive Critic Designs, IEEE Trans. Neural Networks, vol. 8, No. 5, pp. 997-1007, each incorporated herein by reference in its entirety.
Referring now to FIG. 1, one typical implementation of an intelligent control apparatus according to the prior art is described. The control apparatus 100 comprises a control block comprising a critic module 108 and an actor block 104, configured to control operation of a plant 106, such as, for example, a robotic apparatus, a heating, ventilation and air conditioning (HVAC) plant, an autonomous vehicle, etc.
The control apparatus 100 is configured to perform control tasks in order to achieve the desired (target) plant state. The control apparatus 100 receives the input desired state signal xd(t) (such as for example, a reference position of a robot, or a desired temperature for the heating ventilation and air conditioning (HVAC) system), and it produces a current plant state signal x(t). The control apparatus may further comprise a sensing or state estimation apparatus (not shown) that is used to provide a real-time estimate of the actual plant state x(t). The state signal x(t) typically represents full or partial dynamical state of the plant (e.g. current draw, motor position, speed, acceleration, temperature, etc.). In one variant, such as applicable to control applications where the full plant state x(t) is not available directly, various estimation methods may be used for state estimation, such as those described by Lendek Z. S., Babuska R., and De Schutter B. (2006) State Estimation under Uncertainty: A Survey. Technical report 06-004, Delft Center for Systems and Control Delft University of Technology, incorporated herein by reference in its entirety.
The control blocks 104, 108 of the control apparatus 100 receive the target state input xd(t) via the pathways 102 and implement various parameterized control rules in order to generate the control signal u(t) (comprising, for example, vehicle speed/direction in a position tracking application, or heater power/fan speed in an HVAC application). The control signal is provided to the plant 106 via the pathway 110, and is configured to move the current plant state x(t) towards the desired state (the target state) xd(t). The control system 100 implements feedback control (closed-loop control), where the feedback signal x(t) is provided via the signaling lines 112 to the actor block. Alternatively, the control system 100 implements open-loop control (feed-forward control), in which case no feedback signal from the plant state to the actor is present. Other implementations exist, such as a combination of open-loop and closed-loop control schemes.
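The closed-loop configuration described above can be sketched as follows. This is a minimal illustration with a hypothetical one-dimensional plant that simply integrates the control signal, and a proportional control rule standing in for the actor; neither is prescribed by the prior-art apparatus:

```python
def run_closed_loop(x0, x_target, gain=0.5, steps=20):
    """Minimal closed-loop (feedback) control sketch.

    The 'actor' is a proportional rule u(t) = gain * (x_d - x(t));
    the 'plant' integrates the control signal. Both are illustrative
    assumptions for a one-dimensional example.
    """
    x = x0
    for _ in range(steps):
        u = gain * (x_target - x)  # control signal from the feedback x(t)
        x = x + u                  # plant state moves under the control
    return x

print(run_closed_loop(0.0, 1.0))  # the state approaches the target 1.0
```

With 0 < gain < 1, the tracking error shrinks by the factor (1 − gain) at every step, so the state converges geometrically to the target.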
The critic block 108 receives the target state input xd(t) via the pathway 102, the control signal u(t) via the pathway 110, and the current state signal x(t) via the pathway 112_1. The critic block 108 is configured to estimate the control performance function V(t), also referred to as the “cost-to-go”, that is typically defined for discrete systems as follows:

V(t) = Σ_{k=0}^{N} γ^k J(t+k)   (Eqn. 1)

where:
γ is a discount factor for finite horizon problems (0<γ<1);
k is the time step index; and
J(t) is the performance measure (also known as a utility function or a local cost) at time t.
For continuous systems, the summation operation in Eqn. 1 is replaced by an integral, and the term γk is replaced with an exponential function.
The ‘cost-to-go’ function V(t) captures the effect that the control rules (implemented by the actor block 104) have on the control performance of the control apparatus 100 over a predetermined period of time into the future.
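Given a sequence of performance-measure values, the discrete cost-to-go of Eqn. 1 is a direct discounted sum (a sketch; the function name is illustrative):

```python
def cost_to_go(J, gamma):
    """V(t) = sum_{k=0}^{N} gamma^k * J(t+k)  (Eqn. 1).

    J is the sequence of performance-measure values J(t), ..., J(t+N);
    gamma (0 < gamma < 1) discounts cost further into the future.
    """
    return sum((gamma ** k) * j for k, j in enumerate(J))

print(cost_to_go([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```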
At every time step k, the critic block 108 provides ‘guidance’ to the actor block 104 through the reinforcement signal R(t), delivered via the pathway 114. For discrete time systems, the reinforcement signal R(t) is typically defined based on the current estimate of the cost function V(t) and a prior estimate of the cost function V(t−1) as follows:

R(t) = J(t) + γV(t) − V(t−1)   (Eqn. 2)
where γ is the same constant parameter as in Eqn. 1. More details on this methodology can be found, e.g., in R. S. Sutton and A. G. Barto (1998), Reinforcement Learning: An Introduction. MIT Press, incorporated herein by reference in its entirety.
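Eqn. 2 translates directly into code (function and argument names are illustrative):

```python
def reinforcement_discrete(J_t, V_t, V_prev, gamma):
    """R(t) = J(t) + gamma*V(t) - V(t-1)  (Eqn. 2).

    Combines the current performance measure with the discounted current
    value estimate and the one-step-delayed value estimate.
    """
    return J_t + gamma * V_t - V_prev

# J(t) = 1.0, V(t) = 2.0, V(t-1) = 2.5, gamma = 0.9
print(reinforcement_discrete(1.0, 2.0, 2.5, 0.9))
```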
For continuous time systems, the reinforcement signal is calculated as:

R(t) = J(t) − βV(t) + V̇(t)   (Eqn. 3)

where β is a constant parameter and V̇(t) is the time derivative of V(t). More details on the continuous time version are provided in: Kenji Doya (2000), Reinforcement Learning in Continuous Time and Space, Neural Computation, 12:1, 219-245, incorporated herein by reference in its entirety.
Typically, the actor block 104 has no a priori knowledge of the dynamic model of the plant 106. Based on the reinforcement signal R(t), the actor block 104 modifies its control parameters (such as, for example, a gain) in order to generate the control signal u(t) that minimizes the cost-to-go function. For example, in the case of trajectory tracking or set-point control tasks, the minimization of the cost-to-go function corresponds to minimizing the cumulative error between the target and the actual plant state, computed as the plant progresses along the control trajectory towards the target state.
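A model-free parameter adaptation of the kind described above can be illustrated by a simple perturb-and-keep scheme: the actor perturbs its control gain and retains the perturbation whenever the observed cost drops. This is an illustrative stand-in for reinforcement-driven updates, not a specific ACD algorithm; the function names and the Gaussian perturbation are assumptions:

```python
import random

def adapt_gain(evaluate_cost, gain, sigma=0.05, trials=50, seed=0):
    """Model-free actor adaptation sketch (perturb-and-keep).

    The actor knows nothing about the plant; it perturbs its control
    gain and keeps a perturbation only when the observed cost improves.
    """
    rng = random.Random(seed)
    best_cost = evaluate_cost(gain)
    for _ in range(trials):
        candidate = gain + rng.gauss(0.0, sigma)  # random perturbation
        cost = evaluate_cost(candidate)
        if cost < best_cost:                      # keep only improvements
            gain, best_cost = candidate, cost
    return gain

# Hypothetical cost surface: quadratic bowl with minimum at gain = 0.8
cost = lambda g: (g - 0.8) ** 2
print(adapt_gain(cost, 0.2))  # the gain drifts toward the minimizer
```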
FIG. 2 illustrates a typical implementation of the critic module 108 for discrete systems in the considered class of control systems (i.e., for set-point control or trajectory tracking tasks) according to the prior art. As shown in FIG. 2, the critic apparatus 108 (denoted by a dashed rectangle) comprises the performance measure block 230, the value estimator block 220, the adjustment block 224, and the delay block 226. The performance measure block 230 calculates the performance measure J(t) given the desired plant state xd(t) provided through pathway 202_1 and the actual plant state x(t) provided through pathway 212_1.
The value estimator block 220 in FIG. 2 receives the desired plant state signal xd(t) provided through pathway 202, the actual plant state signal x(t) provided through pathway 212, the control signal u(t) provided through the pathway 210, and reinforcement signal R(t) provided through the pathway 214. At each discrete step of the critic block operation, the value estimator block 220 generates a cost-to-go value signal V(t), based on the received inputs. The reinforcement signal R(t) is used by the value estimator block 220 to modify internal value estimator parameters as described in, e.g., Werbos; White D. A. and Sofge D. A.; or Prokhorov D. V and Wunsch D. C., discussed supra.
The adjustment block 224 (denoted by the γ symbol in FIG. 2) receives the value signal V(t) through the pathway 232 and produces the discounted value signal γV(t). The delay block 226 (denoted by the symbol z−1 in FIG. 2) receives the value signal V(t) through the pathway 232_1 and produces the value signal delayed by one simulation step, denoted as V(t−1) in FIG. 2. The reinforcement signal R(t) is then calculated by the computation block 240 as defined by Eqn. 2, given the performance signal J(t) provided through the pathway 218, the discounted value γV(t) produced by the adjustment block 224, and the delayed value signal V(t−1) provided by the delay block 226.
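The discrete-time critic of FIG. 2 can be sketched as a single object whose internal state plays the role of the z−1 delay block. The class name and the toy value/performance functions below are illustrative assumptions:

```python
class DiscreteCritic:
    """Sketch of the discrete-time critic of FIG. 2.

    value_fn stands in for the value estimator block; gamma is the
    adjustment block; the stored previous value acts as the z^-1 delay
    block; Eqn. 2 is the final computation block.
    """
    def __init__(self, value_fn, performance_fn, gamma):
        self.value_fn = value_fn
        self.performance_fn = performance_fn
        self.gamma = gamma
        self.v_prev = 0.0  # delayed value V(t-1)

    def step(self, x_d, x, u):
        J = self.performance_fn(x_d, x)       # performance measure block
        V = self.value_fn(x_d, x, u)          # value estimator block
        R = J + self.gamma * V - self.v_prev  # Eqn. 2
        self.v_prev = V                       # update the z^-1 delay block
        return R

critic = DiscreteCritic(
    value_fn=lambda x_d, x, u: abs(x_d - x),       # toy value estimate
    performance_fn=lambda x_d, x: (x_d - x) ** 2,  # quadratic J
    gamma=0.9,
)
print(critic.step(1.0, 0.0, 0.5))  # J=1, V=1, V(t-1)=0, so R = 1.9
```

A real critic would additionally use R(t) to adapt the internal parameters of the value estimator, as described in the references cited above.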
FIG. 2A illustrates a typical implementation of the critic block for continuous time control systems according to the prior art. As shown in FIG. 2A, the critic apparatus 208 (depicted by a dashed rectangle) comprises the value block 220, the performance measure block 230, the β-block 244 and the derivative block 248 (denoted in FIG. 2A as d/dt). The value block 220 and the performance measure block 230 receive and produce the same signals and operate similarly to the discrete-time control apparatus described with respect to FIG. 2, supra. The β-block 244 of FIG. 2A receives the value signal V(t) through the pathway 232 and produces the discounted value signal βV(t). The derivative block 248 receives the value signal V(t) through the pathway 232_1, and produces the temporal derivative V̇(t) of the value signal. The reinforcement signal R(t) is calculated by the computation block 250 according to the relationship defined in Eqn. 3, given the performance signal J(t) provided through the pathway 218, the discounted value βV(t) produced by the β-block 244, and the temporal derivative V̇(t) of the value signal provided by the derivative block 248.
Traditional ACD approaches, such as those described with respect to FIGS. 2 and 2A, suffer from several shortcomings, such as for example requiring estimates of future performance of the control system. Such predictions invariably are of limited accuracy, and often suffer from the problem of the “curse of dimensionality”; i.e., when system control solutions become unattainable as the number of the variables governing the dynamic system model increases.
Accordingly, there is a salient need for an adaptive critic design apparatus and associated methods that aim at optimizing control rules without the foregoing limitations; e.g., that are based on the observed present and past control system performance.