Reinforcement learning (RL) is a class of methods used in machine learning to methodically modify the actions of an agent based on observed responses from its environment. RL can be applied where standard supervised learning is not applicable, and requires less a priori knowledge. In view of the advantages offered by RL methods, a recent objective of control system researchers is to introduce and develop RL techniques that result in optimal feedback controllers for dynamical systems that can be described in terms of ordinary differential equations. This class covers most human-engineered systems, including aerospace systems, vehicles, robotic systems, electric motors, and many classes of industrial processes.
Optimal control is generally an offline design technique that requires full knowledge of the system dynamics; for example, in the linear system case, one must solve a Riccati equation. Adaptive control, on the other hand, is a body of online methods that use data measured along system trajectories to learn to compensate for unknown system dynamics, disturbances, and modeling errors, and thereby to provide guaranteed performance. Optimal adaptive controllers have been designed using indirect techniques, whereby the unknown machine is first identified and a Riccati equation is then solved. Inverse adaptive controllers have also been provided that optimize a performance index that is meaningful but not of the designer's choosing.
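As an illustration of the linear case mentioned above, the following is the standard continuous-time linear quadratic formulation. The symbols A, B, Q, R, and P are notation assumed here for illustration and do not appear in the original text. The sketch shows why full model knowledge is needed: the matrices A and B enter the Riccati equation directly.

```latex
% Assumed linear time-invariant dynamics and quadratic cost (illustrative notation)
\dot{x}(t) = A x(t) + B u(t), \qquad
J = \int_0^{\infty} \left( x^\top Q x + u^\top R u \right) dt, \quad Q \succeq 0,\; R \succ 0

% Algebraic Riccati equation, solved for P = P^\top \succeq 0
A^\top P + P A - P B R^{-1} B^\top P + Q = 0

% Resulting optimal state-feedback law
u^*(t) = -R^{-1} B^\top P \, x(t)
```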
Direct adaptive controllers that converge to optimal solutions for unknown systems are generally underdeveloped. However, various policy iteration (PI) and value iteration (VI) methods have been developed to solve the Hamilton-Jacobi-Bellman (HJB) equation associated with the optimal control problem online. Notably, such methods require measurement of the entire state vector of the dynamical system to be controlled.
For example, PI refers to a class of methods built as a two-step iteration: policy evaluation and policy improvement. Instead of attempting to solve the HJB equation directly, PI starts by evaluating the cost/value of a given initial admissible (stabilizing) controller. The cost associated with this policy is then used to obtain a new, improved control policy (i.e., a control policy that will have a lower associated cost than the previous one). This is often accomplished by minimizing a Hamiltonian function with respect to the new cost. The resulting policy is thus obtained based on a greedy policy update with respect to the new cost. These two steps of policy evaluation and policy improvement are repeated until the policy improvement step no longer changes the policy, at which point convergence to the optimal controller is achieved. One must note that the infinite horizon cost associated with a given policy can only be evaluated in the case of an admissible control policy, meaning that the control policy must be stabilizing.
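The two-step iteration described above can be sketched for the simplest possible case: a scalar linear system dx/dt = a*x + b*u with cost integrand q*x^2 + r*u^2 and policy u = -k*x. This is a minimal model-based sketch under those assumptions, not the method of any particular reference; the function name and all parameter names are hypothetical. Policy evaluation reduces to a scalar Lyapunov equation, and policy improvement is the greedy minimizer of the Hamiltonian.

```python
import math


def policy_iteration_scalar_lqr(a, b, q, r, k0, tol=1e-12, max_iter=50):
    """Policy iteration for dx/dt = a*x + b*u, cost integrand q*x^2 + r*u^2.

    The policy is u = -k*x and its value function is V(x) = p*x^2.
    Returns the final gain, its cost coefficient, and the cost history.
    """
    # The infinite-horizon cost is finite only for a stabilizing gain.
    if a - b * k0 >= 0:
        raise ValueError("initial gain is not stabilizing (not admissible)")
    k = k0
    costs = []
    for _ in range(max_iter):
        # Policy evaluation: solve 2*(a - b*k)*p + q + r*k^2 = 0 for p.
        p = (q + r * k * k) / (2.0 * (b * k - a))
        costs.append(p)
        # Policy improvement: minimize q*x^2 + r*u^2 + 2*p*x*(a*x + b*u) over u.
        k_new = b * p / r
        converged = abs(k_new - k) < tol
        k = k_new
        if converged:
            break
    return k, p, costs


a, b, q, r = 1.0, 1.0, 1.0, 1.0
k, p, costs = policy_iteration_scalar_lqr(a, b, q, r, k0=2.0)

# Closed-form stabilizing root of the scalar Riccati equation, for comparison.
p_star = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
print(p, p_star)
```

In this sketch each evaluated cost is no larger than the previous one, and the iteration stops once the improvement step no longer changes the gain. Starting from a non-stabilizing gain raises an error, reflecting the admissibility requirement.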
Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system's internal states, which is usually not available in practical situations. Indeed, although various control algorithms (e.g., state feedback) require full state knowledge, in practical implementations, taking measurements of the entire state vector is not feasible.
The state vector is generally estimated from partial information about the system, obtained by measuring the system's outputs. However, state estimation techniques require a known model of the system dynamics. Unfortunately, in some situations, it is difficult to design and implement optimal state estimators because the system dynamics are not exactly known.
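The dependence of state estimation on a known model can be illustrated with a minimal sketch of a Luenberger observer, a standard estimator that is swapped in here for illustration and is not named in the original text. The plant is assumed to be a discrete-time double integrator with only the position measured; the function name, gains, and the `model_decay` parameter are hypothetical.

```python
def simulate_observer(model_decay, steps=60, dt=0.1):
    """Return the final estimation error of a Luenberger observer.

    True plant (assumed): pos += dt*vel, vel unchanged; only pos is measured.
    model_decay is the velocity coefficient the observer believes in;
    1.0 matches the true plant, any other value is a modeling error.
    """
    # Gains chosen so that, with the exact model, both error poles sit at 0.5.
    l1, l2 = 1.0, 0.25 / dt
    pos, vel = 0.0, 1.0          # true state, not available to the observer
    pos_hat, vel_hat = 0.5, 0.0  # deliberately wrong initial estimate
    for _ in range(steps):
        innovation = pos - pos_hat  # measured output minus predicted output
        # Observer update: model prediction plus output-error correction.
        pos_hat, vel_hat = (
            pos_hat + dt * vel_hat + l1 * innovation,
            model_decay * vel_hat + l2 * innovation,
        )
        # True plant update.
        pos, vel = pos + dt * vel, vel
    return abs(pos - pos_hat) + abs(vel - vel_hat)


error_exact_model = simulate_observer(model_decay=1.0)
error_wrong_model = simulate_observer(model_decay=0.9)
print(error_exact_model, error_wrong_model)
```

With the exact model, the estimate converges to the true state from output measurements alone. With the mismatched model, a persistent estimation error remains, which is the difficulty the paragraph above points to.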
The lack of full-state measurements of the system makes ADP inapplicable to adaptive control applications, which is undesirable. Accordingly, there is a need for a system and a method for data-driven output feedback control of a system with only a partially observable state and underdetermined dynamics.