The present invention, in some embodiments thereof, relates to control systems and, more specifically, but not exclusively, to using Markov decision processes (MDP) for determining optimal control actions of control systems.
Markov decision processes and their extensions, such as constrained Markov decision processes (CMDP), partially-observable MDP (PO-MDP), and the like, are decision optimization techniques used in many practical system applications. MDPs are decision-making methods used when the decision outcome is determined in part randomly and in part controllably by a control system, such as a decision control system. In this scenario, the MDP method decides which control action at each time point is most likely to bring the system to optimal operational performance, such as when a cost function is minimized and/or a reward function is maximized. Each control action transitions the system from a beginning state to a new state, such as an ending state, where the new state may be determined, at least in part, by the control action chosen at that time point. Each control action is a combination of one or more control variables, such as a speed of a pump, a temperature of a boiler, an addition of a certain amount of a chemical, and the like. As used herein, the term control action means a specific action of changing a control variable to control a system at a specific discrete time point. Each system state is associated with one or more cost values and/or one or more reward values. The optimum control action at each starting system state is determined by computing a cost and/or reward function over the possible control actions and the ending system states to which they may lead. For example, when an optimal control action transitions the current system state to a new system state, the output value of the reward function may increase and/or the output value of the cost function may decrease.
The system state is a particular configuration of system variables, such as a particular set of values for the system variables each acquired from a sensor. The values for each system variable are stratified into steps, such that each value step of a variable is a unique variable state. For example, each variable state is a value of a system sensor. Each unique set of value steps for the system variables is a particular system state, such as the values of all system sensors attached to the application system. As used herein, the term system state means a specific set of values for all system variables of a system under control by a control system. System variables may be of two types: controllable system variables and action-independent system variables. For example, action-independent system variables in an example application of a wastewater treatment plant are influent flow, influent chemical load, electricity cost time window, and the like. Actions and controllable system variables do not affect action-independent system variables. Controllable system variables describe internal and/or output characteristics of the system, and controllable system variables may be affected by past actions and system variables of any type.
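As a minimal illustrative sketch of the stratification described above, the following Python code maps raw sensor readings to value steps and then to a single system state index. The variable names, the bin edges, and the helper functions are hypothetical and chosen only for illustration; they are not taken from any particular embodiment.

```python
from bisect import bisect_right

# Hypothetical bin edges for two system variables (illustrative values only).
# N edges split a variable's range into N + 1 value steps.
BIN_EDGES = {
    "influent_flow": [100.0, 200.0, 300.0],  # 4 value steps
    "tank_temperature": [15.0, 25.0],        # 3 value steps
}

# Fixed variable order so that every system state is encoded consistently.
VARIABLES = sorted(BIN_EDGES)


def variable_state(name, value):
    """Return the index of the value step that contains `value`."""
    return bisect_right(BIN_EDGES[name], value)


def system_state(readings):
    """Map a dict of sensor readings to a tuple of variable states."""
    return tuple(variable_state(name, readings[name]) for name in VARIABLES)


def state_index(state):
    """Encode a tuple of variable states as one integer (mixed-radix code)."""
    index = 0
    for name, step in zip(VARIABLES, state):
        index = index * (len(BIN_EDGES[name]) + 1) + step
    return index


def number_of_states():
    """Total number of distinct system states under this stratification."""
    total = 1
    for name in VARIABLES:
        total *= len(BIN_EDGES[name]) + 1
    return total
```

For instance, a reading of 250.0 for the hypothetical influent_flow variable and 10.0 for tank_temperature falls in value steps 2 and 0 respectively, so the system state is the pair (2, 0). The number of system states grows as the product of the number of value steps of each variable.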
For example, a sensor may be a hardware sensor of a physical measurement, such as a temperature sensor, and/or a software sensor, such as monitoring code for one or more data streams. In an industrial system application, a hardware sensor may be a temperature sensor, a position sensor, a pressure sensor, a flow sensor, a light sensor, a chemical species sensor, a pH sensor, a gas sensor, a fluid level sensor, a status sensor, a purity sensor, and the like. For example, a software sensor may be a monitoring program for an array of sensors and may sense the presence of sensor value patterns.
To select the optimal control action of a given system state, previously recorded system state transitions and associated control actions are analyzed to determine the transition probabilities when the system was in the same situation, or system state. For example, the control action that has the highest probability of bringing the system to a new state that has a higher reward value and/or lower cost value may be selected as the optimal control action. For example, a cost and/or reward function is computed for multiple states and control actions for a time range under consideration, such as over the coming week. As another example, the time range is a long-term time range extending over months, years and the like. These calculations may use dynamic programming and/or linear programming techniques to find the optimal control actions based on the cost and/or reward functions.
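One common dynamic programming technique for this kind of calculation is value iteration. The following Python code is a minimal sketch of it for a cost-minimizing MDP, under stated assumptions: the transition probabilities are already known, costs are attached to state and action pairs, and future costs are weighted by a discount factor. The function name, the argument layout, and the discount factor are illustrative assumptions and not a definitive implementation of any described embodiment.

```python
def value_iteration(P, cost, discount=0.95, tol=1e-8, max_iter=10_000):
    """Sketch of value iteration for a cost-minimizing MDP.

    P[a][s][t]  -- probability of moving from state s to state t under action a
    cost[s][a]  -- immediate cost of executing action a in state s
    discount    -- assumed discount factor in [0, 1) for future costs
    Returns (values, policy), where policy[s] is the selected action index.
    """
    n_actions = len(P)
    n_states = len(P[0])

    def q_value(s, a, values):
        # Immediate cost plus discounted expected cost of the ending state.
        expected = sum(P[a][s][t] * values[t] for t in range(n_states))
        return cost[s][a] + discount * expected

    values = [0.0] * n_states
    for _ in range(max_iter):
        new_values = [
            min(q_value(s, a, values) for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:  # stop once the values no longer change
            break

    # For each state, select the action with the lowest expected cost.
    policy = [
        min(range(n_actions), key=lambda a, s=s: q_value(s, a, values))
        for s in range(n_states)
    ]
    return values, policy


# Hypothetical two-state, two-action example: action 0 keeps the state,
# action 1 switches it; state 1 is costlier and switching costs extra.
P_example = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
cost_example = [[0.0, 0.5], [1.0, 1.5]]
example_values, example_policy = value_iteration(P_example, cost_example)
```

In this hypothetical example the selected policy stays in the low-cost state and switches out of the high-cost state. A reward-maximizing variant would replace each minimum with a maximum.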
Thus the transition probability values, such as probability values organized in a transition probability matrix, constitute a key component of determining control actions in the MDP method. Most implementations of MDP methods use known transition probabilities from previously acquired and/or measured system transition data and/or estimated transition probabilities from simulations of one or more system models. For example, measured transition data are sets of system sensor values acquired before and after a control action thereby recording the system state transition, where the system state transition occurred at a particular time. As used herein, the term transition data or transition datasets means the beginning system state, the executed control action, and the ending system state of a system state transition. For example, simulated transition data are sets of system sensor values simulated using one or more system models before and after a control action thereby recording the system state transition. The system models may be a series of equations that predict the changes to sensor values after execution of a control action. For example, the system models simulate changes to concentrations of chemical compounds after raising the temperature of a boiler according to a differential chemical equation.
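As a minimal sketch of how transition probabilities may be estimated from transition datasets, the following Python code counts how often each ending system state followed a given beginning system state and control action, and normalizes the counts into relative frequencies. The function name and the tuple layout of a transition dataset are illustrative assumptions; frequency counting is only one possible estimator.

```python
from collections import Counter, defaultdict


def estimate_transition_probabilities(transitions):
    """Estimate transition probabilities by relative frequency.

    transitions -- iterable of (beginning_state, action, ending_state) tuples
    Returns a dict mapping (beginning_state, action) to a dict that maps each
    observed ending_state to its estimated probability.
    """
    counts = defaultdict(Counter)
    for begin, action, end in transitions:
        counts[(begin, action)][end] += 1

    probabilities = {}
    for key, end_counts in counts.items():
        total = sum(end_counts.values())
        probabilities[key] = {end: n / total for end, n in end_counts.items()}
    return probabilities


# Hypothetical transition datasets: (beginning state, action, ending state).
recorded = [
    (0, "raise_temperature", 1),
    (0, "raise_temperature", 1),
    (0, "raise_temperature", 0),
    (1, "hold", 1),
]
estimated = estimate_transition_probabilities(recorded)
```

State and action pairs that never appear in the recorded data receive no estimate in this sketch, which illustrates why the amount and coverage of transition data matter for the MDP method.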
MDP methods may be used to determine optimal operation, such as optimal decisions and/or control actions, in maintenance systems, health care systems, agriculture systems, management systems of water resources, wastewater treatment systems, and the like.