CPC G06N 7/01 (2023.01) [G05B 13/0265 (2013.01); G05B 15/02 (2013.01); G06N 20/00 (2019.01); Y02B 10/30 (2013.01)] | 20 Claims |
1. A method for controlling a system, comprising:
estimating an optimal control policy for the system;
receiving data representing sequential states and associated trajectories of the system, comprising off-policy states and associated off-policy trajectories;
improving the estimate of the optimal control policy by performing at least one approximate value iteration, each approximate value iteration comprising:
estimating an expected value of operation of the system dependent on the estimated optimal control policy;
using a complex return of the received data, biased by the off-policy states, to determine a bound dependent on at least the off-policy trajectories;
using the determined bound to improve the estimate of the expected value of operation of the system; and
updating the estimate of the optimal control policy, dependent on the improved estimate of the expected value of operation of the system using the determined bound; and
employing the updated estimate of the optimal control policy to control the system with an automated controller, wherein the automated controller is configured to automatically alter at least one of the system, and an environment in which the system operates.
|
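The iteration recited in claim 1 can be illustrated with a small tabular sketch. This is only one possible reading under stated assumptions, not the claimed implementation: the system is assumed deterministic with discrete states and actions, each trajectory is a list of hypothetical `(state, action, reward, next_state, done)` tuples, and the bound is taken as the largest multi-step return along the recorded off-policy trajectory, a simple stand-in for the "complex return" of the claim. The function names `n_step_returns`, `bounded_value_iteration`, and `greedy_policy` are illustrative and do not appear in the source.

```python
from collections import defaultdict


def n_step_returns(traj, t, q, actions, gamma):
    """Return the 1-step, 2-step, ... returns starting at step t of one trajectory."""
    returns, acc = [], 0.0
    for n, (_, _, r, s_next, done) in enumerate(traj[t:], start=1):
        acc += gamma ** (n - 1) * r  # discounted reward collected so far
        # Bootstrap from the current value estimate unless the episode ended.
        bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
        returns.append(acc + gamma ** n * bootstrap)
        if done:
            break
    return returns


def bounded_value_iteration(trajectories, actions, gamma=0.9, sweeps=50):
    """Tabular value iteration whose targets are raised to a trajectory-based bound."""
    q = defaultdict(float)  # estimated value of each (state, action) pair
    for _ in range(sweeps):
        targets = defaultdict(list)
        for traj in trajectories:
            for t, (s, a, _, _, _) in enumerate(traj):
                rets = n_step_returns(traj, t, q, actions, gamma)
                one_step = rets[0]  # ordinary value-iteration target
                # Assumption: multi-step returns from off-policy data act as a
                # lower bound on the optimal value (deterministic system only).
                bound = max(rets[1:], default=float("-inf"))
                targets[(s, a)].append(max(one_step, bound))
        new_q = defaultdict(float)
        for key, values in targets.items():
            new_q[key] = sum(values) / len(values)
        q = new_q
    return q


def greedy_policy(q, states, actions):
    """Updated policy estimate: pick the highest-valued action in each state."""
    return {s: max(actions, key=lambda b: q[(s, b)]) for s in states}
```

In this sketch, the policy returned by `greedy_policy` is what an automated controller would then apply to the system. Taking the maximum of the one-step target and the bound lets reward information propagate along a whole recorded trajectory in a single sweep, rather than one step per sweep; for a stochastic system a single sampled return is not a guaranteed lower bound, so the bound would need a different estimate.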