Exemplary embodiments described herein are directed to the problem of finding an efficient online control policy in a setting in which a learner agent interacts with an environment. The learner agent may take actions in the environment and accrue some amount of value as a consequence of the actions. For example, in some implementations, the learner agent may employ a robotic device that is used to lift and transport heavy steel beams in an outdoor building environment, or a robotic arm that is used to repair parts on various machines in an industrial factory environment. The robotic device or arm may require maintenance on a periodic schedule, but experiences downtime while undergoing maintenance. On the other hand, if the robotic device or arm breaks due to lack of maintenance, a much longer downtime may be experienced. The goal of the agent may be to maximize the average long-term value accrued. In this example, the actions taken may include taking the robotic device or arm offline for maintenance, and the value may be represented as the long-term uptime of the robotic device or arm.
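The maintenance trade-off described above may be illustrated with a minimal simulation sketch. The sketch is offered for illustration only and is not the control policy of the exemplary embodiments: it evaluates a simple fixed-interval maintenance schedule rather than a learned policy. The function name average_uptime and all numeric parameters (the downtime durations, the failure probabilities, and the wear rate) are hypothetical assumptions that do not appear in the description.

```python
import random


def average_uptime(maintain_every, steps=100_000, seed=0,
                   maintenance_downtime=1, breakdown_downtime=20,
                   base_failure_prob=0.001, wear_rate=0.002):
    """Estimate the long-run fraction of time a machine is operating.

    All parameters are hypothetical and chosen only for illustration.
    The machine is taken offline for maintenance after `maintain_every`
    consecutive operating steps. Its per-step failure probability is
    assumed to grow linearly with the time since the last service.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    up_steps = 0               # number of steps spent operating
    age = 0                    # operating steps since last service or repair
    t = 0                      # elapsed time, including downtime
    while t < steps:
        if age >= maintain_every:
            # Scheduled maintenance: short, planned downtime.
            t += maintenance_downtime
            age = 0
            continue
        failure_prob = min(1.0, base_failure_prob + wear_rate * age)
        if rng.random() < failure_prob:
            # Breakdown: much longer, unplanned downtime.
            t += breakdown_downtime
            age = 0
            continue
        # The machine operated normally for this step.
        up_steps += 1
        age += 1
        t += 1
    return up_steps / t


if __name__ == "__main__":
    # Compare very frequent, moderate, and effectively no maintenance.
    for interval in (1, 10, 10**9):
        print(interval, round(average_uptime(interval), 3))
```

Under these assumed parameters, servicing the machine too often wastes time on planned downtime, while never servicing it incurs frequent long breakdowns, so an intermediate interval yields a higher average uptime than either extreme. An agent of the kind described herein would have to discover such a trade-off through interaction, without being given the failure model in advance.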
Often, such agents are embodied as a machine learning system and may be trained using, for example, reinforcement learning. In reinforcement learning, a model simulates the dynamics of the environment with which the agent interacts. Accordingly, the agent can experiment with many different actions and identify some of the consequences of the actions. However, it is typically not possible to propagate a single interaction trajectory through time in order to learn a policy. In most real-world situations, the model of the environment with which the agent interacts is not fully known a priori, because future dynamics cannot be predicted with complete accuracy. Maximizing the accrued value over time can be difficult in these circumstances.
These problems tend to be compounded when the system must deal with a changing context, such as when the value accrued depends on the actions or preferences of human actors (which can change over time and may not be entirely predictable). Existing reinforcement learning techniques generally require a significant amount of time and computer processing resources to adjust the model in response to a change in the environment. This may be disastrous when the environmental change requires quick action. It would be desirable for the system to change its actions within a few interactions, without waiting for the end of a predetermined time horizon to update its decision-making.
Still further, the preferences of such human users may not be monolithic. Different users may prefer different actions to different degrees and at different times, and hence the system must be particularly careful at each possible interaction point.