Mobile robots have been deployed in many types of environments, including environments shared with one or more people, such as homes, offices and hospitals, where they provide assistance. To provide this assistance, however, a mobile robot should plan efficient sequences of actions for performing tasks. Additionally, to provide assistance most effectively, sequence planning needs to respond quickly to the dynamic addition of new tasks and to changes in the environment surrounding the mobile robot.
In particular, mobile robots often need to dynamically plan an action sequence for delivering an object to a person or location. Conventional methods for planning a delivery task, or a similar task, use a Partially Observable Markov Decision Process (POMDP) to construct a policy for performing the task. The policy identifies which action to take at each of the different states that arise, or may arise, during operation of the mobile robot. Typically, the policy constructed using a POMDP maximizes a utility criterion, such as expected total reward, or minimizes an expected time for task completion. A policy may be manually encoded, but manual encoding limits the number of tasks described by the policy and makes it difficult to modify the policy to account for new tasks or environmental changes.
Because of the limitations of manual encoding, policies for complex tasks involving decision making are usually generated by reinforcement learning, where actions taken at different states are optimized through trial and error. One method for reinforcement learning is State-Action-Reward-State-Action (“SARSA”), which learns a Q function approximating the value of an action taken at a state. The values of the Q function, or “Q values,” represent the sum of rewards for states in future time intervals, or “future states,” with each reward attenuated according to how far in the future it is received.
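For illustration only, the standard tabular form of the SARSA update can be sketched as follows. This is a minimal sketch of the textbook rule, not an implementation taken from this description; the state and action names, the learning rate `alpha`, and the attenuation factor `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """Apply one tabular SARSA update in place and return the new Q value.

    alpha (learning rate) and gamma (attenuation of future reward) are
    illustrative values, not values specified by the description.
    """
    # Target: immediate reward plus the attenuated Q value of the
    # state-action pair taken in the next time interval.
    td_target = reward + gamma * q[(next_state, next_action)]
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical delivery example: unseen state-action pairs default to 0.0.
q = defaultdict(float)
q[("at_door", "hand_over")] = 10.0
new_value = sarsa_update(q, "in_hallway", "move_forward", -1.0,
                         "at_door", "hand_over")
```

In this sketch the Q value for the current time interval depends directly on the Q value of the state and action in the next time interval, which is the dependency the name State-Action-Reward-State-Action refers to.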
However, because SARSA relies on the future reward from the next time interval when computing the Q value for the current time interval, uncertainty in determining the next state impairs the accuracy of the Q value. For example, when the relevant entities of the next state have not been observed, errors are introduced into the future reward from the next time interval, creating inaccuracies in the computed Q value. Thus, uncertainty in estimating the state during the next time interval causes inaccurate calculation of the Q value for the current state. This reduces the accuracy of the generated policy, which in turn impairs task performance by the mobile robot.
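The effect of a misestimated next state can be shown with a small numeric sketch. The state names, rewards, Q values and the attenuation factor `gamma` below are hypothetical and serve only to show how an error in the next state carries over into the value used to update the current state.

```python
def td_target(q, reward, next_state, next_action, gamma=0.9):
    """Return the SARSA target: reward plus attenuated next-interval Q value."""
    return reward + gamma * q.get((next_state, next_action), 0.0)


# Hypothetical Q values: handing over the object is valuable only if the
# recipient is actually present in the next time interval.
q = {
    ("person_present", "hand_over"): 10.0,
    ("person_absent", "hand_over"): 0.0,
}

# Target computed with the true next state.
true_target = td_target(q, -1.0, "person_present", "hand_over")
# Target computed when the unobserved person is wrongly assumed absent.
wrong_target = td_target(q, -1.0, "person_absent", "hand_over")
# The difference is the error injected into the current Q value's update.
target_error = true_target - wrong_target
```

Under these assumed numbers, the wrong estimate of the next state shifts the update target by the full attenuated difference between the two next-interval Q values, so the Q value for the current state is pulled toward an incorrect value.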
Hence, there is a need for a system for reducing the effect of uncertainty in state observations in reinforcement learning.