Deciding how to act under uncertainty while reasoning about possible future states is a widely cited challenge that has been researched for many years. For autonomous systems, a prominent class of such problems can be summarized as a choice between acting now on current evidence and waiting for additional evidence that may improve action selection, at the cost of delay.
As a practical example, physically situated systems such as robots or embodied conversational agents typically rely on continual sensing to make inferences about the state of the world they sense and to guide their decisions. To identify the best actions over time, these systems need to evaluate whether to act immediately using current sensory data or to wait for more data that may improve state estimates before acting. Consider a conversational agent embodied as a program that operates a display monitor, speakers, a microphone, and a camera mounted outside a person's office. The agent may use a combination of face detection and tracking components to track the trajectories of people in its vicinity based on an analysis of pixels in the video stream. In addition, a face recognition component may be used to identify actors in the scene. At a higher level, the spatial trajectory and identity percepts can be fused to make inferences about a person's goals and ultimately to drive interaction decisions, such as when to initiate or break conversational engagement with people nearby.
The traditional approach to deliberating about the value of collecting additional information before acting is to compute the expected value of information (VOI), a measure of the difference between the expected value of the best decision after the information is collected and the expected value of the best decision before it is collected, taking into account the cost of acquiring the information. This cost includes the loss in value associated with delaying action to await the new information. However, for an autonomous system such as a conversational agent, the sensory evidence is streaming and high-dimensional (e.g., thousands of pixels received regularly in captured frames). Computing VOI in such settings poses challenges that make the traditional approaches unsuitable.
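
To make the traditional VOI computation concrete, the following is a minimal sketch for a small discrete problem. It is not taken from any particular system: the two states, the two actions, the utilities, the observation likelihoods, the delay cost, and all function names are hypothetical values chosen for illustration only. The sketch compares the best expected utility under the current belief with the expected best utility after one more observation, and subtracts the cost of waiting.

```python
# Illustrative sketch of a one-step expected value of information (VOI)
# calculation. All numbers and names below are assumptions for illustration.

def expected_utility(belief, action_utilities):
    """Expected utility of one action under a belief over states."""
    return sum(p * u for p, u in zip(belief, action_utilities))


def best_expected_utility(belief, utilities):
    """Expected utility of the best available action under a belief."""
    return max(expected_utility(belief, row) for row in utilities.values())


def net_value_of_information(prior, utilities, likelihoods, cost):
    """Expected gain from observing once before acting, minus the cost.

    prior       -- P(state) for each state
    utilities   -- maps action name to utility per state
    likelihoods -- maps observation name to P(observation | state) per state
    cost        -- cost of acquiring the observation, including delay
    """
    value_now = best_expected_utility(prior, utilities)
    value_after = 0.0
    for obs_likelihood in likelihoods.values():
        joint = [p * l for p, l in zip(prior, obs_likelihood)]
        p_obs = sum(joint)  # marginal probability of this observation
        if p_obs == 0.0:
            continue  # an impossible observation contributes nothing
        posterior = [j / p_obs for j in joint]  # Bayes' rule
        value_after += p_obs * best_expected_utility(posterior, utilities)
    return value_after - value_now - cost


# Hypothetical states: [person intends to engage, person is passing by]
prior = [0.5, 0.5]
utilities = {
    "initiate_engagement": [1.0, -1.0],
    "stay_idle": [0.0, 0.0],
}
# Hypothetical tracker output observed after waiting one more time step
likelihoods = {
    "approaching": [0.8, 0.2],
    "not_approaching": [0.2, 0.8],
}
delay_cost = 0.1

net_voi = net_value_of_information(prior, utilities, likelihoods, delay_cost)
decision = "wait" if net_voi > 0 else "act now"
print(f"net VOI = {net_voi:.3f}, decision: {decision}")
```

With these assumed numbers the net VOI is positive (about 0.2), so waiting for one more observation is preferred. Note that the loop enumerates every possible observation. That is feasible for two tracker outcomes, but not for streaming evidence such as raw video frames, where the space of possible observations is far too large to enumerate.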