A convergence of maturing software technologies has enabled intelligent personal assistants (IPAs) to become practical for everyday use. Speech recognition accuracy, machine learning, and quick access to diverse data have combined to make it possible for IPAs to understand and execute complex voice commands (as used herein, "command" refers to both directives and questions). Some well-known IPAs include Siri™ by Apple, Google Now (or Google Assistant)™, Amazon's Alexa™, Microsoft's Cortana™, Facebook's M™, and Sirius (open source), among others.
While IPAs continue to improve in general capability, these agents have limited understanding of context as it applies to specific objects on a display of a device on which an IPA is executing (or at least some portion thereof, such as a front-end). Presently, to refer to specific on-screen objects, a user must describe properties of an object (e.g., a name) to specify a particular object. Experimental IPAs have enabled verbose, redundant descriptions of locations to specify objects. A user might speak a description such as "send the third object from the upper left corner", "open the icon that is second from the bottom and fourth from the right", or "share the picture of my cat wearing a hat". Such descriptive phrases can be tedious for a user to formulate and are often difficult for an IPA to interpret. Some IPAs are able to infer context for voice commands from information shown on the screen. However, this approach involves attempting to enumerate all objects of interest and is unable to specify context for particular objects. In addition, this approach is particularly limiting on larger devices or in multitasking scenarios, where the object the user may be referring to (e.g., when speaking the command "share this") is highly ambiguous. Some IPAs analyze whatever is on-screen and make inferences and assumptions about the objects based on properties of the objects and perhaps other factors, such as recent user activity or the targets thereof. This heuristic guesswork often fails to recognize the user's intended target. None of the prior approaches for determining which on-screen object a user is referring to have involved explicit manual (i.e., touch) designation. As a result, IPAs end up providing limited value to users as part of existing task flows.
It might appear convenient to use non-speech forms of user input, for instance touch inputs, to specify objects to an IPA. However, most operating systems are already designed to handle touch inputs in pre-defined ways. A touch directed to an object is likely already reserved for triggering an expected response. Discussed below are techniques for enabling touch inputs to be used to specify context for IPAs without interfering with pre-existing touch functionality.
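To illustrate the general idea, the following is a minimal sketch of one possible arrangement, not a description of any actual operating system API. All class and function names here are hypothetical. The sketch assumes that while the IPA is actively listening, a touched object is additionally recorded as context for resolving a deictic command such as "share this", while the touch is still delivered to its ordinary handler so pre-existing behavior is preserved:

```python
# Hypothetical sketch only; names such as AssistantContext and
# dispatch_touch are assumptions for illustration, not a real platform API.

class AssistantContext:
    """Holds the on-screen object, if any, a user touched while the IPA
    was listening, so a spoken command can refer to it explicitly."""

    def __init__(self):
        self.listening = False  # true while the IPA is capturing speech
        self.target = None      # object explicitly designated by touch

    def resolve(self, spoken_command):
        # A deictic reference ("this") resolves to the touched object,
        # avoiding heuristic guesswork about the intended target.
        if self.target is not None and "this" in spoken_command:
            return self.target
        return None


def dispatch_touch(touched_object, context, default_handler):
    """Deliver the touch to its normal handler; as a side effect, record
    the touched object as IPA context only while the IPA is listening."""
    if context.listening:
        context.target = touched_object  # explicit manual designation
    # Pre-existing touch functionality is invoked unchanged.
    return default_handler(touched_object)
```

In this sketch, a touch occurring outside an active listening session behaves exactly as before, since the context is only captured when `listening` is true; this is one simple way the context-designation behavior could coexist with reserved touch semantics.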