Human-computer interfaces that permit users to provide natural language or gestural inputs are becoming increasingly pervasive. For example, a personal assistant application can receive human speech and identify a command based on an analysis of that speech. The personal assistant application can then perform or trigger operations in response to the identified command. Similarly, computer applications may receive images or video of a user and detect human gestures from the images or video. The computer can interpret those gestures as commands and may perform or trigger operations responsive to the identified commands.
These techniques are also being applied in the field of robotics to enable human-robot interactions. For example, users may be able to provide gestures or speech inputs to a robot to command the robot to perform specific actions. In some examples, a user may refer to a particular object in a command, either by word or by gesture. In response to such a command, the robot may be required to identify a physical object in its environment that corresponds to the particular object referenced in the command. Challenges may arise where multiple objects in the robot's environment correspond to the particular object referenced in the command. In those instances, the robot may be required to disambiguate among the multiple objects in order to identify the particular one that the user likely intended to reference in the command.