Field of the Invention
Embodiments of the present invention generally relate to using audio cues to improve retrieval of objects from visual data, e.g., video frames from one or more video cameras.
Description of the Related Art
The ubiquitous presence of cameras in our personal environments, e.g., wearable cameras such as augmented reality (AR) glasses, surveillance devices in smart homes, and video sensors in smart appliances, allows for the deployment of object retrieval applications for real world use. An example of an object retrieval application is a user asking her AR glasses or her smart home console “Where did I put my car keys?” Object retrieval in real world applications is a difficult problem because the desired object may be captured by a camera in a variety of poses, under varying illumination, and with arbitrary amounts of occlusion. Further, large amounts of video data may need to be searched in an attempt to locate the desired object.