Object retrieval systems have become popular in today's commercial and entertainment businesses. For example, it is not unusual for a user to be interested in finding the same or a similar object that appears in a video he or she has just watched. Traditional content-based image retrieval (CBIR) efforts focus on bridging the gap between low-level image features and high-level semantics by analyzing the whole content of static images without considering human interest. To put more emphasis on the potential object region, some methods attempt to approximate the human perception system by segmenting images into regions and modeling the image content via so-called region-based local features. However, the performance of these methods is far from satisfactory because of the limitations of segmentation techniques and the difficulty of identifying salient objects, especially when multiple objects are involved.
The difficulty of the retrieval task escalates to another level when dealing with frames from digital videos instead of static images, because videos are usually filmed under various lighting conditions in an unconstrained manner. Specifically, there are three major difficulties in the task of video object retrieval. First, the potential objects of user interest in videos appear against extremely noisy backgrounds and exhibit numerous variations such as deformation, occlusion, rotation, scale, affine transform, and translation. Second, describing and representing the content of an image (video frame) effectively and efficiently is critical for precisely retrieving the exact or a similar object appearing in the video. Finally, the evaluation of an image retrieval system is relatively subjective and lacks a widely acknowledged standard, which makes improving object retrieval even harder.
The disclosed methods and systems are directed to solving one or more of the problems set forth above and other problems.