As more digital images become available through network storage and network communications (e.g., the Internet), recognizing objects in these images has become increasingly important. For example, object recognition in images may be relevant to image-based search queries and image retrieval, geo-localization applications, tourist guide applications, and so forth.
At least some conventional approaches to object recognition seek to match i) a query image with a known image of the object via scale-invariant feature transform (SIFT) techniques, or ii) the query image with a three-dimensional (3-D) image model developed from multi-view geometry techniques. In particular, these conventional approaches are based on matching individual points. However, the image and/or object information conveyed by an individual point is limited and lacks discriminating ability, which makes accurate object recognition difficult.
Therefore, conventional approaches have evolved to construct visual phrases (e.g., groups of multiple points of an object) that preserve more discriminative information and improve object recognition. However, these visual phrases consider only co-occurrence statistics within a localized region of a two-dimensional (2-D) image plane of the query image and are therefore referred to as 2-D visual phrases. Consequently, 2-D visual phrases are problematic for recognizing objects in images, because the localized region on which they are based does not account for the projective transformations that result from viewpoint changes, for example, a change in the position from which a photo of the object was taken.
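The viewpoint dependence described above can be illustrated with a minimal sketch. It is an illustration only, not a method taken from the conventional approaches: the pinhole camera model, the focal length, the camera positions, the two object points, and the 30-pixel `REGION_RADIUS` are all hypothetical values chosen for the example. The sketch projects the same two points of a flat facade into a frontal view and an oblique view and checks whether they fall within one fixed localized 2-D region.

```python
import math

def project(point, cam_pos, yaw, focal=500.0):
    """Pinhole projection of a 3-D point (assumed camera model).

    The camera sits at cam_pos and is rotated by `yaw` radians about the
    vertical (y) axis; yaw = 0 looks straight along +z.
    """
    dx, dy, dz = (p - c for p, c in zip(point, cam_pos))
    x_cam = dx * math.cos(yaw) - dz * math.sin(yaw)  # offset along camera's right axis
    z_cam = dx * math.sin(yaw) + dz * math.cos(yaw)  # depth along camera's viewing axis
    return (focal * x_cam / z_cam, focal * dy / z_cam)

def pixel_distance(a, b, cam_pos, yaw):
    """Distance in the 2-D image plane between the projections of a and b."""
    ua, va = project(a, cam_pos, yaw)
    ub, vb = project(b, cam_pos, yaw)
    return math.hypot(ua - ub, va - vb)

# Two hypothetical points on a flat facade located at z = 10.
A = (0.0, 0.0, 10.0)
B = (1.0, 0.0, 10.0)

# Hypothetical radius of the localized 2-D region, in pixels.
REGION_RADIUS = 30.0

# Frontal view: camera directly in front of the two points.
d_frontal = pixel_distance(A, B, cam_pos=(0.5, 0.0, 0.0), yaw=0.0)

# Oblique view: camera moved sideways and turned to face the same points.
oblique_yaw = math.atan2(17.5, 10.0)
d_oblique = pixel_distance(A, B, cam_pos=(-17.0, 0.0, 0.0), yaw=oblique_yaw)

co_occur_frontal = d_frontal <= REGION_RADIUS
co_occur_oblique = d_oblique <= REGION_RADIUS

print(f"frontal: {d_frontal:.1f} px, co-occur: {co_occur_frontal}")
print(f"oblique: {d_oblique:.1f} px, co-occur: {co_occur_oblique}")
```

Under these assumed values, the same pair of object points falls inside the fixed region in one view and outside it in the other, so a co-occurrence statistic computed purely in the 2-D image plane changes with the position from which the photo was taken.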