1. Technical Field
The present invention relates to the field of Content-based Multimedia Information Retrieval [LSDJ06] and Computer Vision. More specifically, the invention contributes to the area of Content-based Multimedia Information Retrieval concerned with the problem of searching large collections of images based on their content, and also to the area of Object Recognition, which in Computer Vision is the task of finding a given object in an image or a video sequence.
2. Description of Related Art
Identifying a particular (identical) object in a collection of images is now reaching some maturity [SZ03]. The problem still appears challenging because an object's visual appearance may vary due to changes in viewpoint, lighting conditions, or partial occlusion, but solutions performing relatively well on small collections already exist. Currently the biggest remaining difficulties appear to be partial matching, i.e. recognition of small objects "buried" within complex backgrounds, and the scalability needed to cope with truly large collections.
Recent relevant advances in recognition performance will now be discussed, specifically in the context of rapid identification of multiple small objects in complex scenes based on large collections of high-quality reference images.
In the late nineties David Lowe pioneered a new approach to object recognition by proposing the Scale-Invariant Feature Transform (widely known as SIFT) [LOW99] (U.S. Pat. No. 6,711,293). The basic idea behind Lowe's approach is fairly simple. Objects in the scene are characterized by local descriptors representing the appearance of these objects at certain interest points (salient image patches). The interest points are extracted in a way that is invariant to the scale and rotation of objects present in the scene. FIG. 1 shows examples of SIFT interest key-points [LOW99, LOW04] detected for two photos of the same scene taken from significantly different points of view. The interest points are represented by circles: the centers of the circles mark the locations of the key-points, and their radii represent their scales. An intuitive interpretation of SIFT interest points is that they correspond to blob-like or corner-like structures, and their scales closely correspond to the size of these structures. It should be noted that, irrespective of the viewing angle, most of the key-points are detected at the same positions in the scene. The original images belong to the dataset created by Mikolajczyk et al. [MS04].
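Interest points of this kind are commonly found as local extrema of a difference-of-Gaussians (DoG) pyramid, which is the detection principle underlying SIFT. The following is a minimal illustrative sketch, not the patented SIFT implementation: the blur levels, contrast threshold, and helper names are assumptions chosen for illustration, and the refinement and descriptor steps of the full method are omitted.

```python
import numpy as np

def _gaussian_kernel(sigma):
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def _blur(img, sigma):
    # separable Gaussian blur: convolve rows, then columns
    k = _gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, tmp)

def dog_keypoints(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.02):
    """Detect blob-like key-points as 3x3x3 extrema of a DoG pyramid."""
    levels = [_blur(img, s) for s in sigmas]
    dogs = [levels[i + 1] - levels[i] for i in range(len(levels) - 1)]
    keypoints = []
    for s in range(1, len(dogs) - 1):          # layers with both neighbours
        d = dogs[s]
        for y in range(1, d.shape[0] - 1):
            for x in range(1, d.shape[1] - 1):
                v = d[y, x]
                if abs(v) < thresh:            # reject low-contrast responses
                    continue
                neigh = np.concatenate([
                    dogs[s - 1][y - 1:y + 2, x - 1:x + 2].ravel(),
                    dogs[s][y - 1:y + 2, x - 1:x + 2].ravel(),
                    dogs[s + 1][y - 1:y + 2, x - 1:x + 2].ravel()])
                if v == neigh.max() or v == neigh.min():
                    # scale recorded as the coarser of the two blur levels
                    keypoints.append((x, y, sigmas[s + 1]))
    return keypoints
```

Running the detector on a synthetic image containing a single Gaussian blob yields a key-point at the blob center, with a scale tied to the blob size, matching the blob-like interpretation given above.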
Descriptors extracted from a single training image of a reference object can then be used to identify instances of the object in new images (queries). Systems relying on SIFT points can robustly identify objects in cluttered scenes, irrespective of their scale, orientation, and noise, and also, to a certain extent, of changes in viewpoint and illumination. Lowe's method has found many applications, including image retrieval and classification, object recognition, robot localization, image stitching, and many others.
Encouraged by the performance of the SIFT method, many researchers focused their work on further extending the capabilities of the approach. For example, Mikolajczyk and Schmid [MS04] proposed affine covariant detectors that enabled unprecedented robustness to changes in viewing angle. Matas et al. [MCUP02] proposed an alternative method for extracting feature points, termed Maximally Stable Extremal Regions, which extracts interest points different from the ones selected by the SIFT detector. Very recently, Bay et al. [BTG06] proposed a computationally efficient version of the SIFT method termed Speeded Up Robust Features (SURF). Surprisingly, the SURF detector is not only three times faster than the SIFT detector, but also, in some applications, capable of providing superior recognition performance. One of the most interesting examples of an application of SURF is the recognition of objects of art in an indoor museum containing 200 artifacts, providing a recognition rate of 85.7%.
In many application areas the success of the feature point approaches has been truly spectacular. However, until recently, it was still impossible to build systems able to efficiently recognize objects in large collections of images. This situation improved when Sivic and Zisserman proposed to use feature points in a way that mimics text retrieval systems [SZ03, SIV06]. In their approach, which they termed "Video Google", feature points from [MS04] and [MCUP02] are quantized by k-means clustering into a vocabulary of so-called Visual Words. As a result, each salient region can be easily mapped to the closest Visual Word, i.e. key-points are represented by visual words. An image is then represented as a "Bag of Visual Words" (BoW), and these are entered into an index for later querying and retrieval. The approach is capable of efficient recognition in very large collections of images. For example, identification of a small region selected by the user in a collection of four thousand images takes 0.1 seconds.
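The quantization step just described can be sketched in a few lines: cluster a pool of training descriptors into k visual words with plain k-means, then represent each image as a histogram over those words. This is a minimal sketch assuming Euclidean descriptors and a small vocabulary; production systems use far larger vocabularies and more careful clustering.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Plain k-means over local descriptors: the 'Vocabulary of Visual Words'."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def bag_of_words(descriptors, vocabulary):
    """Map each descriptor to its closest visual word; return a word histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))
```

Once images are reduced to such histograms, they can be indexed and compared exactly like text documents, which is the key to the scalability of the approach.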
Although the results of "Video Google" were very impressive, especially when compared to other methods available at the time, searching for entire scenes or even large regions was still prohibitively slow. For example, matching scenes represented using images of size 720×576 pixels in a collection of four thousand images took approximately 20 seconds [SIV06]. This limitation was alleviated to a certain extent by Nister and Stewenius [NS06], who proposed a highly optimized image-based search engine able to perform close to real-time image recognition in larger collections. In particular, their system was capable of providing good recognition results on a collection of 40,000 CD covers in real time.
Finally, very recently, Philbin et al. [PCI+07, PCI+08] proposed an improved variant of the "Video Google" approach and demonstrated that it is able to rapidly retrieve images of 11 different Oxford "landmarks" from a collection of five thousand high-resolution (1024×768) images collected from Flickr [FLI].
The recent spectacular advances in the area of visual object recognition are starting to attract great interest from industry. Currently several companies offer technologies and services based, at least partially, on the above-mentioned advances.
Kooaba [KOO], a spin-off company from ETH Zurich founded at the end of 2006 by the inventors of the SURF approach [BTG06], uses object recognition technology to provide access to and search for digital content from mobile phones. Kooaba's search results are accessed by sending a picture as a query. They advocate their technology as allowing users to literally "click" on real-world objects such as movie posters, linked articles in newspapers or magazines, and in the future even tourist sights.
Evolution Robotics of Pasadena, Calif. [EVO] developed a visual search engine able to recognize what the user took a picture of, so that advertisers can use that information to push relevant content to the user's cellphone. They predict that within the next 10 years one will be able to hold up one's cellphone and have it visually tag everything in view. One of the advisors of Evolution Robotics is Dr. David Lowe, the inventor of the SIFT approach [LOW99].
SuperWise Technologies AG [SUP], the company that developed the Apollo image recognition system, has created a novel mobile phone program called eye-Phone, able to provide the user with tourist information wherever he is. In other words, eye-Phone can provide information on what the user sees when he sees it. The program combines three of today's modern technologies: satellite navigation localization services, advanced object recognition, and relevant information retrieved from the Internet. With eye-Phone on his phone, for instance while out walking, the user can take a photograph with his mobile phone and select the item of interest with the cursor. The selected region is then transmitted together with satellite navigation localization data to a central system that performs the object recognition and interfaces with databases on the Internet to obtain information on the object. The information found is sent back to the phone and displayed to the user.
Existing approaches have significant limitations. Currently, only methods relying on local image features appear close to fulfilling most of the requirements of a search engine that delivers results in response to photos.
One of the first systems belonging to this category of methods and performing real-time object recognition with collections of tens of images was proposed by David Lowe, the inventor of SIFT [LOW99, LOW04]. In the first step of this approach, key-points were matched independently against a database of key-points extracted from reference images, using an approximate method for finding nearest neighbours termed Best-Bin-First (BBF) [BL97]. These initial matches were further validated in a second stage by clustering in pose space using the Hough transform [HOU62]. This system appears to be well suited to object recognition in the presence of clutter and occlusion, but there is no evidence in the literature that it can scale to collections larger than tens of images.
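The pose-space clustering in the second stage can be sketched as Hough-style voting: each tentative key-point match implies a coarse object pose (translation, scale change, rotation), and matches are accumulated into pose bins so that a few geometrically consistent votes stand out from scattered clutter. This is an illustrative sketch only; the bin widths and the vote threshold below are assumed values, not those of Lowe's system.

```python
import math
from collections import defaultdict

def hough_pose_clusters(matches, loc_bin=32.0, scale_base=2.0,
                        ori_bin=math.radians(30), min_votes=3):
    """Cluster tentative matches by the coarse pose each one implies.

    Each match pairs a model key-point (x, y, scale, orientation) with a
    scene key-point of the same form. Consistent matches fall into the
    same pose bin; clutter matches scatter across many bins.
    """
    bins = defaultdict(list)
    for m in matches:
        (mx, my, ms, mo), (sx, sy, ss, so) = m
        dscale = math.log(ss / ms, scale_base)        # log scale change
        dori = (so - mo) % (2 * math.pi)              # rotation change
        key = (round((sx - mx) / loc_bin),            # quantized translation
               round((sy - my) / loc_bin),
               round(dscale),                         # quantized scale
               round(dori / ori_bin))                 # quantized rotation
        bins[key].append(m)
    # keep only bins with enough consistent votes
    return [votes for votes in bins.values() if len(votes) >= min_votes]
```

Because a cluttered scene produces many false individual matches but few accidental agreements on a full pose, this voting step is what makes the approach robust to clutter and occlusion.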
To improve scalability, other researchers proposed to use feature points in a way that mimics text-retrieval systems [SZ03, SIV06]. For example, Sivic and Zisserman [SZ03, SIV06, PCI+07, PCI+08] proposed to quantize key-point descriptors by k-means clustering, creating the so-called "Vocabulary of Visual Words". The recognition is performed in two stages. The first stage is based on the vector-space model of information retrieval [BYRN99], where the collection of visual words is used with the standard Term Frequency Inverse Document Frequency (TF-IDF) scoring of the relevance of an image to the query. This results in an initial list of the top n candidates potentially relevant to the query. It should be noted that typically no spatial information about the image locations of the visual words is used in the first stage. The second stage typically involves some type of spatial consistency check, where key-point spatial information is used to filter the initial list of candidates. The biggest limitation of approaches in this category originates from their reliance on TF-IDF scoring, which is not particularly well suited to identifying small objects "buried" in cluttered scenes. Identification of multiple small objects requires accepting much longer lists of initial matching candidates. This increases the overall cost of matching, since the subsequent validation of spatial consistency is computationally expensive compared to the cost of the initial stage. Moreover, our experiments indicate that these types of methods are ill suited to the identification of many types of real products, such as soda cans or DVD boxes, since the TF-IDF scoring is often biased by key-points from the borders of the objects, which are often assigned to visual words that are common in scenes containing other man-made objects.
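The first-stage TF-IDF scoring works on the BoW histograms alone, with no spatial information, as noted above. A minimal sketch of this scoring, assuming cosine similarity over TF-IDF weighted vectors (one common choice for the vector-space model; the exact weighting varies between systems):

```python
import numpy as np

def tfidf_ranking(query_bow, db_bows):
    """Rank database images against a query by cosine similarity of
    TF-IDF weighted Bag-of-Visual-Words vectors."""
    db = np.asarray(db_bows, dtype=float)
    n_docs = len(db)
    # document frequency: number of images containing each visual word
    df = (db > 0).sum(axis=0)
    idf = np.log(n_docs / np.maximum(df, 1))

    def weight(vec):
        tf = vec / max(vec.sum(), 1.0)        # term frequency
        w = tf * idf
        norm = np.linalg.norm(w)
        return w / norm if norm else w

    q = weight(np.asarray(query_bow, dtype=float))
    scores = np.array([weight(d) @ q for d in db])
    return np.argsort(-scores), scores        # best-scoring image first
```

Note that every key-point contributes equally to the score, which illustrates the limitation discussed above: visual words from background clutter or object borders can dominate the score of a small object of interest.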
Because of the computational cost of the spatial consistency validation step, Nister and Stewenius [NS06] concentrated on improving the quality of the pre-geometry stage of retrieval, which they suggest is crucial in order to scale up to large databases. As a solution, they proposed hierarchically defined visual words that form a vocabulary tree allowing more efficient lookup of visual words. This enables the use of much larger vocabularies, which was shown to result in an improvement in the quality of the results, without involving any consideration of the geometric layout of the visual words. Although this approach scales very well to large collections, so far it has been shown to perform well only when the objects to be matched cover most of the image. It appears that this limitation is caused by the reliance on a variant of TF-IDF scoring and the lack of any validation of spatial consistency.
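The efficiency gain of the vocabulary tree comes from hierarchical quantization: descriptors are clustered recursively with a small branch factor, and lookup descends the tree greedily, so assigning a descriptor to one of k leaf words costs on the order of branch × depth comparisons instead of k. The sketch below assumes small 2-D descriptors and a simple in-line k-means for clarity; it illustrates the data structure, not the optimized engine of [NS06].

```python
import numpy as np

def build_tree(descriptors, branch, depth, kmeans_iters=10, seed=0):
    """Hierarchical k-means: each node splits its descriptors into `branch` children."""
    rng = np.random.default_rng(seed)

    def kmeans(pts, k):
        c = pts[rng.choice(len(pts), k, replace=False)]
        for _ in range(kmeans_iters):
            labels = np.linalg.norm(pts[:, None] - c[None], axis=2).argmin(axis=1)
            for j in range(k):
                if (labels == j).any():
                    c[j] = pts[labels == j].mean(axis=0)
        return c, labels

    def grow(pts, level):
        if level == depth or len(pts) < branch:
            return {"leaf": True}
        c, labels = kmeans(pts, branch)
        return {"leaf": False, "centroids": c,
                "children": [grow(pts[labels == j], level + 1)
                             for j in range(branch)]}

    return grow(descriptors, 0)

def lookup(tree, desc):
    """Greedy descent: the path of branch choices identifies the visual word."""
    path, node = [], tree
    while not node["leaf"]:
        j = int(np.linalg.norm(node["centroids"] - desc, axis=1).argmin())
        path.append(j)
        node = node["children"][j]
    return tuple(path)   # cost: branch * depth comparisons, not one per word
```

Two nearby descriptors follow the same path and so receive the same visual word, while the greedy descent never touches most of the tree, which is what makes very large vocabularies practical.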