With remarkable advances in computer technology and the increasing popularity of digital cameras and digital video cameras, it is common for an individual to possess a large database of digital images. How to efficiently retrieve desired images from such a database has therefore become an increasingly important topic in computer vision.
Content-Based Image Retrieval (CBIR) addresses the matching and retrieval of images that share similar visual content with a given image. The most common CBIR method for comparing two images (typically an example image and an image from the database) is to use an image distance measure, which compares the similarity of the two images along various dimensions such as color, texture, shape, and others. More advanced CBIR systems retrieve images by statistically attaching linguistic indexes and retrieving by index association, or by using learned mappings in feature space to group similar images. However, CBIR systems may have limitations when handling images with scale or rotation variations.
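By way of illustration only, an image distance measure of the kind described above may be sketched as follows for the color dimension. The sketch assumes each image is given as a list of (R, G, B) pixels; the function names `color_histogram` and `histogram_distance`, the bin count, and the choice of an L1 histogram distance are assumptions made for this example and are not taken from any particular CBIR system.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram with bins_per_channel**3 bins."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Clamp so that bin counts that do not divide 256 stay in range.
        ri, gi, bi = min(r // step, top), min(g // step, top), min(b // step, top)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance between two normalized histograms (0 = same, 1 = disjoint)."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))
```

A database image may then be ranked against an example image by the value of `histogram_distance`, smaller values indicating more similar color content. Because the histogram discards pixel positions, such a measure alone does not capture texture or shape.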
For any object in an image, interesting points on the object can be extracted to provide a “feature description” of the object, which can then be used to identify the object when attempting to locate it in another image that includes many other objects. To perform reliable recognition, it is important that the extracted features are detectable even under changes in image scale, noise and illumination. Scale-Invariant Feature Transform (SIFT), developed by David Lowe in 1999, is an algorithm that is invariant to rotation, translation and scale variation between images. In other words, the key points extracted and described by SIFT are robust to common image transforms. However, the SIFT method may generate a significant amount of data, and processing these data may involve a high computational cost.
SURF (Speeded Up Robust Features), another algorithm similar to SIFT, was proposed by Herbert Bay et al. in 2006 to achieve high speed in three steps of feature processing: detection, description and matching. Utilizing the Hessian matrix, SURF significantly increases the processing speed without sacrificing the quality of the detected points. As shown in FIG. 1, SURF can complete the detection process for the wine tag in 78.2 ms, while SIFT takes 855.09 ms. More specifically, the SURF algorithm may include a feature extraction process, in which a so-called “blob detection” technique is often used to extract key features. Blob detection refers to a feature-finding method that detects points and/or regions in an image whose properties, such as brightness or color, differ from those of their surroundings, and the Hessian matrix is used to generate the blobs that locate key features. As can be seen in FIG. 2, a plurality of blobs are generated utilizing the Hessian matrix to identify key features in a sunflower field image (Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool “SURF: Speeded Up Robust Features”, Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, 2008, which is incorporated herein by reference). However, SURF still has its limitations, especially when the resolution of the image is low.
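The role of the Hessian matrix in blob detection may be illustrated by the following simplified sketch, which evaluates the determinant of the Hessian at each interior pixel of a grayscale image and reports the strongest response. The sketch is an assumption-laden simplification: it uses plain central finite differences at a single scale, whereas SURF as published approximates the second derivatives with box filters over an integral image at multiple scales. The names `hessian_determinant` and `strongest_blob` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale intensities, indexed as image[row][col]


def hessian_determinant(image: Image, row: int, col: int) -> float:
    """Determinant of the 2x2 Hessian at an interior pixel (central differences)."""
    center = image[row][col]
    dxx = image[row][col + 1] - 2.0 * center + image[row][col - 1]
    dyy = image[row + 1][col] - 2.0 * center + image[row - 1][col]
    dxy = (image[row + 1][col + 1] - image[row + 1][col - 1]
           - image[row - 1][col + 1] + image[row - 1][col - 1]) / 4.0
    return dxx * dyy - dxy * dxy


def strongest_blob(image: Image) -> Tuple[int, int, float]:
    """Return (row, col, response) for the interior pixel with the largest response."""
    if len(image) < 3 or len(image[0]) < 3:
        raise ValueError("image must be at least 3x3")
    best = (1, 1, hessian_determinant(image, 1, 1))
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            response = hessian_determinant(image, row, col)
            if response > best[2]:
                best = (row, col, response)
    return best
```

A large positive determinant indicates that the intensity curves in the same direction along both axes, which is characteristic of a bright or dark blob; a negative determinant indicates a saddle-like neighborhood. Border pixels are skipped because the finite differences need a full 3x3 neighborhood.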
In addition to SIFT and SURF, which are commonly used in the computer vision community, a recognition scheme called “Scalable Recognition with a Vocabulary Tree” has been proposed recently. FIG. 3 illustrates an example of the vocabulary tree scheme, where a large number of elliptical regions 301 are extracted from the image and a descriptor vector is computed for each region (www.es.ualberta.ca/˜vis/vision06/slides/birs2006-nister-index.pdf, which is incorporated herein by reference). The vocabulary tree is then used to hierarchically quantize each descriptor vector over several quantization layers. For example, in the first quantization layer, the descriptor is assigned to the closest of the three green centers 302 to 304, and in the second quantization layer, it is assigned to the closest of the three blue descendants of the selected green center. In the vocabulary tree, each node is associated with an inverted file with references to the images containing an instance of the node, and the images in the database are hierarchically scored using the inverted files at multiple levels of the vocabulary tree. However, the noise level is usually high in the vocabulary tree scheme, and the hierarchical quantization introduces quantization errors. Furthermore, the vocabulary tree scheme cannot efficiently capture local features of the image.
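The hierarchical quantization described above may be sketched as follows, purely as an illustration. The sketch assumes the tree has already been built, that each node stores a cluster center, and that squared Euclidean distance is used to select the closest child at each layer; the `Node` class and the `quantize` function are hypothetical names, and the inverted files and scoring are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """One vocabulary-tree node: a cluster center plus its child clusters."""
    center: Sequence[float]
    children: List["Node"] = field(default_factory=list)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(root: Node, descriptor: Sequence[float]) -> List[int]:
    """Descend from the root, choosing the nearest child at each layer.

    Returns the list of chosen child indices, one per quantization layer.
    """
    path: List[int] = []
    node = root
    while node.children:
        index = min(
            range(len(node.children)),
            key=lambda i: squared_distance(node.children[i].center, descriptor),
        )
        path.append(index)
        node = node.children[index]
    return path
```

With a branching factor of three and two layers, as in the example of FIG. 3, `quantize` returns two indices: the first identifies the selected center in the first layer and the second identifies the selected descendant of that center. Only three distance computations are needed per layer, rather than one per leaf of the tree.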
Therefore, there remains a need for a new and improved method and apparatus to efficiently retrieve images, especially when the quality of the image may be affected by scale, orientation, background clutter, lighting conditions, shape of the image, etc.