1. Field of the Invention
The present invention is directed generally to digital image processing, and more particularly to analysis, characterization, and representation of images.
2. Description of the Related Art
Digital images may include raster graphics, vector graphics, or a combination thereof. Raster graphics data (also referred to herein as bitmaps) may be stored and manipulated as a grid of individual picture elements called pixels. A bitmap may be characterized by its width and height in pixels and also by the number of bits per pixel. Commonly, a color bitmap defined in the RGB (red, green, blue) color space may comprise between one and eight bits per pixel for each of the red, green, and blue channels. An alpha channel may be used to store additional data such as per-pixel transparency values. Vector graphics data may be stored and manipulated as one or more geometric objects built with geometric primitives. The geometric primitives (e.g., points, lines, polygons, Bézier curves, and text characters) may be based upon mathematical equations to represent parts of digital images.
How to represent an image is a fundamental problem in many image (including video) analysis and synthesis applications. Images are often analyzed and characterized in terms of their respective features, e.g., the presence or absence of human faces, as well as more primitive attributes such as edges, blobs, corners, etc. In particular, current high-performance image search/retrieval applications generally rely on a bag-of-features (BoF) representation, i.e., a characterization of an image as an unordered collection of its local features.
In addition to characterizing the local image appearance and structure, these local features are usually designed to be robust with respect to changes in rotation, scale, illumination, and viewpoint. These approaches first localize the features at a sparse set of distinctive image points, usually called interest points or key points; this process is performed by a feature detector. Feature vectors are then computed from the local image patches centered at these locations; this process is performed by a feature descriptor. The hypothesis is that the detector selects stable and reliable image locations that are informative about image content, and that the descriptor describes each local patch in a distinctive way with a feature vector (usually of much lower dimension than the original patch). The overall performance of a local feature depends on the reliability and accuracy of the localization and on the distinctiveness of the description. Two major advantages of sparse local features are their robustness to changes in viewpoint, lighting, etc., and the compactness of the representation.
The bag-of-features (BoF) representation is an extension of the bag-of-words representation used in text classification. The basic idea of BoF is to represent each image as an unordered collection of local features. For compact representation in BoF, a “visual vocabulary” is usually constructed in a training phase by clustering the local feature descriptors extracted from a collection of images. Each feature descriptor cluster is treated as a “visual word” in the visual vocabulary. By mapping the feature descriptors in an image to the visual vocabulary, the image may be described with a feature vector that records how frequently each visual word occurs. This so-called term-frequency vector is then used to perform high-level inference, such as image categorization/recognition/retrieval. The BoF approach, although it is simple and does not contain any geometry information, has demonstrated excellent performance for various visual classification and retrieval tasks.
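The vocabulary construction and term-frequency mapping described above can be illustrated with a minimal sketch. It assumes plain (exact) k-means as the clustering step and short toy descriptors; the function names `build_vocabulary`, `nearest_word`, and `term_frequency_vector` are hypothetical and chosen only for illustration, and the code is not intended to represent any particular prior implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def build_vocabulary(descriptors, num_words, iterations=10, seed=0):
    """Training phase: cluster descriptors with plain k-means (an assumption).

    Each resulting cluster center plays the role of one visual word.
    """
    rng = random.Random(seed)
    # Initialize the centers from randomly chosen training descriptors.
    centers = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def term_frequency_vector(image_descriptors, vocabulary):
    """Normalized count of how often each visual word occurs in one image."""
    counts = [0] * len(vocabulary)
    for d in image_descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Toy 2-D "descriptors"; real descriptors would have many more dimensions.
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = build_vocabulary(training, num_words=2)
histogram = term_frequency_vector([(0, 1), (1, 0), (10, 11)], vocabulary)
```

In this sketch the order of the descriptors within an image has no effect on the resulting histogram, which reflects the unordered nature of the representation.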
The key component in the bag-of-features representation is the construction of the visual vocabulary. Existing approaches include k-means, hierarchical k-means, approximate k-means, and regular quantization, among others. Regular quantization is very fast but performs poorly for image retrieval. The approximate k-means (AKM) approach, in which a randomized forest over the cluster centers is built to accelerate the nearest-neighbor search for each data point, performs well for image search. The complexity of this algorithm is O(MN log K), where M is the dimensionality of the feature, N is the number of data points, and K is the number of clusters. However, this approach is still very slow because the feature representation is high dimensional (e.g., 128-dimensional scale-invariant feature transform (SIFT) descriptors) and the number of features is huge; e.g., for a typical 5000-image database (with a typical image resolution of 1024×768), the number of SIFT features is more than 26 million. Consequently, using these approaches, it is difficult to cluster such a large number of features, or more generally to handle such a large amount of data, in a reasonable amount of time.
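The scale of the clustering problem described above can be made concrete with a rough operation count. The following sketch uses the figures given in the text (M = 128, N = 26 million, 5000 images) together with several assumptions that are not from the text: a hypothetical vocabulary size K of one million, a base-2 logarithm, one byte of storage per descriptor dimension, and a brute-force assignment step costing M·N·K operations per iteration for exact k-means. Constant factors are ignored, so the results are order-of-magnitude estimates only.

```python
import math


def exact_assignment_cost(m, n, k):
    """Operations per iteration when every point is compared to every center."""
    return m * n * k


def akm_assignment_cost(m, n, k):
    """Operations per iteration following the O(MN log K) bound (base 2 assumed)."""
    return m * n * math.log2(k)


M = 128            # descriptor dimensionality (from the text)
N = 26_000_000     # number of descriptors (from the text)
NUM_IMAGES = 5000  # database size (from the text)
K = 1_000_000      # vocabulary size: a hypothetical value for illustration

features_per_image = N / NUM_IMAGES     # average descriptors per image
raw_storage_bytes = M * N               # assumes one byte per dimension
exact_cost = exact_assignment_cost(M, N, K)
akm_cost = akm_assignment_cost(M, N, K)
speedup = exact_cost / akm_cost         # equals K / log2(K)

print(f"features per image: {features_per_image:.0f}")
print(f"raw descriptor storage: {raw_storage_bytes / 1e9:.2f} GB")
print(f"exact k-means: {exact_cost:.3e} operations per iteration")
print(f"AKM bound:     {akm_cost:.3e} operations per iteration")
print(f"ratio:         {speedup:.0f}x")
```

Even under the logarithmic bound, each iteration must still touch all M·N descriptor values, which is why the size of the data set itself remains a limiting factor.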