1. Field of the Invention
This invention relates to the computation of robust, discriminative, scalable and compact image descriptors.
More particularly, the present invention relates to image descriptors computed in local regions around image interest points by computing histograms of gradients of subregions within said local regions.
2. Present State of the Art
Image descriptors have found wide applicability in many computer vision applications including object recognition, content-based image retrieval, and image registration, to name a few. One of the most widely known examples of this class of image descriptors is the Scale Invariant Feature Transform (SIFT) descriptor.
Briefly, with the SIFT method, local image descriptors are formed as follows: first, a search across multiple image scales and locations is performed to identify and localise stable image keypoints that are invariant to scale and orientation; then, for each keypoint, one or more dominant orientations are determined based on local image gradients, allowing the subsequent local descriptor computation to be performed relative to the assigned orientation, scale and location of each keypoint, thus achieving invariance to these transformations.
Then, local image descriptors around keypoints are formed as follows: first, gradient magnitude and orientation information is calculated at image sample points in a region around the keypoint; then, these samples are accumulated into orientation histograms summarizing the contents over n×n subregions. By way of illustration only, an example of a keypoint descriptor is shown in FIGS. 1a and 1b, where FIG. 1a shows a subdivision of the local region R into 4×4 subregions SR and FIG. 1b shows a subdivision of the 360° range of orientations into eight bins h for each orientation histogram h, with the length of each arrow corresponding to the magnitude of that histogram entry.
Thus, a local image descriptor as illustrated in FIG. 1a has 4×4×8=128 elements. The SIFT method is presented in greater detail in David G. Lowe, “Distinctive image features from scale-invariant keypoints”, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.
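By way of illustration only, the accumulation of gradient samples into n×n orientation histograms described above can be sketched as follows. This is a simplified sketch and not the SIFT method as published: it assumes a greyscale patch that has already been rotated and scaled to the keypoint, and it omits the Gaussian weighting, interpolation between bins and normalisation used in practice. The function name `grid_orientation_histograms` and its parameters are hypothetical.

```python
import numpy as np

def grid_orientation_histograms(patch, n=4, bins=8):
    """Accumulate gradient magnitudes into n x n orientation histograms."""
    patch = np.asarray(patch, dtype=float)
    # Finite-difference gradients along rows (y) and columns (x).
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    # Gradient orientation mapped to the range [0, 2*pi).
    orientation = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # Orientation bin of every sample; the clamp guards against rounding to 2*pi.
    bin_index = np.minimum((orientation / (2 * np.pi) * bins).astype(int), bins - 1)

    height, width = patch.shape
    # Subregion row and column of every sample point.
    rows = np.arange(height) * n // height
    cols = np.arange(width) * n // width

    descriptor = np.zeros((n, n, bins))
    # np.add.at accumulates correctly when the same index occurs many times.
    np.add.at(descriptor, (rows[:, None], cols[None, :], bin_index), magnitude)
    return descriptor.ravel()
```

With the default n=4 and eight bins, the returned vector has 4×4×8=128 elements, matching the descriptor length stated above.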
A number of alternatives and variations of the SIFT method exist, employing different mechanisms for the detection of stable image keypoints, different approaches to the subdivision of the local region around keypoints and different approaches to the computation of subregion gradient histograms. For example, FIGS. 2a and 2b respectively show log-polar spatial subdivisions characteristic of other techniques like the Gradient Location Orientation Histogram (GLOH) described in K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors”, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10):1615-1630, and the Uncompressed Histogram of Gradients (UHoG) described in Chandrasekhar et al., “Compressed Histogram of Gradients: A Low-Bitrate Descriptor”, International Journal of Computer Vision, Vol. 94, No. 5, May 2011, as alternatives to the Cartesian spatial subdivision employed in the SIFT method.
As another example, FIGS. 3a and 3b show approaches for the computation of gradient histograms based on a subdivision of the 2-dimensional space of the x and y components of the gradients into bins, characteristic of UHoG, as an alternative to the subdivision of the 360° range of gradient orientations into bins which is employed in the SIFT method.
The above-mentioned prior art techniques are considered here only as examples of techniques producing image descriptors from which the present invention computes robust, discriminative, scalable and compact image descriptors.
Although such image descriptors have found wide applicability in many computer vision applications as discussed earlier, their storage and transmission costs, as defined by their size in bytes, are commonly considered high in certain application areas. This is because, although the size of a local image descriptor for a keypoint in an image may be relatively low, the entire image descriptor will comprise hundreds of such keypoints and associated local descriptors, meaning the entire image descriptor can have a size comparable to a JPEG compressed version of the actual image from which it is extracted. One such application area where this level of descriptor size is considered problematic is visual search using mobile terminals. Although different architectures are feasible in this application area, one typical architecture entails capture of an image of an object of interest by a mobile terminal client such as a mobile phone, automatic extraction of an image descriptor by the client, transmission of the image descriptor over a wireless communication network to a server which will process the image descriptor and provide an appropriate response, such as the identity or additional information regarding the object of interest, and a return of said response to the client. Thus, it is obvious that minimisation of the amount of information transmitted from the client to the server over the wireless network is desirable. For the benefit of such applications, there has been a significant amount of development in the compression of such image descriptors.
The simplest approach to compressing a histogram of gradients based keypoint descriptor is scalar quantisation of the histogram bin values, that is, reducing the number of bits used in the representation of each bin value individually. In practice, this approach is not commonly used because it is difficult to achieve very high compression rates without significantly compromising the discriminative power of the descriptor. For example, encoding of SIFT descriptor histogram bins with eight bits per bin is commonly used, but results in image descriptors whose size in bytes is commonly considered too large for transmission over wireless networks. On the other hand, scalar quantisation to just a few bits per bin, for example one or two, has been found to compromise the discriminative power of the image descriptor.
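By way of illustration only, scalar quantisation and the resulting descriptor size can be sketched as follows. The uniform quantiser, the scaling by the largest bin value, the function names and the figure of 500 keypoints per image are all assumptions made for this example and are not taken from any particular prior art scheme.

```python
import numpy as np

def scalar_quantise(descriptor, bits):
    """Uniformly quantise each bin on its own to `bits` bits (bits <= 8)."""
    descriptor = np.asarray(descriptor, dtype=float)
    levels = 2 ** bits
    peak = descriptor.max()
    if peak == 0:
        return np.zeros(descriptor.shape, dtype=np.uint8)
    # Map the range [0, peak] onto the integer codes 0 .. levels - 1.
    scaled = descriptor / peak * (levels - 1)
    return np.rint(scaled).astype(np.uint8)

def descriptor_bytes(num_keypoints, num_bins, bits):
    """Payload size in bytes, ignoring keypoint coordinates and headers."""
    total_bits = num_keypoints * num_bins * bits
    return -(-total_bits // 8)  # ceiling division

# Assumed example: 500 keypoints, each with a 128-bin descriptor.
size_at_8_bits = descriptor_bytes(500, 128, 8)  # 64000 bytes
size_at_1_bit = descriptor_bytes(500, 128, 1)   # 8000 bytes
```

Under these assumptions, eight bits per bin gives a payload of 64000 bytes per image, while one bit per bin gives 8000 bytes at the cost of retaining only two levels per bin.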
Therefore, more complex compression schemes have been proposed. A review of such schemes is presented in V. Chandrasekhar et al., “Survey of SIFT compression schemes”, Proceedings of International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, August 2010.
Briefly, schemes revolving around vector quantisation, whereby the bin values are jointly quantised by mapping them to one of a finite number of representative vector centroids, have been particularly popular and investigated in various forms, such as tree-structured and product vector quantisation. The drawback of such approaches is that they entail a relatively high computational complexity and quite significant memory requirements, from hundreds of kilobytes to several megabytes or more, for the storage of the centroids, the number of which can range from thousands to millions, and the determination of which also requires a computationally complex training phase.
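By way of illustration only, the encoding step of a flat vector quantiser and the storage needed for its centroids can be sketched as follows. The exhaustive nearest-centroid search, the function names and the example codebook size are assumptions for this sketch; tree-structured and product quantisers organise the search differently, and the training of the centroids is not shown.

```python
import numpy as np

def vq_encode(descriptors, centroids):
    """Return, for each descriptor, the index of its nearest centroid."""
    descriptors = np.asarray(descriptors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise differences, shape (num_descriptors, num_centroids, dims).
    diff = descriptors[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distances, shape (num_descriptors, num_centroids).
    distances = np.einsum("ijk,ijk->ij", diff, diff)
    return distances.argmin(axis=1)

def codebook_bytes(num_centroids, dims, bytes_per_value=1):
    """Memory needed to hold the centroids themselves."""
    return num_centroids * dims * bytes_per_value

# Assumed example: 10000 centroids of 128 dimensions, one byte per value.
example_codebook_size = codebook_bytes(10000, 128)  # 1280000 bytes
```

Even this assumed codebook occupies over a megabyte, and every encoded descriptor requires a distance computation against the stored centroids, which illustrates the memory and complexity drawbacks noted above.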
Schemes revolving around type coding have also been thoroughly investigated, whereby bin values are again jointly quantised by forming a uniform lattice of types within the space containing all possible input vectors and, for any given input vector, encoding it by the index of the type which is closest to it. The memory requirements of such approaches are reduced compared to vector quantisation approaches, but it has also been found that the resultant compressed descriptors do not compare well to vector quantised descriptors in terms of recognition performance at high compression rates. Overall, the computational costs associated with type coding are significantly higher than for simple scalar quantisation.
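By way of illustration only, the mapping of a histogram to a nearby type can be sketched as follows. Here a type is taken to be a vector of non-negative integer counts summing to a chosen total n, so that the types form a uniform lattice over normalised histograms. The rounding-and-correction procedure, the function names and the example parameters are assumptions for this sketch; the histogram is assumed to have a non-zero sum, and the enumeration that turns a type into its index is not shown.

```python
import numpy as np
from math import ceil, comb, log2

def nearest_type(histogram, n):
    """Quantise a histogram to integer counts that sum exactly to n."""
    hist = np.asarray(histogram, dtype=float)
    p = hist / hist.sum()          # normalise to a probability vector
    counts = np.floor(n * p + 0.5).astype(int)
    error = counts - n * p         # positive where a bin was rounded up
    surplus = counts.sum() - n
    if surplus > 0:
        # Too many counts: decrement the bins that were rounded up the most.
        counts[np.argsort(error)[-surplus:]] -= 1
    elif surplus < 0:
        # Too few counts: increment the bins that were rounded down the most.
        counts[np.argsort(error)[:-surplus]] += 1
    return counts

def type_index_bits(m, n):
    """Bits needed to index every type with m bins and total count n."""
    return ceil(log2(comb(n + m - 1, m - 1)))
```

For example, under these assumptions an eight-bin histogram quantised with n=10 has C(17, 7)=19448 possible types, so `type_index_bits(8, 10)` gives 15 bits for the whole histogram, with no centroid table to store.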
Other compression schemes utilise known dimensionality reduction methods, such as PCA, on keypoint descriptors, for example 128-dimensional SIFT keypoint descriptors, followed by scalar quantisation of the resultant dimensions. A key problem with such approaches is that they entail high computational complexity and a high risk of overtraining.
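By way of illustration only, dimensionality reduction by PCA followed by scalar quantisation can be sketched as follows. The use of a singular value decomposition to obtain the principal directions, the symmetric clipping range `limit`, the function names and the default of four bits per dimension are assumptions for this sketch rather than features of any particular prior art scheme.

```python
import numpy as np

def fit_pca(training, num_components):
    """Learn a mean and principal directions from training descriptors."""
    training = np.asarray(training, dtype=float)
    mean = training.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(training - mean, full_matrices=False)
    return mean, vt[:num_components]

def project_and_quantise(descriptors, mean, components, bits=4, limit=1.0):
    """Project onto the principal directions, then quantise each dimension."""
    projected = (np.asarray(descriptors, dtype=float) - mean) @ components.T
    levels = 2 ** bits
    # Clip to the assumed range [-limit, limit] and map onto 0 .. levels - 1.
    clipped = np.clip(projected, -limit, limit)
    codes = np.rint((clipped + limit) / (2 * limit) * (levels - 1))
    return codes.astype(np.uint8)
```

The projection must be learned from a training set, here with a decomposition of the whole training matrix, and its usefulness depends on how well that set represents the descriptors encountered later, which reflects the complexity and overtraining concerns noted above.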
To sum up, existing approaches to the compression of histogram of gradient based descriptors and the generation of robust, discriminative, scalable and compact image descriptors exhibit certain drawbacks.
A simple approach such as scalar quantisation of the descriptor elements has the benefit of very low computational complexity and memory requirements, but has been found to compromise the discriminative power of the descriptors at high compression rates.
More complex approaches have been shown to achieve better performance at high compression rates, but suffer from different drawbacks. Vector quantisation approaches have significantly increased computational complexity and memory requirements. Type coding approaches entail increased complexity and, while not burdened by the memory requirements of the vector quantisation approaches, have been found to underperform compared to such approaches. Furthermore, neither vector quantisation nor type coding approaches are well suited to dimensionality reduction in the compressed domain. Approaches based on known dimensionality reduction techniques, such as PCA, have also been employed, but suffer from high computational complexity and a high risk of overtraining.