1. Field of the Invention
The present invention relates to a method for efficiently encoding, transcoding, decoding and processing image descriptors computed in local regions around image interest keypoints and to an image processing device comprising means for encoding, transcoding, decoding and processing such descriptors.
2. Present State of the Art
Such image descriptors have found wide applicability in many computer vision applications including object recognition, content-based image retrieval, and image registration, to name a few.
Existing approaches to the encoding of such descriptors exhibit certain drawbacks.
For example, existing encoding approaches result in descriptors which require parsing of the whole descriptors to perform transcoding, whereby a descriptor of a given descriptor length is converted to a descriptor of a different descriptor length, or to perform decoding and comparison of descriptors of different lengths.
As another example, existing encoding approaches are inefficient in terms of encoding complexity because they ignore the commonalities and redundancies in the operations which are required to produce variable-length image descriptors.
The not yet published Italian patent application no. TO2012A000602 filed by the Applicant itself describes the encoding of local image descriptors, whereby robust, discriminative, scalable and compact image descriptors are computed from image descriptors employing histograms of gradients based on the transformation of said histograms of gradients, where said transformation captures the salient and robust information contained therein in the form of the shape of the distributions and the relationship among their bin values.
In said not yet published Italian patent application encoding methods of said descriptors are disclosed which are more efficient than the prior art methods in terms of producing easily scalable bitstreams.
Such descriptors are disclosed in the above mentioned not yet published Italian patent application no. TO2012A000602 which discloses the computation of robust, discriminative, scalable and compact image descriptors from image descriptors employing histograms of gradients based on the transformation of said histograms of gradients, where said transformation captures the salient and robust information contained therein in the form of the shape of the distributions and the relationship among their bin values.
Important aspects of the computation of robust, discriminative, scalable and compact image descriptors from image descriptors employing histograms of gradients, in particular a SIFT image descriptor, according to the not yet published Italian patent application no. TO2012A000602 are hereinbelow described.
Briefly, with the SIFT method, local image descriptors are formed as follows: first, a search across multiple image scales and locations is performed to identify and localise stable image keypoints that are invariant to scale and orientation; then, for each keypoint, one or more dominant orientations are determined based on local image gradients, allowing the subsequent local descriptor computation to be performed relative to the assigned orientation, scale and location of each keypoint, thus achieving invariance to these transformations. Then, local image descriptors around keypoints are formed as follows: first, gradient magnitude and orientation information is calculated at image sample points in a region around the keypoint; then, these samples are accumulated into orientation histograms summarizing the contents over n×n subregions.
By way of illustration only, an example of a SIFT keypoint descriptor is shown in FIGS. 1a and 1b, where FIG. 1a shows a subdivision of a local region R into 4×4 subregions SR and FIG. 1b shows a subdivision of the 360° range of orientations into eight bins for each orientation histogram, with the length of each arrow corresponding to the magnitude of that histogram entry. Thus, a local image descriptor as illustrated in FIG. 1 has 4×4×8=128 elements. More details of the SIFT technique can be found in David G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.
According to the not yet published Italian patent application no. TO2012A000602, a robust, discriminative, scalable and compact image descriptor may be calculated from a SIFT descriptor as follows.
In the following description, H is an entire SIFT descriptor comprising 16 histograms of gradients h, each with eight bins h, whereas V is an entire local descriptor according to the present invention comprising 16 subdescriptors v, each with eight elements v.
Let H denote a SIFT local image descriptor comprising 16 histograms of gradients h0-h15, as shown in FIG. 2a, each histogram comprising eight bin values h0-h7, as shown in FIG. 2b. A more robust, discriminative, scalable and compact image descriptor may be computed by transforming each of h0-h15 of H and then performing scalar quantisation on the resultant transformed values. More specifically, each of h0-h15 is transformed according to Transform A or Transform B, as shown below, according to the transform utilisation information of FIG. 3, i.e., Transform A is applied to h0, h2, h5, h7, h8, h10, h13, h15 and Transform B is applied to h1, h3, h4, h6, h9, h11, h12, h14, giving the transformed descriptor V with subdescriptors v0-v15, corresponding to h0-h15 respectively, and each comprising elements v0-v7, giving a total of 128 elements.
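Transform A:
\[
\begin{aligned}
v_0 &= h_2 - h_6 \\
v_1 &= h_3 - h_7 \\
v_2 &= h_0 - h_1 \\
v_3 &= h_2 - h_3 \\
v_4 &= h_4 - h_5 \\
v_5 &= h_6 - h_7 \\
v_6 &= (h_0 + h_4) - (h_2 + h_6) \\
v_7 &= (h_0 + h_2 + h_4 + h_6) - (h_1 + h_3 + h_5 + h_7)
\end{aligned}
\tag{1}
\]
Transform B:
\[
\begin{aligned}
v_0 &= h_0 - h_4 \\
v_1 &= h_1 - h_5 \\
v_2 &= h_7 - h_0 \\
v_3 &= h_1 - h_2 \\
v_4 &= h_3 - h_4 \\
v_5 &= h_5 - h_6 \\
v_6 &= (h_1 + h_5) - (h_3 + h_7) \\
v_7 &= (h_0 + h_1 + h_2 + h_3) - (h_4 + h_5 + h_6 + h_7)
\end{aligned}
\tag{2}
\]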
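By way of illustration only, Transform A of (1) and Transform B of (2) can be sketched in Python as follows. The function names `transform_a`, `transform_b` and `transform_descriptor` and the constant `TRANSFORM_A_HISTOGRAMS` are hypothetical names introduced for this sketch; the assignment of transforms to histograms follows the list given above for FIG. 3.

```python
def transform_a(h):
    """Transform A of equation (1); h holds the eight bin values h0..h7."""
    return [
        h[2] - h[6],
        h[3] - h[7],
        h[0] - h[1],
        h[2] - h[3],
        h[4] - h[5],
        h[6] - h[7],
        (h[0] + h[4]) - (h[2] + h[6]),
        (h[0] + h[2] + h[4] + h[6]) - (h[1] + h[3] + h[5] + h[7]),
    ]


def transform_b(h):
    """Transform B of equation (2); h holds the eight bin values h0..h7."""
    return [
        h[0] - h[4],
        h[1] - h[5],
        h[7] - h[0],
        h[1] - h[2],
        h[3] - h[4],
        h[5] - h[6],
        (h[1] + h[5]) - (h[3] + h[7]),
        (h[0] + h[1] + h[2] + h[3]) - (h[4] + h[5] + h[6] + h[7]),
    ]


# Histograms transformed with Transform A; all others use Transform B.
TRANSFORM_A_HISTOGRAMS = frozenset({0, 2, 5, 7, 8, 10, 13, 15})


def transform_descriptor(H):
    """Map 16 histograms of 8 bins to 16 subdescriptors of 8 elements."""
    return [
        transform_a(h) if i in TRANSFORM_A_HISTOGRAMS else transform_b(h)
        for i, h in enumerate(H)
    ]
```

Every element is a difference of bin values or of sums of bin values, so the transformed descriptor reflects the shape of each histogram rather than its absolute magnitude.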
Then, each element undergoes coarse scalar quantisation, for example ternary (3-level) quantisation, with the quantisation thresholds selected so as to achieve a specific occurrence probability distribution among the quantisation bins for each element. This scalar quantisation produces the quantised descriptor Ṽ with subdescriptors ṽ0-ṽ15, each comprising elements ṽ0-ṽ7, again with a total of 128 elements. This compact descriptor captures the most discriminative and robust information contained in the original histograms of gradients, in the form of the shape of the distributions and the relationship among their bin values.
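A minimal sketch of ternary quantisation is given below, purely as an illustration. The text does not specify how the thresholds are obtained or how the three levels are labelled, so the empirical-quantile rule in `select_thresholds`, the level labels 0, 1 and 2, and all function names are assumptions of this sketch.

```python
def select_thresholds(samples, p_low, p_high):
    """Pick (lower, upper) from training samples of one element.

    Assumption: about p_low of the samples should fall in the lowest
    bin and about p_high in the highest bin, with 0 <= p_low, p_high < 1.
    """
    s = sorted(samples)
    n = len(s)
    lower = s[int(p_low * n)]
    upper = s[n - 1 - int(p_high * n)]
    return lower, upper


def quantise_ternary(value, lower, upper):
    """Map a transformed element value to one of three levels (0, 1, 2)."""
    if value < lower:
        return 0
    if value > upper:
        return 2
    return 1


def quantise_descriptor(V, thresholds):
    """Quantise every element; thresholds[i][j] is a (lower, upper) pair."""
    return [
        [quantise_ternary(v, *thresholds[i][j]) for j, v in enumerate(sub)]
        for i, sub in enumerate(V)
    ]
```

Keeping a separate threshold pair per element allows the occurrence probabilities to be tuned individually for each of the 128 elements.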
A key advantage of descriptor V, as well as its quantised version Ṽ, is that it is highly scalable, and its dimensionality may be easily reduced if required by an application's storage requirements or a transmission channel's characteristics by simply eliminating one or more of its elements. For the sake of simplicity, in the description that follows there will be described important aspects of the invention in terms of the encoding of pre-quantised descriptor V with subdescriptors v0-v15, each comprising elements v0-v7 and, unless otherwise stated, it should be understood that the encoding of the quantised descriptor Ṽ proceeds in a similar manner.
FIGS. 4a-4e show exemplary sets of elements which have been found to produce excellent discriminative power and robustness for five target descriptor lengths, from descriptor length 0 (DL0), the shortest descriptor length utilising only 20 descriptor elements, to descriptor length 4 (DL4), the longest descriptor length utilising all 128 elements. More specifically, FIG. 4a shows an exemplary set of elements for descriptor length DL0 comprising 20 elements, FIG. 4b shows an exemplary set of elements for descriptor length DL1 comprising 40 elements, FIG. 4c shows an exemplary set of elements for descriptor length DL2 comprising 64 elements, FIG. 4d shows an exemplary set of elements for descriptor length DL3 comprising 80 elements, and FIG. 4e shows an exemplary set of elements for descriptor length DL4 comprising all 128 elements. Thus, for each descriptor length, each element of each subdescriptor will or will not be encoded according to the element utilisation sets of FIGS. 4a-4e.
Key to this scalability property is that the set of utilised elements for each descriptor length must be the same as or a subset of the set of utilised elements for all higher descriptor lengths, as illustrated in FIGS. 4a-4e. This allows the transcoding and comparison of descriptors of different lengths by simple elimination of the excess elements of the descriptor with the higher descriptor length so that it is reduced to the same set of elements as the descriptor with the lower descriptor length.
A straightforward encoding method of this descriptor comprises calculating and encoding the elements in a “by-subdescriptor” order, i.e., in the general case as v0,0, v0,1, . . . , v0,7, v1,0, v1,1, . . . , v1,7, . . . , v15,0, v15,1, . . . , v15,7 where vi,j denotes element vj of subdescriptor vi. This means encoding elements v0, v1, . . . , v7 for transformed histogram v0, then encoding elements v0, v1, . . . , v7 for transformed histogram v1, etc., using the appropriate transforms, for example as illustrated in FIG. 3, and also using the appropriate element utilisation sets for the desired descriptor length, for example as illustrated in FIG. 4, to decide which elements should be encoded.
This encoding results, for example for a descriptor length DL0, in a descriptor v0,0, v1,0, v2,0, v3,0, v4,0, v5,0, v5,6, v6,0, v6,6, v7,0, v8,0, v9,0, v9,6, v10,0, v10,6, v11,0, v12,0, v13,0, v14,0, v15,0 and for a descriptor length DL1 in a descriptor v0,0, v0,1, v1,0, v1,1, v2,0, v2,1, v3,0, v3,1, v4,0, v4,1, v5,0, v5,1, v5,2, v5,6, v6,0, v6,1, v6,2, v6,6, v7,0, v7,1, v8,0, v8,1, v9,0, v9,1, v9,2, v9,6, v10,0, v10,1, v10,2, v10,6, v11,0, v11,1, v12,0, v12,1, v13,0, v13,1, v14,0, v14,1, v15,0, v15,1.
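The two element lists above can be reproduced with a short sketch, given here for illustration only. The utilisation sets are transcribed from the DL0 and DL1 lists in the preceding paragraph; the names `utilisation_set`, `by_subdescriptor_order` and `EXTRA_SUBDESCRIPTORS` are hypothetical. The sets for DL2 to DL4 are not listed in the text and are therefore not modelled.

```python
# Subdescriptors that carry additional elements at DL0 and DL1,
# as read off the element lists in the text.
EXTRA_SUBDESCRIPTORS = (5, 6, 9, 10)


def utilisation_set(dl):
    """Return the (subdescriptor, element) pairs in use at DL0 or DL1."""
    if dl == 0:
        common, extra = (0,), (6,)
    elif dl == 1:
        common, extra = (0, 1), (2, 6)
    else:
        raise ValueError("only DL0 and DL1 are listed in the text")
    used = {(i, j) for i in range(16) for j in common}
    used |= {(i, j) for i in EXTRA_SUBDESCRIPTORS for j in extra}
    return used


def by_subdescriptor_order(used):
    """Order pairs by subdescriptor index first, then by element index."""
    return sorted(used)


dl0 = utilisation_set(0)
dl1 = utilisation_set(1)
assert len(dl0) == 20 and len(dl1) == 40
assert dl0 <= dl1  # the shorter set is a subset of the longer one
```

The final assertion checks the scalability property: every element in use at DL0 is also in use at DL1.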
FIG. 5 illustrates the operation of such a straightforward encoder as a sequence of steps. In the following description, as well as in subsequent descriptions of an encoder's operation, unless otherwise specified, such a sequence of steps corresponds to steps which are conceptual and do not correspond to specific hardware or software implementations, components and instructions, but are representative of the overall operation of the encoder. More specifically, FIG. 5 illustrates the operation of an encoder for a descriptor length DLk, for example corresponding to one of the descriptor lengths illustrated in FIG. 4. In step S100 of FIG. 5, the encoding of the descriptor begins at the first subdescriptor, i.e., v0. In step S110, the appropriate transform is selected for the subdescriptor being processed, for example according to the transform utilisation of FIG. 3. It should be noted that the computation of descriptor V from descriptor H according to two different transforms as described here is only an example. The computation of descriptor V from descriptor H may also be performed according to a single transform, for example only Transform A or only Transform B, rendering step S110 unnecessary, or according to more than two transforms. In step S120, the encoding of the subdescriptor being processed begins at the first subdescriptor element, i.e., v0. Then, in step S130, the use or not of the particular element of the particular subdescriptor, i.e., v0,0 is checked against the element utilisation information for descriptor length DLk, for example using one of the utilisation sets of FIG. 4. If the element is not in use, then processing moves to step S150. If the element is in use for the descriptor length DLk, then its encoding takes place in step S140.
Here, as well as in subsequent descriptions of an encoder's operation, unless otherwise specified, the word “encoding” means one or more actions, or combination thereof, that make the element v0,0 part of the local image descriptor, said actions including, by way of example and without limitation, the calculation according to the appropriate transform function of (1) or (2) seen earlier, the selection of the element for inclusion into the local image descriptor in the case all elements are pre-calculated without knowledge of which elements will be finally used in the descriptor, the quantisation of the element value, the storage of the element in volatile or non-volatile memory and the transmission of the element along a transmission channel. After step S140, or if it was decided that the element is not in use for the descriptor length DLk in step S130, the processing moves to step S150. In step S150, if the current element is not the last element of the subdescriptor, the processing moves to the next element of the subdescriptor, otherwise the processing moves to step S160. In step S160, if the current subdescriptor is not the last subdescriptor of the local image descriptor, the processing moves to the next subdescriptor of the local image descriptor, otherwise the processing ends. Thus, it is clear that steps S100, S120, S150, and S160 relate to the order in which the processing is performed, while steps S110, S130 and S140 relate to the actual encoding of the local image descriptor.
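The steps of FIG. 5 can be sketched, for illustration only, as two nested loops. This sketch assumes that all elements of V have been pre-calculated, so that "encoding" reduces to selecting an element for inclusion and step S110 is not needed; the function name `encode_by_subdescriptor` and the placeholder data are hypothetical.

```python
def encode_by_subdescriptor(V, used):
    """Select the elements of V in use, in by-subdescriptor order.

    V is a list of subdescriptors, each a list of element values, and
    `used` is a set of (subdescriptor, element) index pairs.
    """
    encoded = []
    for i in range(len(V)):              # S100 / S160: iterate subdescriptors
        # S110 (transform selection) is omitted: V is assumed pre-computed.
        for j in range(len(V[i])):       # S120 / S150: iterate elements
            if (i, j) in used:           # S130: is the element in use for DLk?
                encoded.append(V[i][j])  # S140: "encode" (here: select)
    return encoded


# Placeholder data: element v_{i,j} is represented by the pair (i, j).
V = [[(i, j) for j in range(8)] for i in range(16)]
# DL0 utilisation set, transcribed from the DL0 element list in the text.
DL0 = {(i, 0) for i in range(16)} | {(i, 6) for i in (5, 6, 9, 10)}
encoded = encode_by_subdescriptor(V, DL0)
```

Note that the membership test of step S130 is executed for all 128 elements even though only 20 are encoded at DL0.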
Another straightforward encoding method of this descriptor comprises calculating and encoding the elements in a “by-element” order, i.e., in the general case as v0,0, v1,0, . . . , v15,0, v0,1, v1,1, . . . , v15,1, . . . , v0,7, v1,7, . . . , v15,7 i.e. encoding element v0 for subdescriptors v0, v1, . . . , v15, then encoding element v1 for subdescriptors v0, v1, . . . , v15, etc., again using the appropriate transforms, for example as illustrated in FIG. 3, and also using the appropriate element utilisation sets for the desired descriptor length, for example as illustrated in FIG. 4, to decide which elements should be encoded. Such an encoder may operate in an analogous fashion to the encoder of FIG. 5, with the appropriate reordering of steps. In general, neither of the two aforementioned methods offers an advantage over the other method. For the purposes of transcoding, decoding and processing, the decoder must also know the encoding process and the element ordering and utilisation sets to be able to process and compare descriptors, possibly of different lengths, for the purposes of the related computer vision applications. Thus, the element utilisation sets must be either permanently fixed or stored/transmitted alongside the descriptors. In this context, the straightforward encoding process is disadvantageous.
More specifically, such an encoding ignores the relative importance between different elements in the encoding order. Consequently, in terms of transcoding, whereby a descriptor of a given descriptor length is converted to a descriptor of a different descriptor length, or in terms of decoding and comparing descriptors of different lengths by comparing corresponding elements between the two descriptors, such an encoding necessitates parsing of the descriptors to achieve the desired result.
Furthermore, such an encoding ignores the redundancy patterns in the relative importance between different elements and is unnecessarily complex with regard to deciding whether specific elements should be encoded or not.
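The parsing drawback can be made concrete with a small sketch, given for illustration only. It lists the positions within a DL1 descriptor that must be kept when transcoding to DL0, for both the by-subdescriptor and the by-element orders. The utilisation sets are transcribed from the DL0 and DL1 element lists given earlier; the function names are hypothetical.

```python
def stream_order(used, by_element=False):
    """List the (subdescriptor, element) pairs of `used` in encoding order."""
    if by_element:
        return sorted(used, key=lambda pair: (pair[1], pair[0]))
    return sorted(used)


def positions_to_keep(src_used, dst_used, by_element=False):
    """Positions in the longer descriptor that survive transcoding."""
    order = stream_order(src_used, by_element)
    return [n for n, pair in enumerate(order) if pair in dst_used]


EXTRA = (5, 6, 9, 10)
DL0 = {(i, 0) for i in range(16)} | {(i, 6) for i in EXTRA}
DL1 = ({(i, j) for i in range(16) for j in (0, 1)}
       | {(i, j) for i in EXTRA for j in (2, 6)})

keep_sub = positions_to_keep(DL1, DL0)                    # by-subdescriptor
keep_elem = positions_to_keep(DL1, DL0, by_element=True)  # by-element

# In neither order do the DL0 elements form the first 20 positions of
# the DL1 descriptor, so transcoding cannot be done by truncation.
assert keep_sub != list(range(20))
assert keep_elem != list(range(20))
```

In the by-subdescriptor order the retained positions are scattered throughout the DL1 descriptor, and in the by-element order the last four retained elements sit at the very end, so in both cases the whole descriptor has to be parsed.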