Visual Search (VS) is the capability of an automated system to identify one or more objects depicted in an image or in a sequence of images by analyzing only the visual content of the image or sequence, without exploiting external data such as textual descriptions or metadata. Augmented Reality (AR) can be considered an advanced use of VS applied to the mobile domain. After the objects depicted in a sequence of images have been identified, additional content, normally synthetic objects, is superimposed on the real scene, thereby ‘augmenting’ the real content at a position consistent with the real objects. The enabling technology for identifying objects depicted in the sequence of images is the same in both cases. In the following, the terms image and picture are used synonymously.
Currently, the predominant method of visual search relies on determining so-called local features, which are also referred to as features or descriptors. Common methods are the Scale-Invariant Feature Transform (SIFT), as described in “D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, Int. Journal of Computer Vision 60 (2) (2004) 91-110,” and Speeded Up Robust Features (SURF), as described in “H. Bay, T. Tuytelaars, L. V. Gool, SURF: Speeded Up Robust Features, in: Proceedings of European Conference on Computer Vision (ECCV), Graz, Austria, 2006, http://www.vision.ee.ethz.ch/~surf/”. Many variations of these two original technologies, which can be considered improvements on them, can be found in the literature.
As can be seen from FIG. 13, a local feature is a compact description (e.g., 128 bytes for each feature in SIFT) of a patch 1303 surrounding a point 1305 in an image 1301. FIG. 13 shows an example of extraction (upper part of FIG. 13) and representation (lower part of FIG. 13) of local features. In the upper part of FIG. 13, the position of each point where a local feature is computed is indicated by a circle representing the point 1305 in the image 1301. The circle is surrounded by a square representing the oriented patch 1303. In the lower part of FIG. 13, a grid 1309 subdividing the patch 1303 contains histogram components 1311 of the local feature. In order to compute a local feature, a main orientation 1307 of the point 1305 is computed based on the main gradient component in the surroundings of the point 1305. Starting from this orientation 1307, a patch 1303 oriented towards the main orientation 1307 is extracted. This patch 1303 is then subdivided into a rectangular or radial grid 1309. For each element of the grid 1309, a histogram 1311 of the local gradients is computed. The histograms 1311 computed for the grid 1309 elements represent the components of the local feature. A characteristic of such a descriptor 1313, which contains the histograms 1311 of the grid 1309 elements as illustrated in the lower part of FIG. 13, is that it is invariant to rotation, illumination, and perspective distortions.
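The grid-of-histograms structure described above can be illustrated with a deliberately simplified sketch. It assumes a square grayscale patch that has already been extracted and oriented, a rectangular 4×4 grid, and 8 orientation bins (giving 128 components). The function name `grid_histogram_descriptor` and its parameters are hypothetical, and the sketch omits the main-orientation estimation, weighting, and interpolation that real SIFT or SURF implementations perform.

```python
import math

def grid_histogram_descriptor(patch, grid=4, bins=8):
    """Simplified local feature: one gradient-orientation histogram per grid cell.

    patch: square list of lists of intensities whose side is divisible by `grid`.
    Assumes the patch is already aligned with the main orientation.
    """
    size = len(patch)
    cell = size // grid
    desc = [0.0] * (grid * grid * bins)
    # Skip the one-pixel border so central differences stay inside the patch.
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = min(bins - 1, int(angle / (2 * math.pi) * bins))
            cy, cx = y // cell, x // cell
            # Accumulate the gradient magnitude in the cell's orientation bin.
            desc[(cy * grid + cx) * bins + b] += magnitude
    # L2 normalisation reduces sensitivity to overall contrast changes.
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm else desc
```

For a 16×16 patch this returns a 128-element vector, one 8-bin histogram for each of the 16 grid cells.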
In an image 1301, the points 1305 upon which descriptors 1313 are computed normally relate to distinctive elements of the scene, e.g., corners, specific patterns, etc. Such points are normally called key points 1305 and are depicted as circles in the upper part of FIG. 13. The computation of the key points 1305 is based on the identification of local extrema in a multi-scale representation of the image 1301.
When two images 1301, 1401 are compared as shown in FIG. 14, each descriptor 1313 of the first image 1301 is compared against each descriptor of the second image 1401. FIG. 14 illustrates only the images 1301, 1401 and not the descriptors. Using a distance measure, matches are identified between different key points, e.g., between a first key point 1305 in the first image 1301 and a second key point 1405 in the second image 1401. The correct matches, normally called inliers 1407, need to have consistent relative positions despite possible scaling, rotation, and perspective distortions in the images 1301, 1401. Errors in the matching phase, which may occur due to the statistical approach adopted for key point extraction, are then eliminated in a phase called geometric consistency check, in which the consistency of the positions of different key points is estimated. The errors, normally called outliers 1409, are removed as illustrated by the dashed lines in FIG. 14.
Based on the number of remaining inliers 1407, the presence of the same object in the two images 1301, 1401 can be estimated.
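The exhaustive descriptor comparison described above can be sketched as a brute-force nearest-neighbour search. This is only an illustration: the Euclidean distance, the `max_distance` threshold, and the function names are assumptions, and the geometric consistency check that separates inliers from outliers is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(query, reference, max_distance):
    """Compare every query descriptor against every reference descriptor.

    Returns (query_index, reference_index) pairs for which the nearest
    reference descriptor lies within max_distance (a hypothetical threshold).
    """
    matches = []
    for qi, q in enumerate(query):
        best_index, best_dist = None, float("inf")
        for ri, r in enumerate(reference):
            d = euclidean(q, r)
            if d < best_dist:
                best_index, best_dist = ri, d
        if best_index is not None and best_dist <= max_distance:
            matches.append((qi, best_index))
    return matches
```

The candidate matches returned here would still contain outliers; only the pairs that survive a subsequent geometric consistency check count as inliers.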
In a VS pipeline system 1500 representing a typical client-server service architecture, as illustrated in FIG. 15, descriptors are computed on a client device 1501 by a procedure of key point identification 1505, features computation 1507, features selection 1509 as described below, and encoding 1511. The descriptors are sent to a server 1503 that matches 1513 those descriptors 1519 against the reference descriptors 1521 extracted from the reference images in the database. In detail, the data stream 1515 from the client 1501 is decoded 1517 to obtain the descriptors 1519 of the original image, which are matched 1513 against the reference descriptors 1521 computed by key point identification 1523 and features computation 1525 from the reference images in the database. After the matching 1513, a geometric consistency check 1527 is applied for checking the geometric consistency of the reconstructed image.
Thousands of features can be extracted from an image. This may result in a considerable amount of information, e.g., several kilobytes per image, being sent over the network. In some scenarios, the bit rate required for sending the descriptors can be larger than that of the compressed image itself.
This represents a problem for real-time applications due to possible network delays in the client/server link and the amount of memory required on the server side where descriptors of millions of reference images are kept in memory at the same time. Therefore, the need for compressed versions of the descriptors is rising. Two steps are needed to enable descriptor compression starting from uncompressed descriptors. The first step is a mechanism of key point selection as follows: not all the descriptors extracted from the image are sent to the server, but only those that, according to a statistical analysis, are less error-prone during the matching phase and refer to points considered more distinctive for the depicted object. The second step is a compression algorithm applied to the remaining descriptors.
The Moving Picture Experts Group (MPEG) is currently defining a new part of the MPEG-7 standard (ISO/IEC 15938, Multimedia content description interface), part 13, dedicated to the development of a standardized format for compressed descriptors. In order to test the compression capabilities of the emerging standard, six operating points, representing the number of bytes necessary to store or send all the descriptors extracted from an image, have been identified: 512, 1024, 2048, 4096, 8192, and 16384 bytes. The testing phase is conducted using those operating points as reference. Due to the application of the key point selection mechanism, a different number of key points will be transmitted to the server at each operating point. This number may range from 114 key points at the lowest operating point to 970 key points at the highest operating point.
When compression is applied to descriptors, two different kinds of information are compressed. The first one relates to the values of the descriptor. The second one is the location information of the descriptors, i.e., the x/y position, which represents the Cartesian coordinates of the key points in the image.
In the current Reference Model (RM) of the VS standard, as well as in the vast majority of the VS algorithms existing in the literature, the image is scaled to Video Graphics Array (VGA) resolution, which is 640×480 pixels, before the descriptor extraction phase. VGA resolution is hereinafter referred to as full resolution.
Therefore, an uncompressed x/y pair describing the position of a single key point in the image can occupy 19 bits, i.e., 10 bits for the x coordinate (0 to 639) and 9 bits for the y coordinate (0 to 479). This is unacceptable, in particular at the lowest operating points. Therefore, compression of location information is needed in order to free bits for inserting more descriptors or for applying less restrictive compression algorithms to the descriptors.
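The 19-bit figure, and its weight at the lowest operating point, can be checked with a short calculation. The sketch below assumes a plain fixed-length binary code for each coordinate; the function names are illustrative only.

```python
def bits_for_range(n_values):
    """Bits of a fixed-length code able to index n_values distinct integers."""
    return (n_values - 1).bit_length()

def raw_location_bits(width, height, n_keypoints):
    """Total bits needed to send every (x, y) pair uncompressed."""
    return n_keypoints * (bits_for_range(width) + bits_for_range(height))

bits_per_point = bits_for_range(640) + bits_for_range(480)  # 10 + 9 bits
total_bits = raw_location_bits(640, 480, 114)               # lowest operating point
budget_bits = 512 * 8                                       # 512-byte budget
share_of_budget = total_bits / budget_bits
```

With 114 key points at the 512-byte operating point, the raw coordinates alone take 114 × 19 = 2166 of the 4096 available bits, i.e., slightly more than half of the budget.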
The key point coordinates are represented as floating-point values at the original, non-scaled image resolution. Since the first operation applied to every image is downscaling to VGA resolution, the key point coordinates are rounded to integer values at VGA resolution, which natively require 19 bits per key point. Therefore, several points may be rounded to the same coordinates. It is also possible to have two descriptors computed on exactly the same key point with two different orientations. This first rounding has a negligible impact on retrieval performance.
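The rounding step, and the way it can merge distinct key points, can be sketched as follows. The function name `to_vga`, the clamping to the image border, and the use of Python's built-in `round` (which rounds exact halves to the nearest even integer) are assumptions made for illustration.

```python
def to_vga(points, src_width, src_height, dst_width=640, dst_height=480):
    """Scale floating-point key point coordinates to VGA and round to integers."""
    sx = dst_width / src_width
    sy = dst_height / src_height
    rounded = []
    for x, y in points:
        # Clamp so that rounding never produces a coordinate outside the image.
        xi = min(dst_width - 1, max(0, round(x * sx)))
        yi = min(dst_height - 1, max(0, round(y * sy)))
        rounded.append((xi, yi))
    return rounded

# Two distinct key points in a 1280x960 image collapse onto one VGA pixel.
collapsed = to_vga([(100.2, 50.4), (100.9, 50.1)], 1280, 960)
```

In this example both key points end up at the same integer position, which is why a cell of the resulting location matrix may have to record more than one occurrence.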
FIG. 16 depicts an example of such a rounding operation, where each square cell 1603, 1605 corresponds to a 1×1 pixel cell at full resolution. An image 1600 can be created in which non-null pixels correspond to the positions of the key points. The image 1600 is then partitioned into a pixel cell representation 1601, which can be represented by a matrix representation 1602. The values of these square cells 1603, 1605, e.g., 2 for the first square cell 1603 and 1 for the second square cell 1605 as depicted in FIG. 16, are represented in a matrix 1602 in which non-null cells 1607, 1609 represent the positions of key points, e.g., a first non-null cell 1607 corresponding to the first square cell 1603 and a second non-null cell 1609 corresponding to the second square cell 1605. Consequently, the problem can be reformulated as the need to compress a matrix 1602 of 640×480 elements that is extremely sparse, i.e., has fewer than 1000 non-null cells even at the highest operating point. To compress this matrix, two different kinds of information need to be represented: a histogram map, which is a binary map of empty and non-empty cells, and a histogram count, which is a vector containing the number of occurrences in each non-null cell. The histogram map is represented by the binary format of the pixel cell representation 1601 depicted in FIG. 16, and the histogram count is represented by the vector formed by the non-null elements of the matrix representation 1602 depicted in FIG. 16. To improve compression efficiency, these two elements are always encoded separately in the existing literature.
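The split into a binary map and a count vector can be sketched as follows, assuming the key point coordinates have already been rounded to integers. The raster-scan ordering of the count vector and the name `build_histogram` are assumptions; the source does not specify the scan order.

```python
from collections import Counter

def build_histogram(points, width=640, height=480):
    """Return (histogram_map, histogram_count) for integer (x, y) key points.

    histogram_map   : height x width binary matrix, 1 where a cell is non-empty.
    histogram_count : occurrences per non-empty cell, in raster-scan order
                      (row by row, left to right; an assumed convention).
    """
    counts = Counter(points)
    histogram_map = [[0] * width for _ in range(height)]
    for x, y in counts:
        histogram_map[y][x] = 1
    raster_order = sorted((y, x) for x, y in counts)
    histogram_count = [counts[(x, y)] for y, x in raster_order]
    return histogram_map, histogram_count
```

Because the map only says whether a cell is occupied, and the counts only say how many key points fall in each occupied cell, the two can be handed to different coders.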
In the existing literature, a lossy technique based on block quantization is applied to the histogram map to improve compression efficiency. Normally, 4×4 blocks or 8×8 blocks are adopted, which leaves the mechanism for histogram map and histogram count generation unchanged. As a result of this operation, the dimension of the matrix substantially decreases, i.e., down to 160×120 cells when 4×4 blocks are applied and 80×60 cells when 8×8 blocks are applied. Nevertheless, the downscaled matrix still remains a very sparse matrix. In this case, the representation of FIG. 16 is still valid; only the cell dimension changes. In the rest of the disclosure, elements of the histogram map matrix are referred to as matrix cells. The dimension of these matrix cells may range from 1×1 at full resolution to N×N with N>1 (e.g., 8×8) in the compressed cases.
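Block quantization of the key point positions can be sketched as an integer division of each coordinate by the block size. The helper names below are hypothetical, and reconstructing a key point at the centre of its block is an assumption used only to show that the original position cannot be recovered exactly.

```python
def quantized_size(width, height, block):
    """Matrix dimensions (columns, rows) after grouping pixels into block x block cells."""
    return (-(-width // block), -(-height // block))  # ceiling division

def quantize_points(points, block):
    """Map full-resolution integer (x, y) positions to block indices."""
    return [(x // block, y // block) for x, y in points]

def dequantize_points(cells, block):
    """Reconstruct positions at block centres (assumed convention); lossy."""
    return [(cx * block + block // 2, cy * block + block // 2) for cx, cy in cells]

size_4 = quantized_size(640, 480, 4)
size_8 = quantized_size(640, 480, 8)
cells = quantize_points([(13, 7), (15, 4)], 4)
restored = dequantize_points(cells, 4)
```

Both example key points fall into the same 4×4 cell, so after reconstruction they are indistinguishable, which illustrates why quantization harms applications that need precise localization.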
In the existing literature, three main documents present the latest progress in the field of location information compression. The first one is the MPEG Reference Model “G. Francini, S. Lepsoy, M. Balestri “Description of Test Model under Consideration for CDVS”, ISO/IEC JTC1/SC29/WG11/N12367, Geneva, November 2011,” which is referred to herein as [RM].
The second one is the MPEG input contribution “S. Tsai, D. Chen, V. Chandrasekhar, G. Takacs, M. Makar, R. Grzeszczuk, B. Girod, “Improvements to the location coder in the TMuC”, ISO/IEC JTC1/SC29/WG11/M23579672, San Jose, February 2012,” which is referred to herein as [Stanford1]. The third one is the conference paper “S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. Singh, and B. Girod, “Location coding for mobile image retrieval”, International Mobile Multimedia Communications Conference (MobiMedia), September 2009,” which is referred to herein as [Stanford2].
Even though they take different approaches, all three documents share the same problem: the coordinates are not represented at full resolution. Rather, the coordinates are in the quantized domain, i.e., at 4×4, 6×6, or 8×8 blocks.
Despite being a lossy compression, the application of block quantization to the histogram map causes only a limited drop in retrieval accuracy. However, when localization of the recognized object in the query image is necessary, e.g., in augmented reality applications, where the object needs to be localized and tracked across a sequence of pictures, applying these quantized blocks causes a significant drop in performance. For example, according to [Stanford1], the localization precision decreases by about five percent (5%) when 4×4 blocks are applied at the lowest operating point, and by 10% when 8×8 blocks are applied.
When scaling up to full resolution, the prior art presents some problems. Histogram count compression is quite straightforward and is therefore not considered further. The problems that arise in the compression of the histogram map matrix are presented in the following.
The [RM] document adopts a method aimed at decreasing the sparsity of the matrix by eliminating from the histogram map the null rows and columns in which no key points appear. One bit is spent for each row and each column to indicate whether the full row, or column, is empty. The problem at full resolution is that, with a 480×640 matrix, 480 + 640 = 1120 bits are needed to embed this information into the compressed bit stream. This is an unacceptable number of bits, amounting to almost 10 bits per key point at the lowest operating point (114 points).
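The row and column elimination, and its fixed signalling cost, can be sketched as follows. This is an illustration of the idea rather than the reference implementation; the flag polarity (1 meaning "contains at least one key point") and the function names are assumptions.

```python
def strip_empty(histogram_map):
    """Remove all-zero rows and columns from a binary histogram map.

    Returns (row_flags, col_flags, reduced_map), where a flag of 1 marks a
    row or column that contains at least one key point (assumed polarity).
    """
    n_cols = len(histogram_map[0])
    row_flags = [1 if any(row) else 0 for row in histogram_map]
    col_flags = [1 if any(row[c] for row in histogram_map) else 0
                 for c in range(n_cols)]
    reduced_map = [[row[c] for c in range(n_cols) if col_flags[c]]
                   for row, keep in zip(histogram_map, row_flags) if keep]
    return row_flags, col_flags, reduced_map

def flag_overhead_bits(width, height):
    """One flag bit per row plus one per column, independent of the content."""
    return width + height

overhead = flag_overhead_bits(640, 480)
overhead_per_keypoint = overhead / 114  # lowest operating point
```

The overhead is paid regardless of how many key points are present, so it weighs most heavily when few key points are transmitted.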
In [Stanford1], binary entropy coding is adopted over the whole matrix with the following two improvements. First, macro-block analysis is applied, i.e., the matrix is subdivided into macro-blocks, referred to hereinafter as skip-macroblocks, and for each macro-block one bit is allocated to indicate whether the block is empty. If the block is fully empty, its elements do not undergo the entropy coding process. Second, context modeling is applied to the entropy coding, based on the cells surrounding the one to be encoded. In particular, 10 neighbors are considered, resulting in 45 contexts. In addition to its complexity, in particular in the training phase with 45 contexts to be generated, this approach cannot be applied effectively to the full resolution case, where the matrix is so sparse that it is very rare to encounter non-null cells among the 10 nearest cells.
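The skip-macroblock signalling can be sketched in isolation. The macro-block size `mb`, the raster ordering of the flags, and the flag polarity (1 meaning "skip, block is empty") are assumptions, and the context-based entropy coder itself is not reproduced here.

```python
def skip_macroblock_flags(histogram_map, mb):
    """One flag per mb x mb macro-block, in raster order; 1 means the block is empty."""
    height, width = len(histogram_map), len(histogram_map[0])
    flags = []
    for by in range(0, height, mb):
        for bx in range(0, width, mb):
            # Blocks at the right and bottom borders may be smaller than mb x mb.
            occupied = any(histogram_map[y][x]
                           for y in range(by, min(by + mb, height))
                           for x in range(bx, min(bx + mb, width)))
            flags.append(0 if occupied else 1)
    return flags

def cells_left_to_code(histogram_map, mb):
    """Number of cells that still go through entropy coding after skipping."""
    height, width = len(histogram_map), len(histogram_map[0])
    remaining = 0
    for by in range(0, height, mb):
        for bx in range(0, width, mb):
            rows = range(by, min(by + mb, height))
            cols = range(bx, min(bx + mb, width))
            if any(histogram_map[y][x] for y in rows for x in cols):
                remaining += len(rows) * len(cols)
    return remaining
```

Only the cells of non-empty macro-blocks are passed on to the entropy coder; every empty macro-block costs a single flag bit.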
In [Stanford2], two methods are applied. The first one is very similar to that presented in [Stanford1] and presents the same problems. Therefore, it will not be further discussed here. The second one is based on quad-trees. Quad-trees provide a quite effective representation when the matrix is dense, but when the matrix is very sparse, as in the full resolution case, the construction of the tree can consume too many bits, resulting in degraded performance.
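The cost of a quad-tree on a sparse matrix can be illustrated with a generic occupancy quad-tree, which is not necessarily the exact scheme of [Stanford2]. The sketch assumes a square matrix whose side is a power of two and spends one bit per visited node, where 1 means "contains at least one non-null cell". The function name is hypothetical.

```python
def quadtree_bits(histogram_map, x0=0, y0=0, size=None):
    """Count the bits of a simple occupancy quad-tree over a square binary map.

    Each visited node costs one bit. An occupied node larger than one cell
    is split into four quadrants, which are visited recursively.
    """
    if size is None:
        size = len(histogram_map)
    occupied = any(histogram_map[y][x]
                   for y in range(y0, y0 + size)
                   for x in range(x0, x0 + size))
    if not occupied or size == 1:
        return 1
    half = size // 2
    return 1 + sum(quadtree_bits(histogram_map, x0 + dx, y0 + dy, half)
                   for dy in (0, half) for dx in (0, half))
```

Under these assumptions, a single isolated key point in a grid of side 2^k costs 1 + 4k bits. A 640×480 map padded to 1024×1024 (k = 10) would therefore spend 41 bits on such a point, more than the 19 bits of its raw coordinates, although key points that lie close together share the upper levels of the tree.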