Visual Search (VS) refers to the capability of an automated system to identify an object or objects depicted in an image or in a sequence of images by analyzing only the visual aspects of the image or the sequence of images, without exploiting any external data, such as textual descriptions, metadata, etc. Augmented Reality (AR) can be considered an advanced usage of VS. After objects depicted in an image or in a sequence of images have been identified, additional content (e.g. synthetic objects) is superimposed on the real scene represented by the image or the sequence of images, thus ‘augmenting’ the real content; the position of the additional content is consistent with that of the real objects.
The predominant method of VS relies on determining so-called local features, which are referred to as descriptors in the literature and also hereinafter. The best-known methods are the Scale-Invariant Feature Transform (SIFT) as described by D. Lowe in “Distinctive Image Features from Scale-Invariant Keypoints”, Int. Journal of Computer Vision 60 (2) (2004) 91-110, and Speeded Up Robust Features (SURF) as described by H. Bay, T. Tuytelaars, L. V. Gool in “SURF: Speeded Up Robust Features”, in: Proceedings of European Conference on Computer Vision (ECCV), Graz, Austria, 2006, http://www.vision.ee.ethz.ch/~surf/. Many variations of these technologies can be found in the literature, and they can be considered improvements of the two original technologies.
As can be seen from FIG. 7, a local feature is a compact description, e.g. 128 Bytes for each local feature in SIFT, of a patch 703 surrounding a key point 705 in an image 701. FIG. 7 shows an example of extraction (upper part of FIG. 7) and representation (lower part of FIG. 7) of local features. In the upper part of FIG. 7, the position of the point where the local feature is computed is indicated by a circle representing the point 705 in the image 701, the circle being surrounded by a square representing the oriented patch 703. In the lower part of FIG. 7, a grid 709 subdividing the patch 703 contains histogram components 711 of the local feature. In order to compute a local feature, a main orientation 707 of the point 705 is computed based on the main gradient component in the surroundings of the point 705. Starting from this orientation 707, a patch 703 oriented towards the main orientation 707 is extracted. This patch 703 is then subdivided into a rectangular or radial grid 709. For each element of the grid 709, a histogram 711 of the local gradients is computed. The histograms 711 computed for the grid 709 elements represent the components of the local feature. Such a descriptor 713, containing the histograms 711 of the grid 709 elements as illustrated in the lower part of FIG. 7, is characterized by being invariant to rotation, illumination, and perspective distortions.
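By way of illustration only, the computation outlined above (main orientation, grid subdivision, per-element gradient histograms) may be sketched as follows. This is a simplified sketch and not the SIFT or SURF reference implementation: the function names, the 36-bin orientation histogram, the central-difference gradient and the final normalization are assumptions, and the 4 by 4 grid with 8 bins per element is chosen only so that the result has 128 components. Instead of extracting a rotated patch, the sketch measures gradient angles relative to the main orientation, which is a further simplification.

```python
import math

def gradient(patch, y, x):
    """Central-difference gradient (dx, dy) at an interior pixel."""
    dx = patch[y][x + 1] - patch[y][x - 1]
    dy = patch[y + 1][x] - patch[y - 1][x]
    return dx, dy

def main_orientation(patch, bins=36):
    """Centre angle (radians) of the strongest bin of a
    magnitude-weighted gradient orientation histogram."""
    hist = [0.0] * bins
    n = len(patch)
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx, dy = gradient(patch, y, x)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += math.hypot(dx, dy)
    peak = max(range(bins), key=hist.__getitem__)
    return (peak + 0.5) * 2 * math.pi / bins

def describe(patch, grid=4, bins=8):
    """Concatenate one gradient histogram per grid element; angles are
    measured relative to the main orientation (assumed simplification)."""
    n = len(patch)
    theta = main_orientation(patch)
    cell = (n - 2) / grid  # interior pixels per grid element
    desc = [0.0] * (grid * grid * bins)
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx, dy = gradient(patch, y, x)
            angle = (math.atan2(dy, dx) - theta) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            gy = min(int((y - 1) / cell), grid - 1)
            gx = min(int((x - 1) / cell), grid - 1)
            desc[(gy * grid + gx) * bins + b] += math.hypot(dx, dy)
    # Assumed normalization to unit length; guards against an all-zero patch.
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]
```

For a square patch given as a list of rows of intensity values, `describe(patch)` returns a list of 128 histogram components with the default parameters.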
In an image 701, the points 705 upon which local features 713 are computed identify distinct elements of the scene, e.g. corners, specific patterns, etc. Such points are normally called key points 705, also referred to as points of interest 705. The circles depicted in the upper part of FIG. 7 show exemplary key points 705. The x/y position of the key points 705 in the image will be referred to hereinafter as location information of the local feature.
MPEG is currently defining a new part of MPEG-7 (ISO/IEC 15938-Multimedia content description interface), part 13, Compact Descriptors for Visual Search (CDVS), dedicated to the development of a standard for Visual Search. The standard aims at defining a normative way to compress the amount of information enabling Visual Search, in order to minimize network delay and overall bitrate. In particular, the technology being standardized encompasses a compression mechanism for two kinds of information related to individual key points 705, hereinafter referred to as feature information: on the one hand the content information, i.e. the local feature or descriptor providing a compact description of the patch 703 surrounding the key point 705, and on the other hand the location information, i.e. the position of the key point 705.
In the CDVS standardization process, six operating points have been defined for testing purposes. The operating points, which are hereinafter referred to as bitrates, have the following numbers of Bytes per image: 512, 1024, 2048, 4096, 8192 and 16384. Each operating point indicates the total bitrate used to represent all the local features extracted from an image together with their location information. This means that, depending on the bitrate, only a limited number of local features can be encoded. This number ranges from 114 local features at the lowest operating point of 512 Bytes to 970 local features at the highest operating point of 16384 Bytes.
The standardization process has currently reached the Core Experiments phase, in which a reference implementation is realized on top of a Reference Model (RM).
The RM location information compression method, as described by Tsai et al. in “Location Coding for Mobile Image Retrieval” at Mobimedia 2009 and as defined by the standardization in “Test Model of Compact Descriptor for Visual Search” (MPEG doc w13145) in October 2012, works as follows. In the first step, the key point coordinates, originally computed as floating-point values, are downscaled to a certain resolution, e.g. VGA in the standard, and rounded to integer values in the new resolution. After this step, the location information can be represented as a very sparse matrix, as can be seen from FIG. 8. In the second step, a spatial grid with a pre-defined block size is superimposed on the matrix, and histograms of the occurrences of non-null values in each block are computed, as can be seen from FIG. 8. From this representation, two different kinds of information are encoded. The first is a Histogram map, which represents binary information about the presence or absence of key points in each block. The second is a Histogram count, which represents the number of occurrences in each non-null block.
The key point coordinates are represented as floating-point values in the original non-scaled image resolution. Since the first operation applied to every image is downscaling to VGA resolution, the key point coordinates are rounded to integer values in VGA resolution. Therefore, it might happen that several points are rounded to the same coordinates. It is also possible to have two descriptors computed on exactly the same key point with two different orientations. This first rounding has a negligible impact on the retrieval performance.
FIG. 8 depicts an example of such a rounding operation, where each square block 803 corresponds to a 1×1 pixel cell at full resolution. An image 800 can be created, where non-null pixels correspond to the position of the key points, and then partitioned into a block representation 801 which can be represented by a matrix representation 802. Values of these square blocks 803, 805, e.g. 2 for the first square block 803 and 1 for the second square block 805 as depicted in FIG. 8, are represented in a matrix 802, where non-null elements 807, 809 represent the key points' positions, e.g. a first non-null element 807 corresponding to the first block 803 and a second non-null element 809 corresponding to the second block 805. Consequently, the problem can be reformulated as the need to compress a matrix 802 of 640×480 elements that is extremely sparse, i.e. has fewer than 1000 non-null cells, even at the highest operating point. For compressing this matrix, two different kinds of information need to be represented: a histogram map (herein also referred to as map of location information), that is, a binary map of empty and non-empty cells, and a histogram count, that is, a vector containing the number of occurrences in each non-null cell. The histogram map is represented by the binary format of the block representation 801 depicted in FIG. 8, and the histogram count is represented by the vector formed by the non-null elements of the matrix representation 802 depicted in FIG. 8. For improving compression efficiency, these two elements are always encoded separately in the literature.
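By way of illustration only, the two steps described above (downscaling and rounding of the key point coordinates, followed by derivation of the histogram map and the histogram count) may be sketched as follows. The function names, the clamping of coordinates to the image border and the raster order in which the histogram count is collected are assumptions made for this sketch; they are not taken from the RM.

```python
def quantize_keypoints(points, src_size, dst_size=(640, 480)):
    """Scale floating-point (x, y) key points from src_size to dst_size
    (VGA by default) and round them to integer coordinates."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    out = []
    for x, y in points:
        qx = min(int(round(x * sx)), dst_size[0] - 1)  # clamp to the last column
        qy = min(int(round(y * sy)), dst_size[1] - 1)  # clamp to the last row
        out.append((qx, qy))
    return out

def histogram_map_and_count(points, size=(640, 480), block=1):
    """Return (histogram map, histogram count) for integer key points.

    The map is a binary matrix of empty and non-empty blocks; the count is
    the vector of occurrences in the non-empty blocks (raster order assumed).
    """
    cols = -(-size[0] // block)  # ceiling division
    rows = -(-size[1] // block)
    hist = [[0] * cols for _ in range(rows)]
    for x, y in points:
        hist[y // block][x // block] += 1
    hmap = [[1 if v else 0 for v in row] for row in hist]
    hcount = [v for row in hist for v in row if v]
    return hmap, hcount

# Two key points of a 1280x960 image that round to the same VGA pixel.
pts = quantize_keypoints([(10.2, 20.4), (10.6, 20.9), (100.0, 50.0)], (1280, 960))
hmap, hcount = histogram_map_and_count(pts)
```

In this example the first two key points fall on the same pixel, so the histogram map contains two non-null cells while the histogram count contains the values 2 and 1, analogous to the blocks 803 and 805 of FIG. 8.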
In the RM, the Histogram count is encoded through plain single-model arithmetic coding. The Histogram map adopts the so-called sum-based arithmetic coding: each element is encoded through context-based arithmetic coding, the context being given by the number of non-null elements occurring in the spatial proximity of the element to be encoded. Normally, rectangular regions are adopted to compute the context. This approach aims at exploiting the tendency of the local features to concentrate in certain regions. The context changes according to the block size, because the block size leads to a different concentration of features, and according to the bitrate, because a different number of features is encoded at different bitrates. As a form of context-based arithmetic coding, the sum-based approach requires training on specific training datasets.
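By way of illustration only, the derivation of a sum-based context may be sketched as follows. The sketch does not implement an arithmetic coder; it only computes the context of each histogram map element and the ideal code length that a coder driven by a given probability table would approach. The raster scan, the placement of the rectangle (the five rows above the element, eleven columns wide), the treatment of positions outside the map as empty and all names are assumptions made for this sketch; the probability table `p_one` stands for values that would be obtained by training.

```python
import math

def sum_context(hmap, y, x, rows_above=5, half_width=5):
    """Number of non-null elements in the rectangle of already scanned
    elements above (y, x); positions outside the map count as empty."""
    width = len(hmap[0])
    total = 0
    for yy in range(max(0, y - rows_above), y):
        for xx in range(max(0, x - half_width), min(width, x + half_width + 1)):
            total += hmap[yy][xx]
    return total

def ideal_code_length(hmap, p_one, rows_above=5, half_width=5):
    """Sum of -log2(p) over all elements, where p_one[c] is the (trained)
    probability that an element with context c is non-null."""
    bits = 0.0
    for y, row in enumerate(hmap):
        for x, bit in enumerate(row):
            c = sum_context(hmap, y, x, rows_above, half_width)
            p = p_one[c] if bit else 1.0 - p_one[c]
            bits -= math.log2(p)
    return bits
```

With the default parameters the rectangle covers 5 by 11 elements, so `p_one` needs one entry for each context value from 0 to 55.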
The described prior art has two problems, namely memory allocation and need for training.
With respect to memory allocation, the CDVS standardization addressed highly memory-constrained environments, i.e. the method should be implementable using memory tables smaller than 128 KB, in order to facilitate, for example, hardware implementation on mobile devices. In the RM, the size of the rectangle for the sum-based context is 55 elements, i.e. 5 by 11. Therefore, the context used by the sum-based arithmetic coding can assume 56 values, i.e. values from 0 to 55. In addition, the RM adopts a circular scanning of the histogram map elements, starting from the center and moving towards the sides of the matrix. Therefore, the central region, where a rectangle of 55 elements has not yet been encoded, is encoded without context, using only single-model arithmetic coding. The probability value of this single model also needs to be signaled, giving a total of 57 elements to be signaled in order to optimally encode the histogram map at a certain block size and bitrate. The combination of block size and bitrate will be referred to as testing point hereinafter. Considering that each context value is stored using 4 bytes, 57 (context dimension)*4 (bytes per context value)*2 (0 and 1 probabilities) bytes are allocated for each testing point, i.e. for each bitrate at a certain block size. This results in a potentially significant amount of memory being required.
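By way of illustration only, the table size resulting from the figures above may be computed as follows. The constant and function names are assumptions made for this sketch, and the number of block sizes used in the example is hypothetical, since only the six bitrates are stated above.

```python
CONTEXT_ENTRIES = 57   # 56 sum-based context values plus one single-model value
BYTES_PER_VALUE = 4    # storage per context value
SYMBOLS = 2            # probabilities for 0 and for 1

def bytes_per_testing_point():
    """Table size for one combination of bitrate and block size."""
    return CONTEXT_ENTRIES * BYTES_PER_VALUE * SYMBOLS

def table_bytes(num_bitrates, num_block_sizes):
    """Total table size when every testing point has its own table."""
    return num_bitrates * num_block_sizes * bytes_per_testing_point()

per_point = bytes_per_testing_point()   # 57 * 4 * 2 = 456 bytes
total = table_bytes(6, 4)               # 6 bitrates, 4 block sizes (hypothetical)
```

The total grows linearly with the number of testing points, so each additional bitrate or block size to be supported adds further tables of 456 bytes each.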
With respect to the need for training, in the method adopted by the RM, each testing point, i.e. each bitrate at a certain block size, needs to be trained. Unless the full context for each testing point is stored in specific tables, thus resulting in large tables, the encoder and the corresponding decoder need to be trained on the same training dataset to provide exactly the same results, which represents a problem for guaranteeing interoperability between encoders and decoders of different manufacturers or service providers.