1. Field of the Invention
The present invention relates to the field of image analysis.
2. Description of the Related Art
In the field of image analysis, a common operation is to compare two images in order to find the relation between them when both images include at least a portion of the same scene or of the same object.
Among its many applications, image comparison is of the utmost importance for calibrating the video cameras of a multi-camera system, for assessing the motion occurring between two frames of a video shot, and for recognizing an object within an image (e.g., a picture). The latter application is becoming increasingly important due to the recent development of object recognition algorithms specifically designed for so-called visual search engines, i.e., automated services that, starting from a picture, are capable of identifying the object(s) pictured therein and offering information related to the identified object(s). Examples of known services of this type include Google Goggles, Nokia Point&Find, and kooaba Smart Visuals. An object recognition application typically compares a first image (in jargon, the “query image”) depicting an object to be recognized with a plurality of model images, each one depicting a respective known object; this allows the object depicted in the query image to be compared with the objects depicted in the model images.
The model images are typically arranged in a suitable model database. For example, when object recognition is used in an online shopping scenario, each model image corresponds to an item offered by an online store (e.g., the picture of a book cover, a DVD cover and/or a CD cover). The number of model images included in a database of this type is quite high; for example, the model database of an online shopping service may include several million different model images.
A very efficient way of comparing two images is to select a set of points (in jargon, keypoints) in the first image and then match each keypoint of the set to a corresponding keypoint in the second image. The selection of which points of the first image become keypoints is advantageously carried out by extracting local features of the area of the image surrounding the point itself, such as the point extraction scale, the privileged orientation of the area, and the so-called “descriptor”. In the field of image analysis, a descriptor of a keypoint is a mathematical operator describing the luminance gradient of an area of the image (called a patch) centered at the keypoint, the patch being oriented according to its main luminance gradient.
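By way of illustration only, the matching of each keypoint of the first image to a corresponding keypoint of the second image may be sketched as a nearest-neighbor search over the descriptors. The text above does not prescribe a matching criterion; the squared Euclidean distance, the function names `squared_distance` and `match_keypoints`, and the two-element toy descriptors are assumptions made for this sketch, which also assumes that the second image has at least one descriptor.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_keypoints(
    first: Sequence[Sequence[float]],
    second: Sequence[Sequence[float]],
) -> List[int]:
    """For each descriptor of the first image, return the index of the
    closest descriptor of the second image (brute-force search)."""
    matches = []
    for descriptor in first:
        best = min(
            range(len(second)),
            key=lambda j: squared_distance(descriptor, second[j]),
        )
        matches.append(best)
    return matches


# Toy descriptors (hypothetical values; real descriptors are much longer).
first_image = [[0.0, 1.0], [5.0, 5.0]]
second_image = [[5.0, 4.0], [0.0, 0.9], [9.0, 9.0]]
print(match_keypoints(first_image, second_image))  # [1, 0]
```

A brute-force search of this kind compares every descriptor of the first image with every descriptor of the second image, so its cost grows with the product of the two keypoint counts.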
In “Distinctive image features from scale-invariant keypoints” by David G. Lowe, International Journal of Computer Vision, 2004, a Scale-Invariant Feature Transform (SIFT) descriptor is proposed; briefly, in order to allow reliable image recognition, the SIFT descriptors are generated taking into account that the local features extracted from the image corresponding to each keypoint should be detectable even under changes in image scale, noise and illumination. The SIFT descriptors are thus invariant to uniform scaling and orientation, and partially invariant to affine distortion and illumination changes.
The SIFT descriptor is a quite powerful tool, which allows keypoints to be selected for performing accurate image comparisons. However, this accuracy can be achieved only with the use of a quite large amount of data; for example, a typical SIFT descriptor is an array of 128 data bytes. Since the number of keypoints in each image is relatively high (for example, 1000-1500 keypoints for a standard VGA picture), and since each keypoint is associated with a corresponding SIFT descriptor, the overall amount of data to be processed may become too large to be managed efficiently.
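The figures given above can be combined into a rough estimate of the descriptor data per image. The following sketch simply multiplies the 128 bytes per descriptor by the 1000-1500 keypoints mentioned for a standard VGA picture; the function name `descriptor_payload_bytes` is hypothetical, and any further per-keypoint data (such as coordinates, scale and orientation) is deliberately left out of the count.

```python
DESCRIPTOR_BYTES = 128  # size of a typical SIFT descriptor, as stated above


def descriptor_payload_bytes(num_keypoints: int,
                             descriptor_bytes: int = DESCRIPTOR_BYTES) -> int:
    """Total size in bytes of the descriptors alone for one image."""
    return num_keypoints * descriptor_bytes


for num_keypoints in (1000, 1500):
    print(num_keypoints, descriptor_payload_bytes(num_keypoints))
# 1000 128000
# 1500 192000
```

Under these assumptions, the descriptors of a single picture amount to roughly 128,000 to 192,000 bytes.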
This drawback is exacerbated when the scenario involves the use of mobile terminals (e.g., identification of objects extracted from pictures taken by the camera of a smartphone). Indeed, since the operations to be performed for carrying out the image analysis are quite complex and demanding in terms of computational load, in this case most of the operations are usually performed at the server side; in order to have all the information required to perform the analysis, the server needs to receive from the mobile terminal all the required data, including the SIFT descriptors for all the keypoints. Thus, the amount of data to be transmitted from the terminal to the server may become too large to guarantee good efficiency of the service.
According to a solution known in the art, such as the one employed by Google Goggles, this drawback is solved at the root by directly transmitting the image, and not the descriptors, from the mobile terminal to the server. Indeed, because of the quite high number of keypoints, the amount of data of the corresponding SIFT descriptors may exceed the size (in terms of bytes) of a standard VGA picture itself.