In recent years, as digital cameras have become increasingly widespread and sophisticated, digital cameras and the devices that use them have received increasing attention as new information devices. In addition, the increase in the storage capacity of hard disks allows individuals to possess a large amount of image data. Accordingly, research dealing with large numbers of digital images or moving images is being actively conducted. One field of such research is the recognition of three-dimensional objects included in images.
Techniques for recognizing three-dimensional objects included in images can be classified into techniques that recognize the general class of an object and techniques that recognize the instance. The former return the class of an object, such as a chair or an automobile, as the result, whereas the latter identify the instance, such as a specific model of automobile. The present invention focuses on the latter, i.e., identification of the instance, and the description will be made in relation thereto. In particular, the present invention focuses on three-dimensional object recognition that uses local descriptors, for example those based on SIFT (Scale-Invariant Feature Transform) (e.g., see Non-Patent Literature 1). Among the conventional techniques, there is a technique that constructs a three-dimensional surface model of an object by matching local descriptors across images of the object shot from various angles, and uses the model for recognition (e.g., see Non-Patent Literatures 2 and 3). In addition, there is a technique that, without using a three-dimensional model, constructs a model from local descriptors extracted from images and matches it against unknown images (e.g., see Non-Patent Literatures 4 and 5). The present invention relates to the latter approach.
The simplest technique using such an approach extracts a large number of local descriptors from images of an object shot under various conditions and stores them to construct a model. Advantageously, this simple approach can easily realize highly accurate recognition. However, since a huge number of local descriptors are obtained, there are two problems: local descriptor matching takes an immense amount of time, and large-scale object recognition is difficult because a large amount of memory is required for recognition.
As to the former problem, it is indispensable to improve the efficiency of the nearest neighbor search of local descriptors. To solve this problem, there is a technique using approximate nearest neighbor search of local descriptors. Noguchi et al. reported that introducing this technique into object recognition makes it possible to realize high-speed, highly accurate object recognition (e.g., see Non-Patent Literature 6 and Patent Literature 1).
On the other hand, as to the latter problem, since the memory size of models (the memory required for models) accounts for a large proportion of the memory required for recognition, reducing the memory size of models is the main issue.
Meanwhile, among the three-dimensional object recognition techniques using local descriptors, those that do not construct three-dimensional models of objects are advantageous, since a model can be constructed simply by extracting local descriptors from shot images of an object. To achieve accurate three-dimensional object recognition with such simple techniques, a large number of images shot under various conditions are required for constructing a model. Generally, since several dozen to several thousand local descriptors are extracted from one image, an extremely large number of local descriptors are involved in modeling an object, and how to deal with such local descriptors is the main issue.
Most of the conventional techniques employ vector quantization of local descriptors, replacing them with representative vectors called visual words. When recognizing an unknown image, the local descriptors obtained from the image are replaced by the visual words and then matched. For identification of the instance of an object in particular, it is known that the recognition rate improves as the number of visual words increases, although the degree of improvement depends on the recognition target. For example, Nister et al. reported an example using 16 million visual words (see Non-Patent Literature 4). When a large number of visual words is used, the calculation time required for matching between the local descriptors and the visual words is not negligible, and thus speeding up by using data structures such as a tree structure is necessary (see Non-Patent Literatures 4 and 5).
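The replacement of local descriptors by visual words described above can be illustrated with a minimal sketch. It assumes that a set of visual words has already been obtained, and it uses a brute-force search with Euclidean distance rather than the tree structures of the cited literature. The function names `nearest_index` and `quantize` and the two-dimensional toy data are hypothetical and chosen only for illustration.

```python
import math

def nearest_index(vector, candidates):
    """Return the index of the candidate closest to `vector` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(vector, candidates[i]))

def quantize(descriptors, visual_words):
    """Replace each descriptor by the index of its nearest visual word."""
    return [nearest_index(d, visual_words) for d in descriptors]

# Toy 2-D data (assumption); SIFT descriptors are typically 128-dimensional.
visual_words = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 11.0), (0.5, 8.0), (-1.0, 1.0)]

word_ids = quantize(descriptors, visual_words)
print(word_ids)
```

Because every descriptor is compared with every visual word, the cost of this sketch grows with the product of the two counts, which is why a large vocabulary calls for a faster data structure.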
Among the techniques using such a large number of visual words, the most extreme is a technique that uses all “cases” of the local descriptors without vector quantization. With this approach, although a high recognition rate can be expected, a problem arises in that a huge amount of memory is required for recording the model.
The simplest recognition technique may be one in which a label indicating an object is attached to each of a large number of local descriptors, which correspond to the above cases, and, based on matching with the local descriptors obtained from unknown images, votes are cast for the label indicating the object. Normally, the matching is performed using nearest neighbor search. In such a process, since it is only necessary to assign a correct label to each local descriptor obtained from unknown images, it is not necessary to record all the local descriptors. Here, “voting” is processing used in the field of information processing for partially counting up evidence, in which a score is given to one of the choices based on each piece of obtained evidence, and the choice that has obtained the top score as a result of counting up the scores based on all the evidence is selected. Generally, the score for voting varies depending on the evidence.
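The labeling and voting process described above can be sketched as follows. This is a simplified illustration under stated assumptions: the model is a list of (descriptor, label) pairs, matching is an exact brute-force nearest neighbor search, and every match casts a uniform score of one vote, whereas the score generally varies depending on the evidence. The name `recognize` and the toy data are hypothetical.

```python
import math
from collections import Counter

def recognize(query_descriptors, model):
    """Vote for object labels.

    model: list of (descriptor, object_label) pairs.
    Each query descriptor casts one vote for the label of its nearest
    stored descriptor; the label with the top score is returned.
    """
    votes = Counter()
    for q in query_descriptors:
        _, label = min(model, key=lambda entry: math.dist(q, entry[0]))
        votes[label] += 1  # uniform score per vote (simplifying assumption)
    best_label, _ = votes.most_common(1)[0]
    return best_label, dict(votes)

# Toy 2-D model with two objects (assumption).
model = [
    ((0.0, 0.0), "cup"), ((1.0, 0.0), "cup"),
    ((10.0, 10.0), "car"), ((11.0, 9.0), "car"),
]
query = [(0.2, 0.1), (0.9, 0.3), (10.5, 9.5)]

best, tally = recognize(query, model)
print(best, tally)
```

Only the label returned for each query descriptor affects the tally, so any stored descriptor whose removal leaves those labels unchanged does not need to be recorded.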
As a method of eliminating unnecessary local descriptors while guaranteeing the same effect as in the case of recording all the local descriptors, a method called condensing has been proposed. For example, Wada et al. proposed a technique that is also efficiently applicable to higher-dimensional spaces (e.g., see Non-Patent Literature 7).
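To give a concrete idea of condensing, the following sketch uses Hart's classic condensed nearest neighbor rule. It is not the technique of Wada et al., and it offers a weaker guarantee: the retained subset labels every originally stored descriptor correctly, but the decision boundary for arbitrary queries may differ from that of the full set. The names `nearest_label` and `condense` and the toy data are hypothetical.

```python
import math

def nearest_label(point, prototypes):
    """Label of the prototype nearest to `point` (Euclidean distance)."""
    return min(prototypes, key=lambda entry: math.dist(point, entry[0]))[1]

def condense(samples):
    """Hart's condensed nearest neighbor rule (illustrative stand-in).

    samples: list of (descriptor, label) pairs.
    Repeatedly adds any sample that the current subset mislabels, until
    the subset labels every sample in `samples` correctly.
    """
    kept = [samples[0]]
    changed = True
    while changed:
        changed = False
        for sample in samples:
            if sample in kept:
                continue  # already retained; also guards against duplicates
            point, label = sample
            if nearest_label(point, kept) != label:
                kept.append(sample)
                changed = True
    return kept

# Toy 2-D data: two well-separated objects (assumption).
samples = [
    ((0.0, 0.0), "cup"), ((1.0, 0.0), "cup"), ((0.0, 1.0), "cup"),
    ((10.0, 10.0), "car"), ((11.0, 10.0), "car"), ((10.0, 11.0), "car"),
]

kept = condense(samples)
print(len(samples), "->", len(kept))
```

In this toy case one descriptor per object suffices, because the two clusters are far apart; descriptors near a boundary between objects are the ones such a rule tends to retain.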