Local descriptors such as SIFT (Scale-Invariant Feature Transform) can realize object recognition that is relatively robust to occlusion and to variation in lighting conditions, and thus currently attract attention (e.g., see Non-Patent Literatures 1 and 2). A model called “Bag of Words” or “Bag of Features” is basically used for recognition. In this model, locations or co-occurrences of the local descriptors are not considered; only the frequency of occurrence of the local descriptors is used for recognizing an object.
Here, the local descriptors represent local features of an image. The local descriptors are extracted through a predetermined procedure so as to have characteristics that are robust to variation (geometric transformation, lighting conditions, or variation of resolutions) of an image. In addition, because the local descriptors are determined from a local area of an image, the local descriptors are robust also to occlusion. In the present specification, the local descriptors are also referred to as feature vectors because the local descriptors are represented as vectors.
In general, the number of local descriptors extracted from an image is several hundred to several thousand, and sometimes reaches several tens of thousands. Therefore, an enormous amount of processing time is needed for matching the local descriptors, and an enormous amount of memory is needed for storing them. An important research subject is thus how to reduce the processing time and the amount of memory while keeping the recognition accuracy at a certain level.
For example, in SIFT, a typical local descriptor, each local descriptor is represented as a 128-dimensional vector. In addition, PCA-SIFT is known, which uses a vector whose dimension is reduced from that of SIFT by performing principal component analysis. Even so, the local descriptors used in a practical PCA-SIFT are, for example, 36-dimensional vectors. Moreover, the data type generally used for representing the value of each dimension is a 32-bit float type or integer type, as used for general numerical representations. When a higher accuracy is needed, a 64-bit double type is used. On the other hand, when only a limited range of values is used or when it is desired to reduce the amount of memory even at the cost of accuracy, a 16-bit short integer type can be used as a special case. Even in PCA-SIFT using a 36-dimensional vector with the short integer type to prioritize reduction of the amount of data, each local descriptor needs a memory of 16 bits × 36 dimensions = 576 bits (72 bytes).
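The per-descriptor memory figures follow from multiplying the number of dimensions by the bit width of the data type. The following is a minimal sketch of that arithmetic; the function name `descriptor_bits` and the table `BITS_PER_TYPE` are illustrative names, not part of any library.

```python
# Bit widths of the data types named in the text.
BITS_PER_TYPE = {"short": 16, "float": 32, "double": 64}


def descriptor_bits(dimensions: int, type_name: str) -> int:
    """Return the storage size, in bits, of one local descriptor."""
    return dimensions * BITS_PER_TYPE[type_name]


# Descriptor configurations mentioned in the text.
CASES = [
    ("SIFT", 128, "float"),
    ("PCA-SIFT", 36, "float"),
    ("PCA-SIFT", 36, "short"),
]

for name, dims, type_name in CASES:
    bits = descriptor_bits(dims, type_name)
    print(f"{name}, {dims} dimensions, {type_name}: {bits} bits ({bits // 8} bytes)")
```

For the 36-dimensional short integer case, this gives 576 bits, that is, 72 bytes per local descriptor.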
In general, nearest neighbor searching calculates the distance between vectors and determines the nearest local descriptor. It has been commonly considered that if an accuracy of data of each dimension is decreased, accurate nearest neighbor searching cannot be performed, and therefore, an accuracy (recognition rate) of recognition of an image is decreased.
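As an illustration of the distance calculation described above, the following is a minimal sketch of exhaustive nearest neighbor searching using the Euclidean distance. The function name `nearest_neighbor` and the two-dimensional sample vectors are assumptions chosen for brevity; actual local descriptors have 36 or 128 dimensions, and the source does not prescribe a particular distance measure.

```python
import math


def nearest_neighbor(query, database):
    """Return (index, distance) of the database vector closest to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of stored local descriptors.
    """
    best_index, best_distance = -1, math.inf
    for index, vector in enumerate(database):
        distance = math.dist(query, vector)  # Euclidean distance
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Toy two-dimensional vectors standing in for local descriptors.
database = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
index, distance = nearest_neighbor((2.0, 4.0), database)
print(index, distance)  # prints: 1 1.0
```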
Accordingly, many conventional techniques employ the following approach. Local descriptors obtained from images for constructing a model are vector-quantized, that is, clustered: the local descriptors are classified into a predetermined number of groups such that each group includes similar local descriptors, and each local descriptor in a group is then expressed by the representative value of that group. In this way, several thousand to several hundred thousand visual words (which correspond to the above representative values) are determined, and an image is described by using the visual words (e.g., see Non-Patent Literature 3). Upon recognition of an unknown image, local descriptors obtained from the image are converted into visual words, and their frequencies and the like are measured. In such an approach, if the number of visual words is sufficiently small, high-speed processing can be expected. On the other hand, it is pointed out that, if the number of visual words is not large, a sufficient recognition rate cannot be attained (e.g., see Non-Patent Literature 4). The larger the number of visual words is, the more difficult it is to ignore the calculation time needed for vector quantization. In addition, a problem arises with respect to the amount of memory for storing the visual words.
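The conversion of local descriptors into visual words and the measurement of their frequencies can be sketched as follows. This is a simplified illustration, not the method of any cited literature: the visual words are hand-picked rather than obtained by clustering, the vectors are two-dimensional, and the names `quantize` and `bag_of_words` are hypothetical.

```python
import math
from collections import Counter


def quantize(descriptor, visual_words):
    """Return the index of the visual word nearest to the descriptor."""
    return min(
        range(len(visual_words)),
        key=lambda i: math.dist(descriptor, visual_words[i]),
    )


def bag_of_words(descriptors, visual_words):
    """Count how often each visual word occurs among the descriptors.

    Locations and co-occurrences are discarded; only frequencies remain.
    """
    counts = Counter(quantize(d, visual_words) for d in descriptors)
    return [counts[i] for i in range(len(visual_words))]


# Assumed representative values (in practice, the result of clustering).
visual_words = [(0.0, 0.0), (10.0, 10.0)]
# Toy local descriptors extracted from one image.
descriptors = [(1.0, 0.0), (9.0, 11.0), (0.0, 2.0)]

histogram = bag_of_words(descriptors, visual_words)
print(histogram)  # prints: [2, 1]
```

Each call to `quantize` compares a descriptor with every visual word, which is why the calculation time for vector quantization grows with the number of visual words.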
The above advantage and problem are most prominent in the extreme case, that is, when individual local descriptors obtained from images for constructing a model are directly converted into visual words. For example, about two thousand local descriptors are extracted from a typical VGA-size image. Therefore, when a hundred thousand VGA-size images are used for constructing a model, the number of visual words is two hundred million, and an enormous amount of calculation resources is needed for matching and storage. Meanwhile, when a large number of local descriptors are used for a model, highly accurate recognition can be realized.
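The scale of this extreme case can be checked with a short calculation. The descriptor and image counts come from the text; the storage estimate is an added assumption that each descriptor is a 36-dimensional vector of 16-bit values (16 bits × 36 dimensions = 576 bits), with no index or other overhead.

```python
DESCRIPTORS_PER_IMAGE = 2_000  # approximate count for a VGA-size image
IMAGE_COUNT = 100_000          # images used for constructing the model

# Assumption: 36 dimensions x 16 bits per dimension, converted to bytes.
BYTES_PER_DESCRIPTOR = 36 * 16 // 8

total_descriptors = DESCRIPTORS_PER_IMAGE * IMAGE_COUNT
total_bytes = total_descriptors * BYTES_PER_DESCRIPTOR

print(f"{total_descriptors:,} descriptors")        # prints: 200,000,000 descriptors
print(f"{total_bytes / 1e9:.1f} GB of raw vectors")  # prints: 14.4 GB of raw vectors
```

Under these assumptions, the raw vectors alone occupy about 14.4 GB, even with the most compact representation mentioned in the text.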
One solution to the problem of processing time is to introduce “approximate nearest neighbor searching” in the matching of local descriptors (e.g., see Non-Patent Literature 5 and Patent Literature 1). It is known that, for example, when a recognition task of the above magnitude is to be performed, “approximate nearest neighbor searching” enables the processing time to be smaller than 10^-6 times the processing time taken for simply matching against all local descriptors, with almost no decrease in the recognition rate. On the other hand, one solution to the problem of the amount of memory is to perform vector quantization more coarsely. However, this solution is not necessarily preferable because the recognition rate decreases.
Citation List
Patent Literature
Patent Literature 1: International Publication WO2008/026414
Non-Patent Literature
Non-Patent Literature 1: D. Lowe, “Distinctive image features from scale-invariant keypoints”, International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004
Non-Patent Literature 2: J. Ponce, M. Hebert, C. Schmid, and A. Zisserman Eds., Toward Category-Level Object Recognition, Springer, 2006
Non-Patent Literature 3: J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos”, Proc. ICCV2003, vol. 2, pp. 1470-1477, 2003
Non-Patent Literature 4: D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree”, Proc. CVPR2006, pp. 775-781, 2006
Non-Patent Literature 5: Kazuto Noguchi, Koichi Kise, Masakazu Iwamura, “Efficient Recognition of Objects by Cascading Approximate Nearest Neighbor Searchers”, Meeting on image recognition and understanding (MIRU 2007) Collection of papers, pp. 111-118, July, 2007