1. Field of the Invention
The present invention relates to a local feature amount calculating device and a method of calculating a local feature amount, which are used for calculating a local feature amount of an image, and to a corresponding point searching apparatus and a method of searching for a corresponding point using the local feature amount calculating device.
2. Description of the Related Art
Conventionally, in the field of image processing, a corresponding point searching process for searching for corresponding points of two images is frequently used. FIG. 22 is a diagram illustrating corresponding points of two images. By searching for corresponding points of a plurality of images, it is possible to search an image database for an image corresponding to an input image or reconstruct the three-dimensional shape of a subject by associating a plurality of images of the subject photographed at different time points.
In order to search for corresponding points of two images, local feature amounts can be used. The local feature amount is an amount that characterizes the texture pattern of the peripheral area of a pixel of interest and is represented as a D-dimensional vector. Thus, pixels whose peripheral texture patterns are similar to each other have mutually similar local feature amounts. Accordingly, by acquiring the local feature amounts of the pixels of, for example, first and second images and comparing the local feature amounts with each other, it can be determined whether or not a pixel of the first image and a pixel of the second image are corresponding pixels (corresponding points).
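The comparison described above can be illustrated by the following sketch, which treats each local feature amount as a D-dimensional vector and compares them with the Euclidean distance; the function name, the threshold value, and the toy descriptors are illustrative choices and not part of any particular technique:

```python
import numpy as np

def is_corresponding(desc_a, desc_b, threshold=0.5):
    """Regard two pixels as corresponding points when the Euclidean
    distance between their D-dimensional local feature amounts is
    smaller than a threshold (illustrative value)."""
    return float(np.linalg.norm(desc_a - desc_b)) < threshold

# Toy 3-dimensional descriptors: a and b stand for pixels with similar
# peripheral texture patterns, c for a dissimilar one.
a = np.array([0.10, 0.90, 0.00])
b = np.array([0.15, 0.85, 0.00])
c = np.array([0.90, 0.10, 0.00])
```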
FIG. 23 is a diagram illustrating an image search using local feature amounts. By performing a corresponding point search using local feature amounts, an image search can be performed as below (for example, see “Improving bag-of-features for large scale image search”, International Journal of Computer Vision, 2010). The object of an image search here is to search an image database for an image corresponding to an input image. First, feature points are extracted from each one of a plurality of images registered in the image database, and the local feature amounts of the feature points are acquired. When an image (input image) as a search target is input, feature points are extracted from the input image, and the local feature amounts of the feature points are acquired. Then, one of the plurality of feature points of the input image is set as a feature point of interest, and the local feature amount that is the closest to the local feature amount of the feature point of interest is searched for in the image database. Then, one vote is given to the image to which the retrieved local feature amount belongs. This process is repeated for all the feature points included in the input image, and the image having the most votes is set as the search result.
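The voting procedure described above can be sketched as follows; the toy descriptors, the image identifiers, and the brute-force nearest-neighbor search are simplified stand-ins for a real image database:

```python
import numpy as np
from collections import Counter

def image_search(query_descs, db_descs, db_image_ids):
    """For each feature point of the input image, find the database
    descriptor nearest to it and give one vote to the image that the
    retrieved descriptor belongs to; the image with the most votes is
    the search result."""
    votes = Counter()
    for q in query_descs:
        nearest = int(np.argmin(np.linalg.norm(db_descs - q, axis=1)))
        votes[db_image_ids[nearest]] += 1
    return votes.most_common(1)[0][0]

# Toy database: four descriptors drawn from two images (ids 0 and 1).
db_descs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
db_image_ids = [0, 0, 1, 1]
# Three query descriptors: two resemble image 1, one resembles image 0,
# so image 1 collects the most votes.
query = np.array([[0.05, 0.95], [0.0, 0.9], [1.0, 0.0]])
```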
In addition, by performing the corresponding point search using local feature amounts, as below, a three-dimensional shape can be reconstructed from a plurality of images photographed at different viewpoints (for example, see “Unsupervised 3D object recognition and reconstruction in unordered datasets,” International Conference on 3-D Digital Imaging and Modeling (3DIM 2005)). First, two images are selected from a plurality of images acquired by photographing a photographing subject at a plurality of photographing positions, and corresponding points of the two images are acquired. This corresponding point search is performed for all combinations of the plurality of images. Next, a photographing position parameter of each image and a shape parameter of the photographing subject are acquired through bundle adjustment by using the corresponding point information as a clue.
As representative techniques for acquiring the local feature amounts, a scale invariant feature transform (SIFT) (for example, see “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 60, 2 (2004) or U.S. Pat. No. 6,711,293) and speeded up robust features (SURF) (for example, see Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, “SURF: Speeded Up Robust Features”, Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, 2008) are known. According to the SIFT or the SURF, even in a case where an image is rotated, or there is a difference in the scales of images, local feature amounts can be calculated that are invariant under (unaffected by) such changes. Accordingly, when corresponding points of two images are searched for, even in a case where an image is rotated, or there is a difference in the scales of the images, corresponding points of the images can be appropriately acquired.
Hereinafter, a technique for calculating local feature amounts will be described with reference to an example according to the SIFT. FIG. 24 is a flow diagram illustrating a process of calculating local feature amounts in accordance with the SIFT. In addition, FIGS. 25A to 27 are diagrams illustrating the process of calculating local feature amounts according to the SIFT. In the process of calculating local feature amounts, first, as illustrated in FIG. 25A, a feature point p and a near-field region NR set by using the feature point as its center are extracted from an image TP in Step S71. In a case where there is a plurality of feature points p, the near-field region NR is set for each feature point p. In the process of extracting the feature point p and the near-field region NR, the scale information of the image is output together with the position of the feature point p by a difference of Gaussians (DoG) filter. By cutting out the near-field region NR that has the feature point p as its center in accordance with the scale information, the scale invariance is realized.
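A minimal sketch of the DoG response used in Step S71 is shown below; the separable Gaussian blur, the square blob image, and the scale ratio k = 1.6 are illustrative simplifications (a full implementation would search for extrema of this response across a pyramid of scales to obtain both the position of the feature point p and the scale information):

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur built from two 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    rows = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")

def dog_response(image, sigma, k=1.6):
    """Difference-of-Gaussians response G(k*sigma) - G(sigma); its
    extrema over position and scale mark candidate feature points
    together with their scale."""
    return gaussian_blur(image, k * sigma) - gaussian_blur(image, sigma)

# A bright square blob: the DoG response is strong near the blob and
# vanishes far away from it.
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
resp = dog_response(img, sigma=2.0)
```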
Next, the direction of the main axis is acquired in Step S72. In the process of calculating the main axis, the following process is performed for each near-field region NR that is set for each feature point p. First, as illustrated in FIGS. 25C and 25D, for each pixel located inside the near-field region NR, differential values for the x and y directions are calculated, and an edge strength m(x, y) and an edge gradient direction θ(x, y) are acquired. Here, the edge strength m(x, y) is weighted in accordance with a Gaussian window G(x, y, σ) (see FIG. 25B) having the feature point p as its center so as to acquire a weighted edge strength mhat(x, y). Accordingly, as a pixel is located closer to the center of the near-field region NR, the pixel is regarded as including more important information.
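The per-pixel calculation of Step S72 up to the weighted edge strength can be sketched as follows; the finite-difference gradient and the Gaussian window parameter are illustrative choices:

```python
import numpy as np

def weighted_gradients(patch, sigma):
    """For each pixel of the near-field region, compute the edge
    strength m(x, y), the edge gradient direction theta(x, y), and the
    strength m_hat(x, y) weighted by a Gaussian window centred on the
    feature point (here taken to be the patch centre)."""
    gy, gx = np.gradient(patch.astype(float))   # finite differences
    m = np.hypot(gx, gy)                        # edge strength
    theta = np.arctan2(gy, gx)                  # edge gradient direction
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    g = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))
    return m, theta, m * g                      # m_hat: weighted strength

# A vertical step edge: the gradient points horizontally (theta = 0).
patch = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (4, 1))
m, theta, m_hat = weighted_gradients(patch, sigma=1.0)
```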
Next, a histogram of edge gradient directions is generated. Described in detail, the edge gradient direction of each pixel is quantized into 36 kinds, and a vote of the weighted edge strength mhat(x, y) thereof is given to the corresponding direction. By quantizing the edge gradient directions and giving votes of the weighted edge strengths for all the pixels included in the near-field region NR as above, a histogram of the edge gradient directions as illustrated in FIG. 25E is acquired. Next, the maximum value is detected from the histogram of the gradient directions, a quadratic function is fitted by using the values of the bins located on the left and right sides of the direction having the maximum value, and the direction corresponding to the local maximum of the fitted function is set as the main axis direction v.
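The histogram voting and the peak refinement by a quadratic fit can be sketched as below; the bin count of 36 follows the description above, while the three-point parabola formula and the function name are illustrative:

```python
import numpy as np

def main_axis_direction(theta, m_hat, n_bins=36):
    """Quantise the edge gradient directions into n_bins kinds, vote
    the weighted edge strengths m_hat into the corresponding bins, and
    refine the histogram maximum with a quadratic fit through the two
    neighbouring bins."""
    bin_width = 2.0 * np.pi / n_bins
    bins = np.round(theta / bin_width).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=m_hat.ravel(),
                       minlength=n_bins)
    k = int(np.argmax(hist))
    left = hist[(k - 1) % n_bins]
    right = hist[(k + 1) % n_bins]
    # Vertex of the parabola through (-1, left), (0, hist[k]), (1, right).
    denom = left - 2.0 * hist[k] + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    return ((k + offset) * bin_width) % (2.0 * np.pi)

# All gradients share one direction, so the main axis recovers it to
# within one bin width.
theta = np.full((4, 4), 0.5)
v = main_axis_direction(theta, np.ones((4, 4)))
```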
Next, rotation correction is made for the texture pattern located inside the near-field region in Step S73. In the rotation correction, the texture pattern located inside the near-field region NR is rotated such that the main axis direction v acquired in Step S72 coincides with a reference direction RD. FIG. 26 illustrates an example in which the reference direction is a horizontally rightward direction. The value of each pixel after rotating the near-field region NR is acquired through linear interpolation of the peripheral pixels. For the rotated texture pattern, a new near-field region NR′ is set.
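The rotation correction with linear (bilinear) interpolation of the peripheral pixels can be sketched as follows; the inverse-mapping formulation, the rotation sign convention, and the zero padding outside the region are illustrative simplifications:

```python
import numpy as np

def rotate_patch(patch, angle):
    """Rotate the texture pattern about the patch centre so that the
    main axis direction coincides with the reference direction; each
    output pixel value is obtained by bilinear interpolation of the
    four surrounding source pixels, and samples falling outside the
    region are set to zero."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    c, s = np.cos(angle), np.sin(angle)
    # Inverse mapping: for each output pixel, locate its source point.
    sx = c * (xx - cx) - s * (yy - cy) + cx
    sy = s * (xx - cx) + c * (yy - cy) + cy
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    fx, fy = sx - x0, sy - y0
    out = ((1 - fx) * (1 - fy) * patch[y0, x0]
           + fx * (1 - fy) * patch[y0, x0 + 1]
           + (1 - fx) * fy * patch[y0 + 1, x0]
           + fx * fy * patch[y0 + 1, x0 + 1])
    outside = ((sx < -1e-6) | (sx > w - 1 + 1e-6)
               | (sy < -1e-6) | (sy > h - 1 + 1e-6))
    out[outside] = 0.0
    return out

patch = np.arange(16, dtype=float).reshape(4, 4)
```

Rotating by zero leaves the pattern unchanged, and rotating by 180 degrees reverses it in both directions, which gives a quick sanity check of the interpolation.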
Next, a local feature amount is calculated in Step S74. In order to calculate the local feature amount, first, for each pixel located within the near-field region NR′ that is newly set for the rotation-corrected texture pattern, differential values for the x and y directions are calculated again, and, as illustrated in FIG. 27A, the edge gradient directions and the edge gradient strengths are acquired. In FIGS. 27A and 27B, the direction of each arrow indicates the edge gradient direction, and the length of the arrow indicates the edge strength.
Next, the edge gradient direction of each pixel is quantized into eight kinds, and, as illustrated in FIG. 27B, the near-field region is divided into 4×4 areas, thereby defining 16 vote cells VC. In the example illustrated in FIG. 27B, one vote cell VC is formed by 2×2 pixels. Then, for each vote cell VC, a histogram of the gradient directions of eight kinds is acquired. Through the calculation described above, a feature vector of 16×8=128 dimensions is acquired. By normalizing the length of the feature vector to one, the local feature amount is acquired.
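The descriptor construction of Step S74 can be sketched as below for a near-field region whose side is a multiple of four; the uniform cell layout (without the Gaussian weighting and interpolated voting used in the full SIFT descriptor) is a simplification:

```python
import numpy as np

def descriptor_128(theta, m, cells=4, n_bins=8):
    """Quantise each edge gradient direction into eight kinds, divide
    the region into 4x4 vote cells, build one eight-direction histogram
    of edge strengths per cell (16 x 8 = 128 dimensions), and normalise
    the resulting feature vector to unit length."""
    h, w = theta.shape
    ch, cw = h // cells, w // cells
    bins = np.round(theta / (2.0 * np.pi / n_bins)).astype(int) % n_bins
    vec = np.zeros(cells * cells * n_bins)
    for i in range(cells):
        for j in range(cells):
            b = bins[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            wgt = m[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            cell = i * cells + j
            vec[cell * n_bins:(cell + 1) * n_bins] = np.bincount(
                b, weights=wgt, minlength=n_bins)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# All gradients horizontal with unit strength: every cell votes into
# its bin 0, so the normalised vector has 16 equal nonzero entries.
d = descriptor_128(np.zeros((8, 8)), np.ones((8, 8)))
```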
As described above, the local feature amount that has the rotation invariance and the scale invariance can be acquired. Although the description presented above is the process of calculating the local feature amount in accordance with the SIFT, the SURF is based on a concept similar thereto.
According to the SIFT, there is a problem in that the speed at which the local feature amount is calculated is slow. In addition, although the SURF is a technique for improving the calculation speed of the SIFT, sufficient performance is not acquired when it is operated on low-specification hardware such as a mobile terminal. The slow calculation speed is considered to result from the following two factors.
The first reason is that, in the local feature amount calculating process according to the SIFT or the SURF, a differential value is calculated for each pixel included within the near-field region NR so as to acquire the main axis direction v, rotation correction is made for the texture pattern based on the acquired main axis direction v, and, for each pixel included within the new near-field region NR′ after the rotation correction, a differential value is calculated again so as to calculate the local feature amount. In other words, although the differential value calculation performed for acquiring the main axis and the differential value calculation performed for acquiring the local feature amount are similar processes, they are performed independently of each other.
The second reason is that the amount of calculation required for performing the rotation correction of the texture pattern is enormous. The rotation correction of the texture pattern requires a linear interpolation process in units of sub-pixels, which involves a large amount of floating-point calculation, whereby the overall amount of calculation becomes large.