In recent years, with advances in imaging elements, the number of pixels in digital still cameras and digital camcorders has increased. The increase allows a photographer to capture a scene as the photographer views it, with high resolution and a wide field of view. An image captured with such high resolution and wide field of view is called a giga-pixel image, having approximately one billion pixels, and research and development on giga-pixel images has been conducted.
Giga-pixel images pose different problems in capturing and in viewing. In capturing, there are problems such as blur caused by hand shake and difficulty in focusing. However, since a giga-pixel image has a wide field of view, capturing the image with a fixed (stationary) camera or camcorder can prevent these problems.
In viewing, on the other hand, it is difficult to find a desired portion of a giga-pixel image because of the huge number of pixels. Viewing the entire image at once requires a large display and therefore a system dedicated to the display. Conversely, if a common display is used to view a giga-pixel image, the image is reduced so much that a desired object cannot be found.
Under the above-described circumstances, there has been a demand for detecting a desired object in a giga-pixel image having a huge number of pixels, so that the object is found automatically and viewing becomes convenient.
Various methods have conventionally been conceived for detecting an object in an image. For example, Patent Literature 1 discloses a method that uses feature descriptors called Scale-Invariant Feature Transform (SIFT), and Non-Patent Literature 1 discloses a method that uses Speeded Up Robust Features (SURF).
SIFT is an algorithm for detecting feature points and describing local image features (hereinafter referred to as “features”). The algorithm is explained briefly below. SIFT includes “detection”, which covers scale and keypoint detection, and “description”, which covers orientation calculation and feature description. First, a scale and a keypoint are detected by using Difference of Gaussian (DoG). Next, an orientation of the keypoint is determined in order to obtain features invariant to rotation. Finally, based on the orientation, a feature of the keypoint is described. The details are as follows.
DoG processing is performed to detect a scale and a keypoint. In DoG, an image is filtered by two Gaussian filters having different dispersion coefficients, and the difference between the two filtered images is determined. The resulting difference image is called a DoG image. Filtering with a plurality of dispersion coefficients produces a plurality of DoG images. When the output values of the DoG images generated by changing the dispersion coefficient are observed at a certain point, the output values have an extremum. The extremum is correlated with the image size, so that when the image size is doubled, the extremum occurs in the DoG image generated with a doubled dispersion coefficient.
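The relation between object size and the dispersion coefficient at which the DoG output is extremal can be illustrated with a minimal one-dimensional sketch. It is not the method of Patent Literature 1: the function names, the ratio k = 1.6 between the two dispersion coefficients, the kernel cut-off at four standard deviations, and the Gaussian-shaped test blobs are all assumptions made for illustration.

```python
import math

def gaussian_blur_at(signal, i, sigma):
    """Gaussian-weighted average of `signal` around index i (kernel cut at 4*sigma)."""
    radius = int(math.ceil(4 * sigma))
    total = 0.0
    weight_sum = 0.0
    for d in range(-radius, radius + 1):
        j = min(max(i + d, 0), len(signal) - 1)  # clamp at the borders
        w = math.exp(-(d * d) / (2.0 * sigma * sigma))
        total += w * signal[j]
        weight_sum += w
    return total / weight_sum

def dog_at(signal, i, sigma, k=1.6):
    """DoG output at index i: blur with k*sigma minus blur with sigma (k is assumed)."""
    return gaussian_blur_at(signal, i, k * sigma) - gaussian_blur_at(signal, i, sigma)

def characteristic_sigma(signal, i, sigmas):
    """Dispersion coefficient whose DoG output at index i has the largest magnitude."""
    return max(sigmas, key=lambda s: abs(dog_at(signal, i, s)))

def blob(length, centre, size):
    """Hypothetical test object: a Gaussian-shaped bright blob of the given size."""
    return [math.exp(-((x - centre) ** 2) / (2.0 * size * size)) for x in range(length)]

sigmas = [0.5 * n for n in range(2, 33)]  # candidate values 1.0, 1.5, ..., 16.0
small = characteristic_sigma(blob(401, 200, 4.0), 200, sigmas)
large = characteristic_sigma(blob(401, 200, 8.0), 200, sigmas)
print(small, large)
```

Under these assumptions, the value found for the blob of doubled size is expected to be roughly twice the value found for the original blob, which is the behaviour the scale detection relies on.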
As described above, SIFT can describe a relative scale change by an extremum. Therefore, even if the actual size of an object is unknown, the scale of the object can be determined simply by locating extrema in the DoG images. Such a feature is called a scale-invariant feature, which is invariant to image enlargement or reduction. Here, since a plurality of extrema are found, any extremum that is not effective for recognition is eliminated based on contrast or the curvature of the intensity values.
Next, the orientation calculation is described. The orientation calculation is processing that determines a gradient strength and a gradient direction in the image from which a keypoint is detected. A weighted histogram is generated based on the gradient strengths and gradient directions around the keypoint. For example, the gradient direction is divided into 36 directions to form the histogram. Here, direction components whose values are 80% or more of the maximum value of the histogram are assigned as orientations of the keypoint.
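The orientation calculation can be sketched as follows. This is a simplified illustration under stated assumptions: gradients are taken by central differences, each sample is weighted only by its gradient strength (a Gaussian weighting around the keypoint, which SIFT implementations commonly add, is omitted), and the function names are hypothetical.

```python
import math

def orientation_histogram(patch, bins=36):
    """Histogram of gradient directions around a keypoint, weighted by gradient strength.

    `patch` is a 2-D list of intensity values; only interior pixels are used
    because central differences need both neighbours.
    """
    hist = [0.0] * bins
    bin_width = 360.0 / bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            strength = math.hypot(dx, dy)
            direction = math.degrees(math.atan2(dy, dx)) % 360.0
            hist[int(direction // bin_width) % bins] += strength
    return hist

def dominant_orientations(hist, ratio=0.8):
    """Directions (in degrees) whose bins reach `ratio` of the histogram maximum."""
    peak = max(hist)
    if peak == 0:
        return []
    bin_width = 360.0 / len(hist)
    return [i * bin_width for i, value in enumerate(hist) if value >= ratio * peak]

# Hypothetical patch whose intensity increases from left to right.
ramp = [[float(x) for x in range(9)] for _ in range(9)]
print(dominant_orientations(orientation_histogram(ramp)))
```

Because every bin at or above 80% of the maximum is returned, a keypoint may receive more than one orientation.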
A feature is described by using gradient information around the keypoint. First, in order to align the coordinate axes of the region around the keypoint with the orientation of the keypoint, the region is rotated so that the keypoint orientation points in the vertical direction. The rotated region is divided into 16 blocks arranged 4×4, and a gradient direction histogram with 8 directions (45 degrees each) is generated for each of the blocks. Since the feature is described for a region whose coordinate axes are aligned with the orientation, the feature is invariant to rotation.
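The block-wise description can be sketched as below, giving 4×4 blocks × 8 directions = 128 values. The sketch assumes the region has already been rotated to the keypoint orientation, uses central differences, and omits the weighting and interpolation that practical SIFT implementations apply. The final normalization to unit length and the function name are also assumptions not stated above.

```python
import math

def describe_region(region, grid=4, bins=8):
    """Concatenate one gradient direction histogram per block of a rotated region.

    `region` is a square 2-D list with a one-pixel border around the
    grid*cell x grid*cell area that is actually described.
    """
    cell = (len(region) - 2) // grid          # block side length in pixels
    bin_width = 360.0 / bins
    descriptor = [0.0] * (grid * grid * bins)
    for y in range(1, 1 + cell * grid):
        for x in range(1, 1 + cell * grid):
            dx = region[y][x + 1] - region[y][x - 1]
            dy = region[y + 1][x] - region[y - 1][x]
            direction = math.degrees(math.atan2(dy, dx)) % 360.0
            b = int(direction // bin_width) % bins
            block = ((y - 1) // cell) * grid + (x - 1) // cell
            descriptor[block * bins + b] += math.hypot(dx, dy)
    # Assumed extra step: scale the vector to unit length.
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm else descriptor

# Hypothetical 18x18 region (16x16 described area plus border).
region = [[float(x) for x in range(18)] for _ in range(18)]
descriptor = describe_region(region)
print(len(descriptor))
```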
SIFT descriptors are determined from an image as described above. However, the method has the problem that the DoG processing requires a large amount of calculation. In order to address this problem, SURF features have been conceived.
In SURF, extrema are determined not from DoG images but by using Fast-Hessian detectors. A Fast-Hessian detector is obtained by approximating a Hessian matrix with rectangular filters.
The determinant is calculated for Hessian matrices having different resolutions. Here, an integral image is used to determine the Hessian matrices, so that the sum of pixel values over a rectangle of any size is calculated by adding and subtracting the integral-image values at four points.
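The four-point calculation can be sketched as follows. The sketch covers only the rectangle sum, not the Hessian approximation itself. The extra row and column of zeros and the half-open rectangle convention are implementation choices assumed here, and the function names are hypothetical.

```python
def integral_image(img):
    """Integral image with an extra zero row and column.

    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 from four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))  # 5 + 6 + 8 + 9 = 28
```

The cost of `box_sum` does not depend on the rectangle size, which is why rectangular filters of different sizes can be evaluated at the same speed.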
The integral image is described below. For ease of understanding, the integral image is explained for a one-dimensional image. It is assumed that a one-dimensional image is stored as shown in FIG. 27A. Under this assumption, an integral image is generated as follows. As shown in FIG. 27B, the value of the i-th pixel in the integral image is calculated by adding the pixel value of the i-th pixel of the original image to the sum of the pixel values of the first to (i−1)th pixels. The sum of the second to sixth elements of the original image is then the difference between the sixth element and the first element of the integral image (FIG. 27C).
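The one-dimensional case can be reproduced with a short sketch. The pixel values are hypothetical and are not those of FIG. 27A, and the function names are assumptions. Note that the code uses 0-based indices, so the second to sixth elements are indices 1 to 5.

```python
from itertools import accumulate

def integral_1d(pixels):
    """Running sum: element i holds pixels[0] + ... + pixels[i]."""
    return list(accumulate(pixels))

def range_sum(integral, first, last):
    """Sum of pixels[first..last] (0-based, inclusive) from at most two lookups."""
    below = integral[first - 1] if first > 0 else 0
    return integral[last] - below

pixels = [3, 1, 4, 1, 5, 9, 2]          # hypothetical one-dimensional image
integral = integral_1d(pixels)          # [3, 4, 8, 9, 14, 23, 25]
total = range_sum(integral, 1, 5)       # sixth element minus first element
print(total)                            # 23 - 3 = 20
```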
As described above, SURF differs from SIFT in that no DoG images are required. Therefore, SURF is faster than SIFT. However, SURF needs to store sums of pixel values (namely, an integral image), so that a large memory capacity is necessary. For example, the following describes the situation where an integral image is generated for pixels arranged in one dimension. If the pixel value of one pixel is 1 byte, the second element requires 2 bytes and the third element requires 3 bytes in the integral image. In this way, the required memory capacity gradually increases as the number of pixels increases. In the case of a giga-pixel image, the 1G-th element requires a memory amount of 1G bytes. Furthermore, if the same amount is allocated to every element, a memory capacity of 1G bytes×1G is eventually required. Therefore, application of SURF to giga-pixel images is not realistic.
In order to address the above problem, methods requiring a smaller memory amount and a smaller processing amount have been disclosed in Patent Literatures 2, 3, and 4, and Non-Patent Literature 2. In these methods, an intensity gradient is determined from an image, and based on the gradient, an object is detected in an image for recognition (hereinafter referred to as a “recognition image”).