Technology that detects the position of a person shown in an image is expected to be used in a variety of applications, such as video monitoring systems, vehicle driving support systems, and automatic annotation systems for images and video, and has been the subject of extensive research and development in recent years.
In a scanning frame search type of detection method, an input image is finely raster-scanned using a variable-size rectangular scanning frame, an image feature within each scanning frame is extracted, and a discriminator trained separately offline determines whether or not a target object is shown in the scanning frame. Depending on the input image size, the number of scans per image ranges from tens of thousands to hundreds of thousands, and therefore the computation amount of the feature amount and of the discriminator's processing greatly affects the detection processing speed. Consequently, selection of a low-cost feature amount effective for discrimination of a target object is an important factor affecting detection performance, and various feature amounts have been proposed for individual detection target objects, such as faces, people, and vehicles.
Generally, a sliding window method is widely used as an object detection method (see Non-Patent Literature 1 and Patent Literature 1, for example). In a sliding window method, an input image is finely raster-scanned using a rectangular scanning frame (window) of a prescribed size, a feature amount is extracted from the image within each scanned window, and based on the extracted feature amount it is determined whether or not a detection target object, such as a person, is shown in that window. Objects of various sizes are detected by enlarging or reducing the window or the input image by a predetermined ratio. The above description refers to a still image, but the situation is similar for moving image processing that uses feature amounts in preceding and succeeding frames in the time domain, as in Non-Patent Literature 2, for instance.
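The scan described above can be sketched as follows. This is a minimal illustration only, not the method of the cited literature: the function name `sliding_windows`, the 64 x 128 base window, the stride of 8 pixels, and the enlargement ratio of 1.2 are all assumed values chosen for the example, and the window is enlarged rather than the image reduced.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale_step=1.2):
    """Yield (x, y, w, h) windows in input-image coordinates.

    The window is enlarged by scale_step at each level (assumed ratio)
    until it no longer fits inside the image.
    """
    scale = 1.0
    while True:
        w = int(round(win_w * scale))
        h = int(round(win_h * scale))
        if w > img_w or h > img_h:
            break
        # The stride grows with the window so that overlap stays roughly constant.
        step = max(1, int(round(stride * scale)))
        for y in range(0, img_h - h + 1, step):      # raster scan: rows first
            for x in range(0, img_w - w + 1, step):  # then columns
                yield (x, y, w, h)
        scale *= scale_step


# Example: count the windows generated for a 640 x 480 input image.
num_scans = sum(1 for _ in sliding_windows(640, 480))
```

A smaller stride or a finer enlargement ratio increases the number of scans, which is why the per-window cost of feature extraction and discrimination dominates the processing time.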
One important factor affecting detection accuracy is the feature amount used in determining whether or not an object is a person, and various feature amounts have hitherto been proposed. A typical feature amount is the histogram of oriented gradients (hereinafter referred to as “HOG”) feature amount proposed by Dalal et al. in Non-Patent Literature 1. An HOG is a feature amount obtained by dividing a window image of a prescribed size into small areas and creating a histogram of edge direction values within each local area. An HOG captures a silhouette of a person by using edge direction information, tolerates local geometric changes by extracting a histogram feature for each small area, and is shown in Non-Patent Literature 1 to achieve excellent detection performance even on the INRIA data set, which includes various attitudes.
Patent Literature 1 proposes an improvement on the method in Non-Patent Literature 1. In Non-Patent Literature 1, an input window image is divided into small areas of a fixed size and an edge direction histogram is created from each of those small areas, whereas in Patent Literature 1 a method is proposed whereby various feature amounts are provided by making the small area size variable, and furthermore an optimal feature amount combination for discrimination is selected by means of boosting.
Non-Patent Literature 3 is another improvement on the method in Non-Patent Literature 1. In Non-Patent Literature 1, edge directions are quantized into eight or nine directions, and an edge direction histogram is created for each angle. Non-Patent Literature 3 proposes co-occurrence histograms of oriented gradients (hereinafter referred to as “coHOG”) features, in which, in addition to the edge direction value of each pixel, combinations of edge directions between two pixels are used, and a histogram is created for each of 30 offset positional relationships.
FIG. 1 is a drawing explaining an HOG feature amount and coHOG feature amount. FIG. 1A shows an input image that is a scanning frame image, FIG. 1B shows an edge image, and FIG. 1C shows edge gradient histogram features.
An HOG and coHOG both extract a feature amount from an edge image calculated from brightness I of an input image. An edge image comprises edge gradient θ and edge magnitude mag, and is found by means of equations 1 below.
[1]
$$
\begin{aligned}
d_x(x, y) &= I(x+1, y) - I(x-1, y) \\
d_y(x, y) &= I(x, y+1) - I(x, y-1) \\
\mathrm{mag}(x, y) &= \sqrt{d_x(x, y)^2 + d_y(x, y)^2} \\
\theta(x, y) &= \tan^{-1} \frac{d_y(x, y)}{d_x(x, y)}
\end{aligned}
\qquad \text{(Equations 1)}
$$
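As an illustration of equations 1, the edge image can be computed as in the following sketch. The function name `edge_image`, the list-of-rows image representation, and the handling of border pixels (left at zero) are assumptions made for the example. The sketch also uses `math.atan2` instead of a plain arctangent of the quotient, which is a substitution made here to avoid division by zero and to obtain a direction over the full range of 0 to 360 degrees.

```python
import math


def edge_image(I):
    """Return (mag, theta) for brightness image I, given as a list of rows.

    theta is in degrees in [0, 360). Border pixels have no two-sided
    neighbours and are left at 0 (an assumption of this sketch).
    """
    h, w = len(I), len(I[0])
    mag = [[0.0] * w for _ in range(h)]
    theta = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = I[y][x + 1] - I[y][x - 1]   # d_x(x, y)
            dy = I[y + 1][x] - I[y - 1][x]   # d_y(x, y)
            mag[y][x] = math.hypot(dx, dy)   # sqrt(dx^2 + dy^2)
            # atan2 replaces tan^-1(dy/dx); see the note above.
            theta[y][x] = math.degrees(math.atan2(dy, dx)) % 360.0
    return mag, theta
```

For a direction range of 0 to 180 degrees, the returned value can be reduced with `theta % 180.0`.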
An edge image found in this way is divided into a predetermined number B of small areas, and edge gradient histogram F_b is found for each small area. Elements of the gradient histogram of each small area are taken as respective feature dimensions, and the multidimensional feature vector linking all of these is taken as feature amount F. Edge gradient histogram F_b is shown by equations 2 below.
[2]
$$
F = \{F_0, F_1, \ldots, F_{B-1}\}, \qquad F_b = \{f_0, f_1, \ldots, f_{D-1}\}, \quad b \in [0, B-1] \qquad \text{(Equations 2)}
$$
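The structure of equations 2 can be sketched as follows. The function name `block_histograms` and the assumption that the small areas form a regular grid of equal-sized blocks are illustrative only. The input is assumed to be an image of already quantized direction indices, and simple pixel counts are used here rather than weighted counts.

```python
def block_histograms(bins, rows, cols, D):
    """Concatenate per-area direction histograms into feature amount F.

    bins: 2-D list of quantized direction indices in [0, D).
    The image is split into rows x cols equal areas (B = rows * cols),
    an assumed layout; the result has B * D dimensions.
    """
    h, w = len(bins), len(bins[0])
    bh, bw = h // rows, w // cols          # size of one small area
    F = []
    for by in range(rows):
        for bx in range(cols):
            Fb = [0] * D                   # histogram of small area b
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    Fb[bins[y][x]] += 1
            F.extend(Fb)                   # link the histograms in area order
    return F
```

With this layout the dimension count of F grows linearly with both the number of small areas and the number of quantized directions.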
With an HOG, edge gradient values converted to 0 to 180 degrees are divided into nine directions and quantized, and a gradient histogram is calculated with an edge magnitude value as a weight. With a coHOG, edge gradient values of 0 to 360 degrees are divided into eight directions and quantized, and a histogram is calculated for each combination of gradient values of offset pixels of 30 surrounding points with each pixel within a local area as a reference point pixel. With a coHOG, an edge magnitude value is used for edge noise removal, and for pixels for which an edge magnitude value is greater than or equal to a threshold value, a number of events is counted for each gradient direction and for each gradient direction combination.
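The two histogram constructions described above can be contrasted in a short sketch. The function names, the default threshold value, and the requirement that both pixels of a pair reach the threshold are assumptions made for the example. The 30 offsets themselves are not specified in this description, so they are passed in by the caller as `(dx, dy)` tuples.

```python
def hog_histogram(mag, theta, Q=9):
    """Magnitude-weighted histogram over directions folded to 0-180 degrees."""
    hist = [0.0] * Q
    width = 180.0 / Q
    for row_m, row_t in zip(mag, theta):
        for m, t in zip(row_m, row_t):
            q = int((t % 180.0) // width) % Q
            hist[q] += m                    # the edge magnitude is the weight
    return hist


def cohog_histograms(mag, theta, offsets, Q=8, threshold=1.0):
    """One Q x Q co-occurrence count matrix per offset, directions 0-360 degrees.

    Pixels below the threshold are skipped as edge noise. Requiring the
    offset pixel to reach the threshold as well is an assumption.
    """
    h, w = len(mag), len(mag[0])
    width = 360.0 / Q

    def bin_of(t):
        return int((t % 360.0) // width) % Q

    hists = {off: [[0] * Q for _ in range(Q)] for off in offsets}
    for y in range(h):
        for x in range(w):
            if mag[y][x] < threshold:
                continue
            for (ox, oy) in offsets:
                x2, y2 = x + ox, y + oy
                if 0 <= x2 < w and 0 <= y2 < h and mag[y2][x2] >= threshold:
                    hists[(ox, oy)][bin_of(theta[y][x])][bin_of(theta[y2][x2])] += 1
    return hists
```

With Q = 8 and 30 offsets, each local area contributes 30 matrices of 8 x 8 counts, which indicates how much larger a coHOG feature is than the nine-element HOG histogram of the same area.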
FIG. 2 is a drawing showing a conventional feature amount calculation method represented by Non-Patent Literature 1, Patent Literature 1, and Non-Patent Literature 3.
As shown in FIG. 2, feature amount calculation apparatus 10 is provided with feature value calculation section 12, histogram feature configuration section 13, discriminant function 14, and determination section 15.
When image data 11 (see FIG. 2a) is provided as input, feature value calculation section 12 first divides image data 11 into small areas (see FIG. 2b), and extracts edge data (see FIGS. 2c and 2d). FIG. 2 shows an example in which feature value calculation section 12 focuses attention on small area k of the thick-frame part in FIG. 2c, and calculates an edge magnitude value and an edge direction value for small area k. As edge data, the edge direction range (0 to 180 degrees or 0 to 360 degrees) is divided into Q parts, and edge direction values quantized into Q directions are used. A value of 8 or 9 is generally set for Q.
Next, histogram feature configuration section 13 counts the pixels included in each local area into a histogram by edge direction value. Histogram feature configuration section 13 then links the edge direction value histograms of all local areas and creates a feature vector (see FIG. 2e). In FIG. 2, histogram feature configuration section 13 creates a feature vector for local area k of the thick-frame part in FIG. 2c.
Determination section 15 determines whether or not the feature vector created in this way for the input image data represents a target object, using discriminant function 14 created beforehand by means of offline learning processing, and outputs the result.
A window image used in human detection generally allows for variation in a person's attitude, and therefore includes not only a person area, whose edges against the background are used as edge data, but also a background area (see input image data 11 in FIG. 2, for example).
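The determination step can be illustrated with a short sketch. The form of discriminant function 14 is not specified in this description, so a linear function with weights and a bias obtained offline is assumed here purely for illustration, and the name `is_target` is hypothetical.

```python
def is_target(feature, weights, bias=0.0):
    """Apply an assumed linear discriminant function to a feature vector.

    Returns (decision, score), where decision is True when the score
    is positive. The weights and bias stand for the result of offline
    learning and are inputs to this sketch, not learned here.
    """
    score = sum(w * f for w, f in zip(weights, feature)) + bias
    return score > 0.0, score


# Example with made-up numbers: a two-dimensional feature vector.
decision, score = is_target([4.0, 0.0], [0.5, -1.0], bias=-1.0)
```

Because this evaluation is repeated for every scanned window, the dimension count of the feature vector directly determines the cost of the determination step.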