The field of computer vision often requires classifiers that are trained to detect objects such as faces and people, with a view to enabling applications that interact with people and real-world objects. A variety of classifiers exist, as computer vision researchers are consistently seeking more resource-efficient methods for accurately locating and identifying various objects in images.
One known method of identifying a particular class of object, described in FIG. 1, uses histograms of oriented gradients (HoG) in conjunction with training images and a learning system. HoG has been used to detect humans against a variety of backgrounds, as well as faces, animals, vehicles, and other objects. Because HoG uses a relatively compact reference descriptor, it has been successfully used in real-time to classify objects in streaming video. It has also been demonstrated to enable robust detection in the presence of rotations, scaling, and variations in terms of lighting conditions.
FIG. 1 illustrates a process 100 known in the art for classifying objects in images using HoG in conjunction with a support vector machine (SVM) algorithm—fittingly referred to in the art as HoG/SVM. The process as described uses the parameters identified by Dalal and Triggs in their 2005 paper: “Histograms of oriented gradients for human detection,” International Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 886-893, June 2005, which is herein incorporated by reference in its entirety.
First, gradient values are calculated for each pixel within a particular cell (step 102 in FIG. 1). As shown in FIG. 2, which illustrates the process, a defined rectangular HoG detection window 202 is applied to a portion of the image and divides the pixels into discrete cells 204. An HoG cell 204 may, for example, comprise 8 pixels on each side for a total of 64 (8-by-8) pixels per cell 204, although larger or smaller cell sizes may be chosen in some implementations. For each pixel in a cell 204, a magnitude and an orientation of the gradient are calculated. A variety of filters may be applied to calculate these values. For example, in one implementation, the magnitude of the gradient |G| may be given according to the intensity values of the adjacent pixels:
$$|G| = |G_x| + |G_y| = |G_{x+1} - G_{x-1}| + |G_{y+1} - G_{y-1}|.$$
The orientation θ may be given according to the arctangent of the ratio of the vertical y and horizontal x intensity differences:
$$\theta = \arctan\left(\frac{|G_{y+1} - G_{y-1}|}{|G_{x+1} - G_{x-1}|}\right).$$
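As an illustration only, the per-pixel gradient calculation described above may be sketched as follows. The function name `pixel_gradient` and the row-major list-of-lists image layout are assumptions made for this example and are not part of the process 100. Because the formulas as written use absolute differences, the resulting angle falls between 0° and 90°; an implementation covering 0° to 180° would presumably use signed differences, which is likewise an assumption.

```python
import math

def pixel_gradient(image, x, y):
    """Return (magnitude, orientation in degrees) for an interior pixel.

    `image` is a hypothetical row-major list of rows of intensity values;
    (x, y) must not lie on the image border.
    """
    # Central differences between the horizontally and vertically adjacent pixels.
    gx = image[y][x + 1] - image[y][x - 1]
    gy = image[y + 1][x] - image[y - 1][x]
    # Magnitude as the sum of absolute differences: |G| = |Gx| + |Gy|.
    magnitude = abs(gx) + abs(gy)
    # atan2 on the absolute values equals arctan(|Gy| / |Gx|) and also
    # handles the case Gx == 0 without a division by zero.
    orientation = math.degrees(math.atan2(abs(gy), abs(gx)))
    return magnitude, orientation

if __name__ == "__main__":
    patch = [[0, 0, 0],
             [10, 0, 30],
             [0, 40, 0]]
    print(pixel_gradient(patch, 1, 1))
```

Using `atan2` rather than an explicit division is a design choice of this sketch; it avoids a special case for vertical edges where the horizontal difference is zero.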
To create the histogram, the orientation angles θ are divided into a number of bins. In this example, the range of 0° to 180° is divided into nine bins of 20° each. Each gradient magnitude |G| is added to the bin associated with its orientation angle θ (step 104 in FIG. 1). The resulting HoG cell descriptor, illustrated as 206 in FIG. 2, has nine values, each with a minimum of zero and a maximum of 128 times the maximum pixel intensity value (64 pixels per cell, each contributing a magnitude of at most twice the maximum intensity).
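A minimal sketch of the binning step, given here for illustration, is shown below. The function name `cell_histogram` is hypothetical, and the hard assignment of each magnitude to a single bin (with no interpolation between neighboring bins) is an assumption drawn from the description as written.

```python
def cell_histogram(gradients, num_bins=9, max_angle=180.0):
    """Accumulate gradient magnitudes into orientation bins for one cell.

    `gradients` is an iterable of (magnitude, orientation_in_degrees) pairs,
    one pair per pixel in the cell.
    """
    bin_width = max_angle / num_bins  # 20 degrees for nine bins over 0..180
    histogram = [0.0] * num_bins
    for magnitude, angle in gradients:
        # An angle exactly equal to max_angle is clamped into the last bin.
        index = min(int(angle // bin_width), num_bins - 1)
        histogram[index] += magnitude
    return histogram

if __name__ == "__main__":
    print(cell_histogram([(10, 5.0), (20, 25.0), (5, 179.0)]))
```

With 8-bit intensities, a cell in which all 64 pixels have the largest possible magnitude (2 × 255) and fall into the same bin yields 64 × 510 = 32,640 in that bin, which matches the stated bound of 128 times the maximum pixel intensity.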
The cell descriptors 206 are then aggregated into block descriptors 210 (step 106 in FIG. 1), based on each 2-by-2 block 208 of four cells. Because every block 208 of cells is used, a cell 204 not on the edge of the window 202 will appear in four different blocks 208, and therefore its descriptor 206 will be included in four different block descriptors 210.
Each block descriptor 210, including the descriptors 206 of each of the four cells 204 in the block 208, is normalized according to the descriptors in that block (step 108 in FIG. 1). A variety of normalization algorithms can be used, many of which are discussed in the Dalal and Triggs 2005 paper referenced above. The result of this process is a normalized block descriptor 212 for each block 208, a set of histogram data representing 36 data elements per block (four cells of nine values each). Because the normalization depends on the values of the four descriptors 206 in a particular block descriptor 210, the normalized values associated with a particular cell 204 may be different in each normalized block descriptor 212 that includes that cell.
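By way of example, one of the normalization schemes discussed by Dalal and Triggs, the L2 norm, may be sketched as follows. The choice of the L2 norm, the function name `normalize_block`, and the value of the small constant `epsilon` are assumptions made for this illustration; the process described above does not mandate any particular scheme.

```python
import math

def normalize_block(cell_descriptors, epsilon=1e-6):
    """Concatenate the cell descriptors of one block and L2-normalize them.

    `cell_descriptors` is a list of four 9-value cell histograms; the result
    is a single list of 36 normalized values.
    """
    # Flatten the four cell histograms into one 36-element block vector.
    block = [value for cell in cell_descriptors for value in cell]
    # L2 norm with a small constant so an all-zero block does not divide by zero.
    norm = math.sqrt(sum(value * value for value in block) + epsilon ** 2)
    return [value / norm for value in block]

if __name__ == "__main__":
    cell = [1.0] * 9
    empty = [0.0] * 9
    # The same cell normalizes differently depending on its neighbors.
    print(normalize_block([cell, empty, empty, empty])[0])
    print(normalize_block([cell, cell, cell, cell])[0])
```

The two calls in the usage example show why the normalized values for one cell may differ between blocks: the same cell histogram is divided by a different norm depending on the other three cells in the block.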
For a 64-by-128 pixel window 202, the complete HoG descriptor 214 representing the normalized block descriptors 212 comprises 105 normalized blocks of histogram data: a total of 3,780 data values. This complete descriptor 214 is fed into the SVM classifier (step 110 in FIG. 1), which has previously evaluated training images according to the same parameters. The training images may be any appropriate set of training data for the objects being evaluated, such as the MIT and INRIA image data sets described in the Dalal and Triggs 2005 paper. Other publicly available or proprietary training images can also be used.
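The descriptor size stated above follows from the window, cell, and block parameters, as the short sketch below illustrates. The function name `hog_descriptor_size` is hypothetical, and the sketch assumes that blocks advance by one cell at a time, consistent with every block of cells being used.

```python
def hog_descriptor_size(window_w=64, window_h=128, cell=8, block_cells=2, bins=9):
    """Return (number of blocks, values per block, total descriptor values)."""
    cells_x = window_w // cell             # 8 cells across
    cells_y = window_h // cell             # 16 cells down
    # Overlapping blocks that advance one cell at a time.
    blocks_x = cells_x - block_cells + 1   # 7
    blocks_y = cells_y - block_cells + 1   # 15
    num_blocks = blocks_x * blocks_y
    values_per_block = block_cells * block_cells * bins
    return num_blocks, values_per_block, num_blocks * values_per_block

if __name__ == "__main__":
    print(hog_descriptor_size())  # (105, 36, 3780)
```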
The HoG computation is performed by repeatedly stepping a window, 64 pixels wide by 128 pixels high in the illustrated example, across a source image frame and computing the HoG descriptor as outlined in the previous section. Because the HoG calculation contains no intrinsic sense of scale and objects can occur at multiple scales within an image, the HoG calculation is stepped and repeated across each level of a scale pyramid.
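The stepping of the window across a single frame may be sketched as follows. The function name `window_positions` and the stride of 8 pixels are assumptions chosen for illustration; the step size is not specified above.

```python
def window_positions(frame_w, frame_h, window_w=64, window_h=128, stride=8):
    """List the top-left (x, y) corner of every window position in a frame."""
    if frame_w < window_w or frame_h < window_h:
        return []  # the frame cannot accommodate a complete window
    return [(x, y)
            for y in range(0, frame_h - window_h + 1, stride)
            for x in range(0, frame_w - window_w + 1, stride)]

if __name__ == "__main__":
    positions = window_positions(640, 480)
    print(len(positions))  # 73 columns x 45 rows = 3285 windows
```

Under these assumed parameters, a single 640-by-480 frame already requires 3,285 descriptor evaluations before any scaling is considered.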
FIG. 3 illustrates a window 302 being stepped across each level 304 of the scale pyramid 306. Each level 304 represents a further scaled-down copy of the image that is being scanned. The scaling factor between one level of the scale pyramid and the next is commonly 1.05 or 1.2. The image is repeatedly down-scaled until the scaled source frame can no longer accommodate a complete HoG window.
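For illustration, the construction of the scale pyramid levels may be sketched as below. The function name `pyramid_levels` and the truncation of scaled dimensions to whole pixels are assumptions of this example; only the level dimensions are computed, and no image resampling is performed.

```python
def pyramid_levels(frame_w, frame_h, scale=1.2, window_w=64, window_h=128):
    """Return the (width, height) of each pyramid level, largest first."""
    levels = []
    k = 0
    while True:
        # Each level is the source frame scaled down by scale ** k.
        w = int(frame_w / scale ** k)
        h = int(frame_h / scale ** k)
        # Stop once a complete HoG window no longer fits in the scaled frame.
        if w < window_w or h < window_h:
            break
        levels.append((w, h))
        k += 1
    return levels

if __name__ == "__main__":
    levels = pyramid_levels(640, 480)
    print(len(levels), levels[0], levels[-1])
    print(sum(w * h for w, h in levels))  # total pixels across all levels
```

A smaller scaling factor such as 1.05 produces many more levels than 1.2 for the same frame, which is one reason the choice of scaling factor has a direct effect on the total amount of computation.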
The closed form for the number of pixels in the scale pyramid is expressed in terms of s, the scale multiplier for the total number of pixels in the scale pyramid:
$$s = \frac{\alpha^{-m} - \alpha}{1 - \alpha}$$
Here, α is the scaling factor used between pyramid levels, and m = log(W/H)/log(α), where W and H are the respective width and height of the input image/video frame. The total number of pixels to consider in the scale pyramid is therefore s*W*H.
As can be seen in FIG. 2 and shown in the calculation above, the HoG descriptor for a system using 9-bin histograms for each of the 7-by-15 blocks of 2-by-2 cells in the window produces a 3,780-value descriptor (3.78 kB at one byte per value) for each 64-by-128 window that is examined in the incoming image.
The images used to train such classifiers are typically rectangular as a by-product of the 2D image sensor arrays used to capture images. Add to this the simplicity of stepping a rectangular descriptor across a rectangular source image and convolving to detect a match, and it is easy to see why this paradigm has taken root. While some objects, such as furniture, may indeed be square or rectangular, most objects of interest in classification are not easily represented by simple geometric shapes. A rectangular reference image is therefore a poor match to such objects. Indeed, using a rectangular reference image means that significant additional work has to be done to convolve pixels that are not relevant to the matching task. Furthermore, these pixels mean that some of the background surrounding the object of interest is aliased into the descriptor used to match images, thus confounding and degrading the accuracy of the matching operation.
The computational cost of each HoG data set is very high. One estimate is given by Dziri, Chevobbe, and Darouich in their 2013 paper: “Gesture recognition on smart camera,” CEA LIST, Embedded Computing Laboratory, 2013. For example, applying HoG to a 42-by-42 pixel region of interest requires the following operations: 11,664 additions, 1,296 multiplications, 5,200 divisions, 16 square roots, and 5,184 arctangents. The computation requires numerous costly and complex mathematical operations such as division, square root, and arctangent, which take multiple cycles to implement in software on a conventional sequential processor. The computation also requires large numbers of more common mathematical operations such as addition and multiplication, which typically execute in as little as one clock cycle. These costs are compounded by the fact that a brute-force search, in which an HoG template is stepped over the entire image for comparison, becomes even more expensive as the resolution of the input image increases. Furthermore, in scenarios where objects may be seen at a range of distances, it is often necessary to search candidate windows of different sizes, further increasing the computational cost.
HoG/SVM is a very expensive operation. Many optimizations, from changing the scale factor to modifying the block size by which the HoG window is stepped across the scaled source image, can be used to prune the search space and hence limit the computational effort. These factors combined mean that robust real-time HoG is confined to very high-specification desktop systems that often offload computations to a high-performance graphics processing unit (GPU). This pushes the power costs far beyond the bounds of mobile devices such as phones, tablets, and mobile robots.
While it is possible to subsample the input image and perform a range of optimizations for mobile platforms, this often comes at a huge loss in matching accuracy, rendering the mobile implementation of very limited utility. Nonetheless, further optimizations to limit the computational expense of HoG processes are desired.