Images can generally be divided into two categories: static images and dynamic video images. Methods for detecting a target (i.e., a specific object) in a static image or in dynamic video images fall into two general categories. In the first category of methods, a classifier distinguishing the target from the background is created from features of static images and used to detect the object or target in an image; for dynamic video images, each frame of the video is treated as a static image for detection. In the second category of methods, a specific object in the video images is detected by combining the static features of the images with information on inter-frame correlation, motion, sound, etc., of the video. The former category of methods is the basis for detecting a specific object in an image.
In Viola P, Jones M J, "Rapid Object Detection Using a Boosted Cascade of Simple Features", Proc. of International Conference on Computer Vision and Pattern Recognition, 2001, 1:511-518 (hereinafter referred to as reference document 1), a target in a static image is detected using Haar-like rectangular features, and a boosting approach is used to select the features for use automatically.
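As an illustrative sketch (function names and the particular two-rectangle feature chosen here are assumptions for illustration, not taken from reference document 1), a Haar-like rectangular feature can be evaluated in constant time from an integral image:

```python
import numpy as np

def integral_image(img):
    """Cumulative row/column sums, padded with a zero row and column so that
    any rectangle sum costs only four array lookups."""
    ii = np.cumsum(np.cumsum(img.astype(float), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle whose top-left corner is (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    """A two-rectangle Haar-like feature: sum of the upper half-rectangle
    minus sum of the lower half-rectangle."""
    half = h // 2
    return rect_sum(ii, y, x, half, w) - rect_sum(ii, y + half, x, half, w)
```

A boosting procedure would then pick, from many such features at many positions and scales, the few whose thresholded responses best separate target from background.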
In Viola P, Jones M J, Snow D, "Detecting pedestrians using patterns of motion and appearance", Computer Vision, 2003, 734-741 (hereinafter referred to as reference document 2), Viola observes that the motion of a pedestrian in video has unique characteristics, and that a feature describing the oriented amplitude of the motion can be extracted from the differential image between frames and its variations, then trained together with the static features to derive a classifier. This method, however, cannot be applied when the camera is moving.
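A minimal sketch of the inter-frame differencing that underlies such motion features follows; the function name, the shift convention, and the dictionary keys are assumptions for illustration and do not reproduce the exact filters of reference document 2:

```python
import numpy as np

def motion_difference_filters(prev, curr):
    """Absolute frame difference plus differences against directionally shifted
    versions of the current frame. A shifted difference that is small relative
    to the plain difference hints at motion along that shift axis."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    delta = np.abs(curr - prev)
    shifted = {
        "shift_up":    np.abs(np.roll(curr, -1, axis=0) - prev),
        "shift_down":  np.abs(np.roll(curr,  1, axis=0) - prev),
        "shift_left":  np.abs(np.roll(curr, -1, axis=1) - prev),
        "shift_right": np.abs(np.roll(curr,  1, axis=1) - prev),
    }
    return delta, shifted
```

Rectangular sums over these difference images can then be fed to the same boosted feature selection as the static Haar-like features.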
In Lienhart R, Maydt J., "An extended set of Haar-like features for rapid object detection", IEEE ICIP, 2002 (hereinafter referred to as reference document 3), the rectangular features of a static image are extended by adding features rotated at an angle of 45 degrees, etc. Both the Haar-like features and the extended rectangular features, however, are sums over all pixels in a rectangular block, without taking into account the distribution of the features within the block.
In N. Dalal, B. Triggs, "Histograms of Oriented Gradients for Human Detection", Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005: 886-893 (hereinafter referred to as document 4), a pedestrian in an image is detected using Histograms of Oriented Gradients (HOG) features: gradients are calculated at the respective locations of a target, the oriented gradients are summed into histograms, ratios of the gradient sums between areas are taken as features, and a Support Vector Machine (SVM) is used for training. Owing to the statistical nature of the histogram, small variations in the target's extent and angle can be accommodated.
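The core of the HOG computation can be sketched as follows; the cell size, the nine unsigned orientation bins, and the L2 block normalization are common choices assumed here for illustration rather than a faithful reproduction of document 4:

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one cell,
    using unsigned orientations in [0, 180) degrees."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())  # accumulate magnitudes per bin
    return hist

def block_normalize(cell_hists, eps=1e-6):
    """L2-normalize the concatenated cell histograms of a block, so that the
    relative distribution of gradient energy between cells is what survives."""
    v = np.concatenate(cell_hists)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
```

The normalized block vectors from all blocks of a detection window are concatenated into one feature vector for the SVM.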
In N. Dalal, B. Triggs, and C. Schmid, "Human Detection Using Oriented Histograms of Flow and Appearance", Proc. European Conference on Computer Vision, 2006 (hereinafter referred to as document 5), oriented histogram features are taken from the optical flow field of video to extract a motion feature of a pedestrian, and these features are used for detection in combination with the static Histograms of Oriented Gradients features. These histogram features are likewise based on rectangular blocks: they are obtained by summing the features within a block and normalizing by ratios between blocks, again without taking into account the distribution of the features within a block.
In Qiang Zhu, Shai Avidan, Mei-Chen Yeh, Kwang-Ting Cheng, "Fast Human Detection Using a Cascade of Histograms of Oriented Gradients", Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 1491-1498, 2006 (hereinafter referred to as document 6), a method is proposed for rapid detection using HOG features of varying sizes. This method first calculates an integral image for each gradient orientation and then calculates simplified HOG features from these integral images. To detect persons of different sizes, the method changes the size of the feature instead of the size of the image. Such a practice equivalently modifies the classifier and thus causes a loss of performance. Moreover, this detection method takes approximately 200 ms per QVGA frame, which means that it is not real-time. Incidentally, QVGA stands for Quarter VGA, a fixed resolution of one quarter the size of VGA, i.e., a display presented on a screen at a resolution of 240×320 pixels.
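The per-orientation integral-image idea of document 6 can be sketched as below; the binning scheme and function names are assumptions for illustration:

```python
import numpy as np

def oriented_gradient_integrals(img, n_bins=9):
    """One integral image per orientation bin: each plane accumulates gradient
    magnitude where the orientation falls in that bin, so the orientation
    histogram of any rectangle costs 4*n_bins lookups regardless of its size."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    planes = np.zeros((n_bins,) + img.shape)
    for b in range(n_bins):
        planes[b][bins == b] = mag[bins == b]
    ii = np.cumsum(np.cumsum(planes, axis=1), axis=2)
    return np.pad(ii, ((0, 0), (1, 0), (1, 0)))

def rect_hog(ii, y, x, h, w):
    """Orientation histogram of the h-by-w rectangle at (y, x), one
    constant-time rectangle sum per orientation bin."""
    return (ii[:, y + h, x + w] - ii[:, y, x + w]
            - ii[:, y + h, x] + ii[:, y, x])
```

Because any rectangle's histogram is equally cheap, features of many sizes can be evaluated without rescaling the image, which is precisely what allows the feature size rather than the image size to be varied.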
Moreover, no classifier has perfect performance: an improper response indicating detection of an object may be made at a location where the object is absent, or a plurality of detection responses may be made around a single object. A post-processing method for removing the improper responses and combining the repeated responses is therefore required. In an existing object detection method, it is typical to determine the extent of overlap among the series of detection windows resulting from the processing by the classifier, and then to post-process these detection windows according to the determined overlap to determine the presence and location of a specific object in the image to be detected. Specifically, if the overlap between two detection windows exceeds a determined threshold value, both detection windows are determined to relate to the same specific object and are combined into a single detection window for that object. This method, however, suffers from low processing precision. It also works poorly when specific objects in the image partially overlap, because detection windows corresponding to different specific objects may be determined to relate to the same object and combined, so that the partially overlapping objects cannot be distinguished accurately. In Navneet Dalal, "Finding people in images and videos", doctoral thesis, July 2006, a mean-shift based post-processing method was proposed. This method performs the post-processing mainly through a typical peak-search approach, but it still fails to satisfactorily distinguish objects (persons) that are in proximity or even partially overlapping, and it suffers from a complex process and a heavy processing load on the system.
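The coarse overlap-threshold merging criticized above can be sketched as follows; the greedy strategy, the intersection-over-union overlap measure, and the averaging of window coordinates are assumptions chosen for illustration:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) detection windows."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_windows(windows, thresh=0.5):
    """Greedy merge: any window whose overlap with an already-merged window
    exceeds `thresh` is averaged into it. Note the failure mode described
    above: windows on two distinct but partially overlapping objects can
    exceed the threshold and collapse into a single response."""
    merged = []
    for w in windows:
        for i, m in enumerate(merged):
            if iou(w, m) > thresh:
                merged[i] = tuple((a + b) / 2 for a, b in zip(m, w))
                break
        else:
            merged.append(tuple(map(float, w)))
    return merged
```

Two nearly coincident windows collapse into one response, while a distant window survives separately, which illustrates both the intended de-duplication and why closely overlapping objects are not separated.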