The general problem is to search in images for the presence of targets of various types, which may be objects or persons, the targets presenting certain characteristics conforming to a model. For example, it may involve a parametric model, such as a ratio between width and height, which must have a given value λ, or a three-dimensional CAD model.
A method of this type for detecting targets based on a model becomes difficult to carry out in the event of substantial variability of appearance of the targets. For example, the appearance of a person may vary substantially according to his posture or clothing. The method may even become impossible to carry out. For example, the operator of a parking area will have immense difficulty in detecting trucks when he does not have the CAD models of the different types of truck.
In these cases where the modeling of targets proves difficult or even impossible, a known solution consists in carrying out a statistical learning step OFF-LINE, i.e. prior to the operation of the detection system, and a classification step ON-LINE, i.e. simultaneously with the operation of the detection system. In fact, the classification step forms an integral part of the detection process: if a system for detecting pedestrians is considered, a detection takes place when a target has been classified as a “pedestrian”.
The prior statistical learning step consists in learning to recognize targets using an algorithm which automatically extracts the most relevant parameters of the targets in order to distinguish them from the other elements which may be present in the images. This in fact involves creating statistical models of data extracted from a collection of “typical” images of targets. These statistical models are used later during the simultaneous classification step. The simultaneous classification step is carried out in real time on the images most recently supplied by the cameras. It involves comparing new data extracted from the “real” images with the statistical models built during the learning step on the basis of the “typical” images.
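By way of illustration only, the two steps can be sketched as follows. The sketch assumes a nearest-centroid classifier, which is a deliberately simple stand-in and not the learning algorithm of any particular system; the function names `learn_models` and `classify`, and the two-value descriptors, are hypothetical.

```python
import math


def learn_models(training_set):
    """OFF-LINE step: build one statistical model per class.

    Assumption: the "model" is simply the mean descriptor (centroid)
    of the typical examples of each class.
    """
    sums, counts = {}, {}
    for label, descriptor in training_set:
        acc = sums.setdefault(label, [0.0] * len(descriptor))
        for i, value in enumerate(descriptor):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(models, descriptor):
    """ON-LINE step: compare new data with the models learned off-line.

    Returns the label whose centroid is closest (Euclidean distance).
    """
    return min(models, key=lambda label: math.dist(models[label], descriptor))


# Hypothetical descriptors extracted from "typical" images.
typical = [
    ("pedestrian", [1.0, 3.0]),
    ("pedestrian", [1.2, 3.2]),
    ("background", [4.0, 1.0]),
    ("background", [4.2, 0.8]),
]
models = learn_models(typical)           # done once, before operation
label = classify(models, [1.1, 2.9])     # done on each "real" image
```

In this sketch a detection takes place whenever `classify` returns the label of the target class, which reflects the fact that classification forms an integral part of detection.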
Thus, systems already allow the detection and recognition of stationary or mobile objects or persons using pairs of images supplied by calibrated cameras forming a stereoscopic head, for example two horizontally disposed cameras.
A system of this type first calculates a disparity map for each pair of images, representing the difference between the left image and the right image. More exactly, the disparity is the difference in pixel position between two images for the same observed point of the scene. Through triangulation, this deviation allows the z coordinate of the pixels of the image to be calculated and therefore depth information (3D) on the observed scene to be obtained. Sometimes represented by grey levels, a disparity map of this type is also generally referred to as a disparity image.
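The triangulation mentioned above can be sketched as follows, assuming a rectified pair of horizontally disposed pinhole cameras, for which the depth is z = f·B/d, with f the focal length in pixels, B the baseline between the cameras and d the disparity in pixels. The function and parameter names are hypothetical, and the numerical values are chosen purely for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Return the depth z in metres of one pixel, or None if it cannot be computed.

    Assumption: rectified horizontal stereo pair, so z = f * B / d.
    A disparity of zero or less means no valid match (or a point at infinity).
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


def depth_map(disparity_map, focal_px, baseline_m):
    """Convert a disparity map (list of rows) into a depth map of the same shape."""
    return [
        [depth_from_disparity(d, focal_px, baseline_m) for d in row]
        for row in disparity_map
    ]


# Illustrative values: 800-pixel focal length, 0.25 m baseline.
disparities = [[8, 16], [0, 4]]
depths = depth_map(disparities, focal_px=800, baseline_m=0.25)
```

Since depth is inversely proportional to disparity, nearby points produce large disparities and distant points small ones, which is why a disparity image rendered in grey levels gives a direct visual impression of depth.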
A system of this type then models the appearance of the objects present in the image during a statistical learning process. This process is based on a set of descriptors calculated in the image, such as the grey levels, the RGB (Red-Green-Blue) data, the successive derivatives of the signal, convolutions by a set of specific filters or histograms.
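Two of the descriptors listed above, a histogram of grey levels and a first derivative of the signal, can be sketched as follows. The sketch assumes an 8-bit grey-level image stored as a list of rows; the function names, the number of bins and the choice of a simple forward difference are assumptions made for illustration.

```python
def grey_histogram(image, bins=4, max_level=256):
    """Normalized histogram of grey levels (each bin holds a fraction of the pixels)."""
    hist = [0] * bins
    for row in image:
        for value in row:
            hist[value * bins // max_level] += 1
    total = sum(hist)
    return [count / total for count in hist]


def horizontal_derivative(image):
    """First derivative along x, approximated by a forward difference.

    Each output row is one pixel shorter than the input row.
    """
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]


# Tiny illustrative image: 2 rows of 4 grey levels in [0, 255].
image = [
    [0, 64, 128, 255],
    [0, 0, 255, 255],
]
histogram = grey_histogram(image)
gradient = horizontal_derivative(image)
```

Descriptors such as these would typically be concatenated into a single vector per image region before being passed to the statistical learning process.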
Finally, the video streams supplied by the two cameras allow a map of the estimated 2D positions of the pixels to be calculated for each of the left and right cameras. This information is important for distinguishing moving objects. It allows better segmentation of the objects, notably when the camera is stationary or when the movement of the objects is sufficiently different from that of the camera, as in the case of a pedestrian crossing the road in front of a moving automobile carrying the cameras.
For example, the article “Improved Multi-Person Tracking with Active Occlusion Handling” (A. Ess, K. Schindler, B. Leibe, L. van Gool, ICRA Workshop on People Detection and Tracking, May 2009) describes such a method of detection, recognition and even tracking of objects using a stereoscopic head. It carries out a plurality of steps to integrate the previously described luminance, position and depth information. This method is based on the prior detection of areas of interest and on the representation of these areas by a dictionary of elementary patterns. The learning step comprises a learning step of a “Codebook of Local Appearance”, which is a dictionary of elementary visual patterns which may be encountered on objects, and a learning step of “Implicit Shape Models”, which are the relative positions of these elementary patterns on the objects. During the classification step aiming to detect objects, a first detector searches, in the images and on the basis of the dictionary, for areas of interest likely to contain objects, then a second detector searches for objects in the areas of interest. Finally, a voting mechanism allows recognition of the objects.
A major disadvantage of this method is that it is based on a plurality of successive pre-detection steps: firstly pre-detection of areas of interest, then pre-detection of objects in the areas of interest, and only then recognition of the objects through classification. A significant number of non-detections may result from this, as these successive pre-detection steps are based in a certain manner on an “all or nothing” mechanism: if an upstream step yields a negative result, the downstream steps are not even carried out, even though they could have proven effective in “correcting” the non-detection of the upstream step. And if, in order to attempt to weaken this “all or nothing” mechanism, the number of areas of interest is increased, a veritable explosion in calculation times then occurs.
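The effect of the “all or nothing” mechanism on the detection rate can be illustrated with a short calculation. The sketch assumes that each step is characterized by the probability that it retains a true target, given that the preceding steps retained it; the function name and the numerical values are hypothetical and are not measurements of any real system.

```python
def cascade_recall(stage_recalls):
    """Overall probability that a true target survives every step of a cascade.

    Assumption: stage_recalls[i] is the probability that step i keeps a true
    target, given that all earlier steps kept it. A target rejected by any
    step is lost for good, so the probabilities multiply.
    """
    recall = 1.0
    for stage_recall in stage_recalls:
        recall *= stage_recall
    return recall


# Three successive steps, each retaining 90% of the true targets it receives.
overall = cascade_recall([0.9, 0.9, 0.9])
```

Because each factor is at most 1, the overall rate can never exceed that of the weakest step, and with three steps at 90% it falls to roughly 73%, even though no individual step appears to perform poorly.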