This invention relates generally to image processing systems and more particularly to systems for detecting objects in images.
As is known in the art, an analog or continuous parameter image such as a still photograph or a frame in a video sequence may be represented as a matrix of digital values and stored in a storage device of a computer or other digital processing device. When an image is represented in this way, it is generally referred to as a digital image. It is desirable to digitize an image such that the image may be digitally processed by a processing device.
Images which illustrate items or scenes recognizable by a human typically contain at least one object such as a person's face, an entire person, a car, etc. Some images, referred to as "cluttered" images, contain more than one object of the same type and/or more than one type of object. In a single image or picture of a city street, for example, a number of objects such as people walking on a sidewalk, street signs, light posts, buildings and cars may all be visible within the image. Thus, an image may contain more than one type or class of object (e.g. pedestrians as one class and cars as a different class) as well as multiple instances of objects of the same type (e.g. multiple pedestrians walking on a sidewalk).
As is also known, object detection refers to the process of detecting a particular object or a particular type of object contained within an image. In the object detection process, an object class description is important since the object detection process requires a system to differentiate between a particular object class and all other possible types of objects in the rest of the world. This is in contrast to pattern classification, in which it is only necessary to decide between a relatively small number of classes.
Furthermore, in defining or modeling complicated classes of objects (e.g., faces, pedestrians, etc.) the intra-class variability itself is significant and difficult to model. Since it is not known how many instances of the class are present in any particular image or scene, if any, the detection problem cannot easily be solved using methods such as maximum-a-posteriori probability (MAP) or maximum likelihood (ML) methods. Consequently, the classification of each pattern in the image must be performed independently. This makes the decision process susceptible to missed instances of the class and to false positives. Thus, in an object detection process, it is desirable for the class description to have large discriminative power, thereby enabling the processing system to recognize particular object types in a variety of different images including cluttered and uncluttered images.
One problem, therefore, with the object detection process arises due to difficulties in specifying appropriate characteristics to include in an object class. Characteristics used to specify an object class are referred to as a class description.
To help overcome the difficulties and limitations of object detection due to class descriptions, one approach to detecting objects utilizes motion and explicit segmentation of the image. Such approaches have been used, for example, to detect people within an image. One problem with this approach, however, is that an object of the type intended to be detected may not be moving. In this case, the utilization of motion would not aid in the detection of the object.
Another approach to detecting objects in an image is to utilize trainable object detection. Such an approach has been utilized to detect faces in cluttered scenes. The face detection system utilizes models of face and non-face patterns in a high dimensional space and derives a statistical model for a particular class such as the class of frontal human faces. Frontal human faces, despite their variability, share similar patterns (shape and the spatial layout of facial features) and their color space is relatively constrained.
Such an approach, without a flexible scheme to characterize the object class, will not be well suited to provide optimum performance unless the objects such as faces have similar patterns (shape and the spatial layout of facial features) and relatively constrained color spaces. Thus, such an approach is not well-suited to detection of those types of objects, such as pedestrians, which typically have dissimilar patterns and relatively unconstrained color spaces.
The detection of objects, such as pedestrians for example, having significant variability in the patterns and colors within the boundaries of the object can be further complicated by the absence of constraints on the image background. Given these problems, direct analysis of pixel characteristics (e.g., intensity, color and texture) is not adequate to reliably and repeatedly detect objects.
One technique, sometimes referred to as the ratio template technique, detects faces in cluttered scenes by utilizing a relatively small set of relationships between face regions. The set of relationships is collectively referred to as a ratio template and provides a constraint for face detection. The ratio template encodes the ordinal structure of the brightness distribution on an object such as a face, and consists of a set of inequality relationships between the average intensities of a few different object regions. As applied to faces, for example, the ratio template consists of a set of inequality relationships between the average intensities of a few different face regions.
This technique utilizes the concept that while the absolute intensity values of different regions may change dramatically under varying illumination conditions, their mutual ordinal relationships (binarized ratios) remain largely unaffected. Thus, for instance, the forehead is typically brighter than the eye-socket regions for all but the most contrived lighting setups.
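The ordinal test described above can be sketched in a few lines. The following is a minimal illustration, not the patented method itself: the region coordinates, the `margin` parameter, and the synthetic "face" are all hypothetical, chosen only to show how binarized ratios between region averages remain stable under illumination changes.

```python
import numpy as np

def region_mean(image, region):
    """Average intensity over a rectangular region (top, left, bottom, right)."""
    t, l, b, r = region
    return image[t:b, l:r].mean()

def matches_ratio_template(image, relations, margin=1.05):
    """Check whether every ordinal (brighter-than) relation holds.

    `relations` is a list of (brighter_region, darker_region) pairs; the
    brighter region's mean must exceed `margin` times the darker region's
    mean for the relation to count as satisfied.
    """
    for brighter, darker in relations:
        if region_mean(image, brighter) < margin * region_mean(image, darker):
            return False
    return True

# Hypothetical face relation: forehead band brighter than an eye-socket patch.
face = np.full((64, 64), 100.0)
face[10:20, 16:48] = 180.0   # bright forehead band
face[24:32, 18:28] = 60.0    # dark eye-socket patch
relations = [((10, 16, 20, 48), (24, 18, 32, 28))]
print(matches_ratio_template(face, relations))  # True for this synthetic face
```

Note that scaling the whole image by a constant illumination factor leaves every ratio, and hence the decision, unchanged.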
The ratio template technique overcomes some but not all of the problems associated with detecting objects having significant variability in the patterns and colors within the boundaries of the object and with detection of such objects in the absence of constraints on the image background.
Nevertheless, it would be desirable to provide a technique to reliably and repeatedly detect objects, such as pedestrians, which have significant variability in patterns and colors within the boundaries of the object and which can detect objects even in the absence of constraints on the image background. It would also be desirable to provide a formalization of a template structure in terms of simple primitives, a rigorous learning scheme capable of working with real images, and also to provide a technique to apply the ratio template concept to relatively complex object classes such as pedestrians. It would further be desirable to provide a technique and architecture for object detection which is trainable and which may also be used to detect people in static or video images of cluttered scenes. It would further be desirable to provide a system which can detect highly non-rigid objects with a high degree of variability in size, shape, color, and texture and which does not rely on any a priori (hand-crafted) models or on changes in position of objects between frames in a video sequence.
In accordance with the present invention, an object detection system includes an image preprocessor for moving a window across the image and a classifier coupled to the preprocessor for classifying the portion of the image within the window. The classifier includes a wavelet template generator which generates a wavelet template that defines the shape of an object with a subset of the wavelet coefficients of the image. The wavelet template includes a set of regular regions of different scales that correspond to the support of a subset of significant wavelet functions. The relationships between different regions are expressed as constraints on the values of the wavelet coefficients. With this particular arrangement, a system which is trainable and which detects objects in static or video images of cluttered scenes is provided. The wavelet template defines an object as a set of regions and relationships among the regions. Use of a wavelet basis to represent the template yields both a computationally efficient technique and an effective learning scheme. By using a wavelet template that defines the shape of an object in terms of a subset of the wavelet coefficients of the image, the system can detect highly non-rigid objects such as people and other objects with a high degree of variability in size, shape, color, and texture. The wavelet template is invariant to changes in color and texture and can be used to robustly define a rich and complex class of objects such as people. The system utilizes a model that is automatically learned from examples and thus can avoid the use of motion and explicit image segmentation to detect objects in an image. The system further includes a training system coupled to the classifier, the training system including a database of both positive and negative examples and a quadratic programming solver. The system utilizes a general paradigm for object detection.
The system is trainable and utilizes example-based models. Furthermore, the system is reconfigurable and extendible to a wide variety of object classes.
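The window-scanning operation of the preprocessor can be sketched as follows. This is a simplified illustration, not the claimed system: the 128x64 window, the stride, and the toy thresholding "classifier" are assumptions chosen for the example, and in the actual system the classifier would operate on wavelet coefficients rather than raw pixel means.

```python
import numpy as np

def sliding_window_detect(image, classifier, window=(128, 64), stride=8):
    """Slide a fixed-size window over the image and classify each patch.

    `classifier` is any callable mapping a window-sized patch to True/False;
    returns the (row, col) origin of every window flagged as containing the
    object class.
    """
    h, w = window
    detections = []
    for r in range(0, image.shape[0] - h + 1, stride):
        for c in range(0, image.shape[1] - w + 1, stride):
            if classifier(image[r:r + h, c:c + w]):
                detections.append((r, c))
    return detections

# Toy stand-in classifier: flag windows whose mean intensity is high.
img = np.zeros((256, 256))
img[64:192, 96:160] = 255.0          # a bright 128x64 "object"
hits = sliding_window_detect(img, lambda patch: patch.mean() > 200)
```

Because the classifier is passed in as a callable, the same scanning loop is reusable for any object class, which reflects the reconfigurable, extendible character of the system.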
In accordance with a further aspect of the present invention, a wavelet template includes a set of regular regions of different scales that correspond to the support of a subset of significant wavelet functions of an image. The relationships between different regions are expressed as constraints on the values of the wavelet coefficients. The wavelet template can compactly express the structural commonality of a class of objects and is computationally efficient. It is learnable from a set of examples and provides an effective tool for the challenging problem of detecting pedestrians in cluttered scenes. With this particular technique, a learnable wavelet template provides a framework that is extensible to the detection of complex object classes including but not limited to the pedestrian object class. The wavelet template is an extension of the ratio template and addresses, in the context of pedestrian detection, certain issues not addressed by the ratio template. By using a wavelet basis to represent the template, a computationally efficient technique for detecting objects as well as an effective learning scheme is provided.
The success of the wavelet template for pedestrian detection comes from its ability to capture high-level knowledge about the object class (structural information expressed as a set of constraints on the wavelet coefficients) and incorporate it into the low-level process of interpreting image intensities. Attempts to directly apply low-level techniques such as edge detection and region segmentation are likely to fail in the images which include highly non-rigid objects having a high degree of variability in size, shape, color, and texture since these methods are not robust, are sensitive to spurious details, and give ambiguous results. Using the wavelet template, only significant information that characterizes the object class, as obtained in a learning phase, is evaluated and used.
The approach of the present invention as applied to a pedestrian template is learned from examples and then used for classification, ideally in a template matching scheme. It is important to realize that this is not the only interpretation of the technique. An alternative, and perhaps more general, utilization of the technique includes the step of learning the template as a dimensionality reduction stage. Using all the wavelet functions that describe a window of 128×64 pixels would yield vectors of very high dimensionality. The training of a classifier with such a high dimensionality would in turn require an example set which may be too large to utilize in practical systems using present day technology.
The template learning stage serves to select the basis functions relevant for this task and to reduce their number considerably. In one particular embodiment, twenty-nine basis functions are used. A classifier, such as a support vector machine (SVM), can then be trained on a small example set. From this point of view, learning the pedestrian detection task consists of two learning steps: (1) dimensionality reduction, that is, task-dependent basis selection and (2) training the classifier. In this interpretation, a template in the strict sense of the word is neither learned nor used. It should be appreciated of course that in other applications and embodiments, it may be desirable to not reduce the number of basis functions but instead to use all available basis functions. In this case, all of the basis functions are provided to the classifier.
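The two learning steps can be sketched as below. Everything here is an assumed toy construction: the magnitude-based selection rule is one plausible stand-in for task-dependent basis selection, a nearest-mean classifier stands in for the SVM (avoiding any dependency on an SVM library), and the synthetic data with ten informative coefficients is fabricated purely for illustration.

```python
import numpy as np

def select_basis(train_vectors, k=29):
    """Step 1, dimensionality reduction: keep the k coefficients whose average
    magnitude over the training set is largest (a stand-in for selecting the
    significant wavelet functions of the learned template)."""
    idx = np.argsort(np.abs(train_vectors).mean(axis=0))[::-1][:k]
    return np.sort(idx)

class NearestMeanClassifier:
    """Step 2: toy stand-in for the SVM trained on the reduced vectors."""
    def fit(self, X, y):
        self.mu_pos = X[y == 1].mean(axis=0)
        self.mu_neg = X[y == 0].mean(axis=0)
        return self

    def predict(self, X):
        d_pos = np.linalg.norm(X - self.mu_pos, axis=1)
        d_neg = np.linalg.norm(X - self.mu_neg, axis=1)
        return (d_pos < d_neg).astype(int)

# Synthetic example: many raw coefficients, only the first ten informative.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = rng.normal(0.0, 0.1, (100, 1326))
X[y == 1, :10] += 5.0                     # informative coefficients
keep = select_basis(X, k=29)              # reduced, task-dependent basis
clf = NearestMeanClassifier().fit(X[:, keep], y)
acc = (clf.predict(X[:, keep]) == y).mean()
```

The point of the sketch is the division of labor: the selection stage shrinks the feature vectors so that the downstream classifier can be trained on a small example set.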
In accordance with a still further aspect of the present invention, an object detection system includes an optical flow processor which receives frames from a video sequence and computes the optical flow between images in the frames, and a discontinuity detector coupled to the optical flow processor. The discontinuity detector detects discontinuities in the flow field that indicate probable motion of objects relative to the background in the frame. A detection system is coupled to the discontinuity detector and receives information indicating which regions of an image or frame are likely to include objects having motion. With this particular arrangement, an object detection system which utilizes motion information to detect objects is provided. The frames may be consecutive frames in a video sequence. The discontinuity detector detects discontinuities in the flow field that indicate probable motion of objects relative to the background, and the detected regions of discontinuity are grown using morphological operators to define the full regions of interest. In these regions of motion, the likely class of objects is limited, thus the strictness of the classifier can be relaxed.
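The motion pipeline can be sketched in highly simplified form. This is not the claimed optical flow computation: simple frame differencing stands in as a proxy for flow-field discontinuities, and a hand-rolled 3x3 binary dilation stands in for the morphological growing operator; the threshold and test frames are arbitrary.

```python
import numpy as np

def dilate(mask, iters=2):
    """Binary dilation with a 3x3 cross structuring element
    (morphological growing of detected discontinuities)."""
    m = mask.copy()
    for _ in range(iters):
        p = np.pad(m, 1)
        m = p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] | m
    return m

def motion_regions(frame_a, frame_b, thresh=30):
    """Flag pixels whose inter-frame change suggests object motion, then
    grow them into full regions of interest for the downstream detector."""
    moving = np.abs(frame_b.astype(int) - frame_a.astype(int)) > thresh
    return dilate(moving)

# Two toy frames: a small patch "appears" between frame_a and frame_b.
frame_a = np.zeros((10, 10), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[4:6, 4:6] = 255
roi = motion_regions(frame_a, frame_b)
```

The resulting mask `roi` marks where the downstream detection system should look, allowing its classification threshold to be relaxed inside those regions.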