1. Technical Field
The invention is related to a system and process for locating people and objects of interest in a scene, and more particularly, to a system and process that locates and clusters three-dimensional regions within a depth image, and identifies the content and position of clustered regions by comparing the clusters to a model.
2. Related Art
Most current systems for determining the presence of persons or objects of interest in an image of a scene have involved the use of a sequence of pixel intensity-based images, or intensity images for short. For example, a temporal sequence of color images of a scene is often employed for this purpose [1]. Persons or objects are typically recognized and tracked in these systems based on motion detected by one of three methods, namely by background subtraction [2], by adaptive template correlation, or by tracking color contour models [3, 4].
While the aforementioned locating methods are useful, they do have limitations. For example, the use of intensity images results in the presence of background "clutter" that significantly affects the reliability and robustness of these techniques. In addition, the adaptive templates employed in the adaptive template correlation techniques tend to drift as they pick up strong edges or other features from the background, and color contour tracking techniques are susceptible to degradation by intensity gradients in the background near the contour. Further, the image differencing methods typically used in the foregoing techniques are sensitive to shadows, changes in lighting conditions or camera gain, and micro-motions between images. As a result, discrimination of foreground from background is difficult.
More recently, the use of sequential range images of the scene has been introduced into systems for locating persons and objects, and for tracking their movements on a real time basis [5, 6, 7]. In general, the advantage of using range images over intensity images is that the range information can be used to discriminate the three-dimensional shape of objects, which can be useful in both locating and tracking. For example, occluding surfaces can be found and dealt with as the tracked object moves behind them. Recognizing objects is also easier, since the actual size of the object, rather than its image size, can be used for matching. Further, tracking using range information presents fewer problems for segmentation, since range information is relatively unaffected by lighting conditions.
While the locating and tracking systems employing range information can provide superior performance in comparison to systems employing only intensity images, there is still considerable room for improvement. For example, the aforementioned systems typically use range information for background subtraction purposes, but rely mostly on intensity image information to locate individual people or objects in the scene being analyzed. Further, when using a background subtraction process, objects in the scene being analyzed tend to separate into a plurality of distinct three-dimensional regions. For these and other reasons, systems using such methods tend to exhibit poor discriminatory ability when two people or objects are close together in the scene. The system and process according to the present invention resolve the deficiencies of current locating and tracking systems employing range information.
It is noted that in the preceding paragraphs, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references are identified by a pair of brackets containing more than one designator, for example, [5, 6, 7]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention involves a new system and process for use in an object recognition scheme for comparing three-dimensional regions (referred to as "blobs") in images to one or more models in order to identify the location of people or objects within a scene. This object recognition scheme allows for real-time location and tracking of people or objects of interest within the scene. The technique generally entails first generating an initial three-dimensional depth image, often referred to as a background or baseline depth image, of the scene or area of interest. The baseline depth image is generated using conventional methods such as a stereo camera mechanism. Conventional processing of the baseline depth image is used to identify the spatial coordinates of three-dimensional image pixels within the three-dimensional volume represented by the image. During identification and location operations, an image acquisition process, such as, for example, a stereo camera mechanism, is used to capture live depth images at any desired scan rate. The identification and location of people and/or objects may then be determined by processing a working image obtained from a background subtraction process using the baseline depth image and a live depth image. In other words, the baseline depth image is subtracted from the live depth image. Any pixel in the live depth image that differs significantly from the background image becomes part of the working image that is then processed to identify and locate people or objects.
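The background subtraction step described above can be illustrated with a minimal sketch. It assumes that each depth image is a row-major grid of depth values and that a "significant" difference means one exceeding a fixed threshold; the function name `subtract_background`, the `threshold` parameter, and the use of `None` to mark background pixels are hypothetical choices made for illustration and are not taken from the description.

```python
def subtract_background(baseline, live, threshold=0.1):
    """Build a working image from a baseline and a live depth image.

    Both images are lists of rows of depth values (assumed units: meters).
    A live pixel is kept only if it differs from the baseline pixel by more
    than `threshold` (an assumed criterion); otherwise it is marked None.
    """
    working = []
    for base_row, live_row in zip(baseline, live):
        working.append([
            depth if abs(depth - base) > threshold else None
            for base, depth in zip(base_row, live_row)
        ])
    return working


baseline = [[5.0, 5.0], [5.0, 5.0]]
live = [[5.0, 2.0], [5.02, 5.0]]  # one pixel is now much closer to the camera
working = subtract_background(baseline, live)
```

In this example only the pixel whose depth changed from 5.0 to 2.0 survives into the working image; the small 0.02 fluctuation is treated as noise.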
The aforementioned background subtraction process typically results in a depth image containing a number of distinct three-dimensional regions or "blobs." Each resultant blob in the working image is formed of a plurality of image pixels having x, y, and z coordinates defining the spatial location of each pixel within the three-dimensional space representing the scene. The subtraction process typically results in a number of distinct blobs for several reasons. First, featureless or textureless regions within the area of interest do not typically provide good depth data when using stereo cameras. These regions are typically broken up or eliminated in the subtraction process. Consequently, a uniformly lit person wearing relatively smooth solid color clothing such as a jacket or shirt would tend to be represented in the working image as a number of separated blobs. Further, noise in either the baseline or live depth images may cause people or objects to partially blend into the background. As a result, people or objects again tend to break up into a number of separated blobs in the working image. In addition, image noise or distortion, or extraneous objects not of interest, may create spurious blobs that also become part of the working image.
Processing of the working image involves identifying which of the blobs belong to the same person or object of interest so as to accurately identify and locate that person or object within the area of interest. A "clustering" process is used to roughly identify each set of blobs in the working image that may belong to a particular person or object of interest. An analysis of the blob clusters produced by the clustering process is used to identify the clusters of blobs that most accurately represent the people or objects of interest by determining the closest match or matches to a model representing the people or objects of interest. The model is a shape such as an ellipsoid having the approximate dimensions of the person or object of interest. In addition, blob clusters may be compared to any number of different models representing people or objects of different shapes and sizes.
One method for determining the closest match between a cluster of blobs and a model is to compare every possible cluster of blobs to the model. However, as the number of blobs increases, a corresponding exponential increase in the number of candidate blob combinations reduces the performance of this method. Further, with this method, some candidate blob clusters are either too small or too large to compare favorably to the model, and such comparisons tend to waste both time and computing power.
A more preferred method for generating candidate clusters of blobs is to connect all blobs based on a minimum spanning tree. To this end, all the blobs are connected together via the shortest total length of lines. The length of a connection between any two blobs may be determined using any consistent method for determining distance between the blobs. For example, one such method would compute the distance between the centroids of connected blobs. Another exemplary method is to connect two blobs with a line segment or link between the centroids of the blobs, then to compute the length of the portion of the line segment between the point where the line leaves the first blob and the point where it enters the second blob. Still another exemplary method is to compute the distance between the nearest pair of points on two neighboring blobs.
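As an illustration of the spanning-tree construction, the following sketch connects blobs using the centroid-to-centroid distance, which is the first of the distance measures mentioned above. Prim's algorithm is used here as one standard way of building a minimum spanning tree; the description does not name a particular algorithm, and the function name `minimum_spanning_tree` and the `(length, i, j)` link representation are assumptions made for this example.

```python
import math


def minimum_spanning_tree(centroids):
    """Connect blobs with the shortest total link length (Prim's algorithm).

    `centroids` is a list of (x, y, z) blob centroids. The result is a list
    of links (length, i, j), where i and j index into `centroids`.
    """
    n = len(centroids)
    if n == 0:
        return []
    in_tree = {0}  # start growing the tree from the first blob
    links = []
    while len(in_tree) < n:
        # Shortest link joining a blob inside the tree to one outside it.
        best = min(
            (math.dist(centroids[i], centroids[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        links.append(best)
        in_tree.add(best[2])
    return links


centroids = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 1, 0)]
links = minimum_spanning_tree(centroids)
```

This quadratic search is adequate for the small number of blobs in a typical working image; a different distance measure can be substituted for `math.dist` without changing the rest of the sketch.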
The minimum spanning tree method provides a starting point for ensuring that blobs which are physically close together are used to generate candidate clusters, while blobs which are further apart are not. Specifically, once all blobs have been connected, connection links that exceed a prescribed threshold distance are eliminated. Elimination of these longest links serves to eliminate some or all links to spurious blobs, and to reduce the number of blob clusters likely to be identified as invalid based on the subsequent comparison to the model. Of the remaining n links, candidate clusters of blobs are then generated by eliminating all possible combinations of the longest m links, where m is an integer value between 0 and n. In other words, every possible combination of connected blobs, produced by every possible combination of elimination of the longest m links, serves to generate an initial group of candidate blob clusters.
For example, where elimination of links exceeding the aforementioned threshold distance leaves at least m links, subsequent elimination of all possible combinations of the longest m links will generate a set of 2^m groups of candidate blob clusters, with each group comprising a number of distinct clusters of blobs. As the value of m is increased, the number of possible groups of blob clusters increases exponentially. Consequently, a larger sample of candidate clusters is generated as m is increased, thereby improving system accuracy, but decreasing system speed by requiring a larger number of comparisons of candidate blob clusters to the model.
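The link elimination procedure can be sketched as follows, under the assumption that links are given as `(length, i, j)` tuples over blobs numbered from 0. The function names `connected_clusters` and `candidate_groups` and the parameter `max_length` (standing in for the prescribed threshold distance) are hypothetical. Each subset of the longest m surviving links is removed in turn, and the blobs that remain connected form the clusters of one candidate group.

```python
from itertools import combinations


def connected_clusters(n_blobs, links):
    """Group blob indices into clusters joined by the given links."""
    parent = list(range(n_blobs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, i, j in links:
        parent[find(i)] = find(j)
    groups = {}
    for blob in range(n_blobs):
        groups.setdefault(find(blob), []).append(blob)
    return sorted(groups.values())


def candidate_groups(n_blobs, links, max_length, m):
    """Return one group of clusters per subset of the longest m links."""
    kept = sorted(link for link in links if link[0] <= max_length)
    m = min(m, len(kept))
    shortest, longest = kept[:len(kept) - m], kept[len(kept) - m:]
    groups = []
    for k in range(m + 1):
        for removed in combinations(longest, k):
            remaining = shortest + [l for l in longest if l not in removed]
            groups.append(connected_clusters(n_blobs, remaining))
    return groups


# Spanning-tree links for four blobs; the 4.0 link exceeds the threshold.
tree_links = [(1.0, 0, 1), (1.0, 1, 2), (4.0, 2, 3)]
groups = candidate_groups(4, tree_links, max_length=3.0, m=2)
```

With m = 2 the sketch yields 2^2 = 4 groups, ranging from the group in which blobs 0, 1, and 2 form a single cluster to the group in which every blob stands alone.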
In each group, any cluster of blobs having an area that is not within a prescribed size range is discarded. This prescribed size range is a function of the size of the model. In other words, any cluster of blobs having an area that is obviously too small or too large to correspond to the person or object of interest is discarded. The area of a cluster of blobs is preferably approximated by summing the area of each image pixel making up the blobs in that cluster.
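A minimal sketch of this size test is given below. It assumes that the area of each blob has already been obtained by summing the areas of its pixels, and that the prescribed size range is supplied as explicit bounds; the names `filter_by_area`, `blob_areas`, `min_area`, and `max_area`, and the numeric values used, are illustrative assumptions only.

```python
def filter_by_area(clusters, blob_areas, min_area, max_area):
    """Keep only clusters whose total area lies within [min_area, max_area].

    `clusters` is a list of lists of blob indices, and `blob_areas[b]` is the
    summed pixel area of blob b (assumed to be precomputed).
    """
    return [
        cluster
        for cluster in clusters
        if min_area <= sum(blob_areas[b] for b in cluster) <= max_area
    ]


blob_areas = [0.2, 0.3, 0.25, 0.01]  # hypothetical areas in square meters
kept = filter_by_area([[0, 1, 2], [3]], blob_areas, min_area=0.5, max_area=1.5)
```

Here the single small blob is discarded as too small to correspond to the model, while the three-blob cluster is retained for comparison.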
Each of the remaining candidate blob clusters is then compared to a model to determine whether it corresponds to a person or object of interest. Comparison of a candidate blob cluster to the model is accomplished by first computing the three-dimensional (x, y, z) mean or centroid of each blob cluster. The spatial coordinates of this centroid are then subtracted from the spatial coordinates of each pixel in the blob cluster to center the spatial location of the blob cluster. The covariance matrix for that cluster's constituent centered image pixels is then computed. Next, the first two eigenvalues of the covariance matrix are used to define an ellipsoid to represent the candidate blob cluster. Specifically, the first eigenvalue provides the half-length of the major axis of the ellipsoid, while the second eigenvalue provides the half-length of the next longest axis of the ellipsoid. The model of the person or object of interest is defined by the expected values of these two eigenvalues. Consequently, a comparison is made of the eigenvalues of each group of candidate clusters and the expected values associated with the model to determine which group of candidate clusters is closest to the model. The group of clusters having the smallest deviation to the model is chosen as best representing persons or objects of interest contained in the working image. The spatial location of each of the chosen blob clusters in the working image corresponds to the spatial location of the persons or objects of interest in the live depth image.
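The comparison of a single candidate cluster to the model can be sketched as follows. The centroid, centering, and covariance steps follow the description; the eigenvalues are obtained with the standard closed-form trigonometric solution for symmetric 3x3 matrices so that the example needs only the standard library. The deviation measure (Euclidean distance between the two largest eigenvalues and their expected values) is an assumption, since the description does not specify how deviation is computed, and all function names are hypothetical.

```python
import math


def covariance_3d(points):
    """Return the centroid and 3x3 covariance matrix of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(3)]
           for a in range(3)]
    return mean, cov


def eigenvalues_sym3(a):
    """Eigenvalues of a symmetric 3x3 matrix, largest first."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def deviation_from_model(points, expected):
    """Assumed measure: distance between the top two eigenvalues and the model."""
    _, cov = covariance_3d(points)
    e1, e2, _ = eigenvalues_sym3(cov)
    return math.hypot(e1 - expected[0], e2 - expected[1])


cluster = [(12, 5, 3), (8, 5, 3), (10, 6, 3), (10, 4, 3)]
centroid, cov = covariance_3d(cluster)
score = deviation_from_model(cluster, expected=(2.0, 0.5))
```

Under this assumed measure, the score of a group would be obtained by combining the scores of its clusters, and the group with the smallest combined score would be selected; the centroids of its clusters then give the locations of the people or objects.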
Sequential live depth images may be captured and analyzed using the methods described above to provide for continuous identification and location of people or objects as a function of time.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.