1. Technical Field
The invention is related to a system and process for locating and tracking people and non-stationary objects of interest in a scene, and more particularly, to such a system and process that employs a series of range images of the scene taken over time.
2. Background Art
Most current systems for determining the presence of persons or objects of interest in an image of a scene have involved the use of a sequence of pixel intensity-based images or intensity images for short. For example, a temporal sequence of color images of a scene is often employed for this purpose [1].
Persons or objects are typically recognized and tracked in these systems based on motion detected by one of three methods, namely by background subtraction [2], by adaptive template correlation, or by tracking color contour models [3, 4].
While the aforementioned locating methods are useful, they do have limitations. For example, the use of intensity images results in the presence of background "clutter" that significantly affects the reliability and robustness of these techniques. In addition, the adaptive templates employed in the adaptive template correlation techniques tend to drift as they pick up strong edges or other features from the background, and color contour tracking techniques are susceptible to degradation by intensity gradients in the background near the contour. Further, the image differencing methods typically used in the foregoing techniques are sensitive to shadows, changes in lighting conditions or camera gain, and micro-motions between images. As a result, discrimination of foreground from background is difficult.
More recently, the use of sequential range images of the scene has been introduced into systems for locating persons and objects, and for tracking their movements on a real time basis [5, 6, 7]. In general, the advantage of using range images over intensity images is that the range information can be used to discriminate the three-dimensional shape of objects, which can be useful in both locating and tracking. For example, occluding surfaces can be found and dealt with as the tracked object moves behind them. Recognizing objects is also easier, since the actual size of the object, rather than its image size, can be used for matching. Further, tracking using range information presents fewer problems for segmentation, since range information is relatively unaffected by lighting conditions or extraneous motion.
While the locating and tracking systems employing range information can provide superior performance in comparison to systems employing only intensity images, there is still considerable room for improvement. For example, the aforementioned systems use range information typically for background subtraction purposes, but rely mostly on intensity image information to locate individual people or objects in the scene being analyzed. This can result in poor discriminatory ability when two people or objects are close together in the scene.
The system and process according to the present invention resolve the deficiencies of current locating and tracking systems employing range information.
It is noted that in the preceding paragraphs, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references are identified by a pair of brackets containing more than one designator, for example, [5, 6, 7]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention involves a technique for locating and tracking people and non-stationary objects of interest in a scene using a series of range images of the scene taken over time. In regard to locating people and objects, the technique generally entails first generating the series of range images. Preferably, the series of range images is a continuous temporal sequence of depth maps of the scene, such as might be captured using a video-rate stereo imaging system or a laser range finder system. A background model is computed from a block of these range images. In general, this entails identifying pixel locations in the block of range images that have reliable depth values.
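By way of illustration only, the background model computation might be sketched as follows. The reliability test used here (a pixel is reliable when most images in the block report a depth and those depths vary little) is an assumption made for the sketch, as are the names compute_background_model, INVALID, min_valid_fraction and max_std; the text above states only that pixel locations having reliable depth values are identified.

```python
from statistics import mean, pstdev

INVALID = 0.0  # assumed marker for "no depth measured" at a pixel


def compute_background_model(block, min_valid_fraction=0.8, max_std=0.05):
    """Return a per-pixel background depth, or None where depth is unreliable.

    block: list of range images, each a list of rows of depth values.
    The reliability criterion is a hypothetical choice for this sketch.
    """
    rows, cols = len(block[0]), len(block[0][0])
    needed = max(1, min_valid_fraction * len(block))
    model = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the valid depth samples seen at this pixel location.
            samples = [img[r][c] for img in block if img[r][c] != INVALID]
            if len(samples) >= needed and pstdev(samples) <= max_std:
                model[r][c] = mean(samples)
    return model
```

In this sketch a pixel that rarely returns a depth, or whose depth fluctuates across the block, is simply left out of the model.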
Once the background model has been computed, a range image generated subsequent to the aforementioned block of range images is selected for processing. Preferably, this entails selecting the very next range image generated following the last image of the block used to compute the background model. The background is subtracted from this currently selected range image based on the background model to produce a foreground image. Generally, this involves identifying those pixels representing non-static portions of the scene depicted in the selected range image based on the background model. These "non-static" pixels are collectively designated as the foreground image.
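A minimal sketch of such a background subtraction is given below. It assumes the background model is a grid holding a depth value at reliable pixel locations and None elsewhere, and that a pixel is non-static when its depth differs from the model by more than a threshold. The treatment of pixels lacking a reliable background depth, and the names subtract_background, INVALID and min_diff, are assumptions of the sketch rather than details taken from the description.

```python
INVALID = 0.0  # assumed marker for "no depth measured" at a pixel


def subtract_background(image, model, min_diff=0.2):
    """Return a boolean mask that is True at non-static (foreground) pixels."""
    mask = []
    for img_row, bg_row in zip(image, model):
        mask_row = []
        for depth, bg in zip(img_row, bg_row):
            if depth == INVALID:
                # No measurement in the current image: cannot be foreground.
                mask_row.append(False)
            elif bg is None:
                # Assumption: a valid depth where the background had none
                # is treated as foreground.
                mask_row.append(True)
            else:
                mask_row.append(abs(bg - depth) > min_diff)
        mask.append(mask_row)
    return mask
```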
At this point, an optional procedure can be employed to connect regions associated with the same person or object that may have become separated by gaps in the preceding background subtraction. To accomplish this, a standard morphological growing and shrinking technique can be implemented. Essentially, this involves using the technique to first grow the foreground image, and then shrink it, in such a way that pixels in the gaps between related regions are added to the foreground image when pixels in the vicinity of the gap exhibit similar depth values. This connects the regions. If, however, the pixels in the vicinity of the gap do not exhibit similar depth values, this is an indication they belong to a different person or object. In that case, the pixels in the gap are not added to the foreground image and the regions remain separated.
The foreground image is next segmented into regions, each of which represents a different person or object of interest in the scene captured by the currently selected range image. This is essentially accomplished by identifying regions in the foreground image made up of pixels exhibiting smoothly varying depth values. In addition, any region having an actual area too small to represent a person or object of interest is eliminated from further consideration as foreground pixels.
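One way to picture the segmentation is as a connected-components pass in which two adjacent foreground pixels join the same region only when their depths differ by a small amount. The sketch below makes several assumptions: 4-connectivity, a fixed depth step max_step as the test for "smoothly varying", and a pixel count min_pixels standing in for the actual (physical) area test described above. The function name segment_foreground is likewise hypothetical.

```python
from collections import deque


def segment_foreground(depth, mask, max_step=0.1, min_pixels=2):
    """Group foreground pixels into regions of smoothly varying depth.

    Returns a list of regions, each a list of (row, col) pixel positions.
    """
    rows, cols = len(depth), len(depth[0])
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen.add((r, c))
            queue = deque([(r, c)])
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx]
                            and (ny, nx) not in seen
                            and abs(depth[ny][nx] - depth[y][x]) <= max_step):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            # Discard regions too small to be a person or object of interest.
            if len(region) >= min_pixels:
                regions.append(region)
    return regions
```

A faithful area test would convert each region's pixel extent to a physical size using its depth; the pixel count is used here only to keep the sketch short.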
If it is not only desired to locate a person or object in the scene, but to determine their identity as well, the following optional procedure can be adopted. This optional procedure determines the identity of the person or object associated with each segmented region in the foreground image by capturing an intensity image of the scene simultaneously with the generation of the aforementioned currently selected range image. Each region of the intensity image that corresponds to a segmented region in the foreground image can then be identified and used to determine the identity of the person or object represented by that region. It is noted that while the optional identification process can be performed immediately after the foreground image segmentation procedure, it can be even more advantageous to wait until after an optional ground plane segmentation procedure that will be described shortly. In either case, the identification process generally entails first characterizing the identified region in a way similar to a series of previously stored intensity images of known persons and objects. For example, the identified region and stored images might be characterized via a color histogram technique. The characterized region is compared to each of the stored characterizations, and the degree of similarity between each of them is assessed. If the degree of similarity between the identified region and one of the stored characterizations exceeds a prescribed level, the person or object represented by the identified region is designated to be the person or object associated with that stored characterization.
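The color histogram comparison mentioned above might be sketched as follows. The description does not specify a histogram layout or similarity measure, so the joint RGB histogram, the histogram intersection score, the selection of the best-scoring stored characterization, and the names color_histogram, histogram_intersection, identify and min_similarity are all assumptions made for illustration. The sketch also assumes each region contains at least one pixel.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels are (r, g, b) with 0..255 channels."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = len(pixels)
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def identify(region_pixels, known, min_similarity=0.7):
    """Return the name of the best-matching stored histogram, or None."""
    query = color_histogram(region_pixels)
    best_name, best_score = None, 0.0
    for name, hist in known.items():
        score = histogram_intersection(query, hist)
        if score > best_score:
            best_name, best_score = name, score
    # Accept the match only if it exceeds the prescribed similarity level.
    return best_name if best_score > min_similarity else None
```

For example, a region whose pixels are three-quarters red scores 0.75 against a stored all-red characterization and is accepted at the default threshold of 0.7.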
Regardless of whether the segmented foreground image is used to assist in the identification of the people and objects in the scene being analyzed, the locating process continues by projecting the segmented regions of the foreground image onto a ground plane of the scene. This generally involves first computing the bounds of the ground plane for the scene depicted in the currently selected range image. It is noted that the computation of the ground plane boundary need only be performed once, and can be used unchanged with each subsequent range image. Next, the vertical, horizontal and depth coordinates of each pixel in each segmented region are identified and adjusted to compensate for any camera roll and pitch. The pixels are then projected onto the ground plane.
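The roll and pitch compensation and the subsequent projection can be illustrated with a short sketch. It assumes a camera frame with x horizontal, y vertical and z along the optical axis, treats roll as a rotation about z and pitch as a rotation about x, and undoes each by rotating through the negative angle. These axis and sign conventions, and the name to_ground_plane, are assumptions; an actual system would use whatever conventions its calibration defines.

```python
import math


def to_ground_plane(x, y, z, roll, pitch):
    """Level a camera-frame point and drop its height.

    Returns (ground_x, ground_z). Angles are in radians.
    """
    # Undo roll: rotate about the optical (z) axis by -roll.
    xr = x * math.cos(roll) + y * math.sin(roll)
    yr = -x * math.sin(roll) + y * math.cos(roll)
    # Undo pitch: rotate about the horizontal (x) axis by -pitch.
    # Only the levelled depth is needed; the levelled height is discarded.
    zp = -yr * math.sin(pitch) + z * math.cos(pitch)
    return xr, zp
```

With both angles zero the function simply returns the horizontal and depth coordinates unchanged.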
Ground plane coordinates determined in the projection procedure can be used to designate the location of each separate person or object of interest in the scene captured by the currently selected range image. Typically, the mean location of all pixels contributing to a given person or object, projected onto the ground plane, is used to specify this location. However, it is preferred that an optional ground plane segmentation refinement technique be employed first to ensure each projected region represents only a single person or object. This is essentially accomplished by cordoning off the projected foreground image so as to divide each projected region into a series of cells. One or more peak cells are identified in each projected region. This is done by ascertaining which cells contain the greatest number of pixels in comparison to neighboring cells within a prescribed radius, as well as having a pixel density that exceeds a prescribed threshold. The threshold is indicative of the pixel density expected in a cell containing pixels representing a person or object of interest. For each peak cell identified, the regions contributing pixels to any neighboring cell within a prescribed radius of the peak cell are conglomerated with the peak cell. If any of the regions previously defined in the foreground image segmentation procedure have contributed pixels to more than one of the computed conglomerations, then it is likely there are two or more people or objects associated with that region. Accordingly, the region should be divided. The division is accomplished by reassigning the pixels in the aforementioned region to one or more of the computed conglomerations, depending on how many of the conglomerations the region contributed pixels to in the previous conglomeration process.
Preferably, this reassignment is done by determining which peak cell is closest to each pixel of the region under consideration, and assigning it to the conglomeration associated with that peak cell. The newly defined conglomerations represent the ground plane segmented regions, each of which should now be associated with only one person or object.
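The peak cell identification and nearest-peak reassignment can be sketched as follows. This sketch covers only those two steps and omits the bookkeeping that tracks which foreground regions contributed to which conglomeration. The square neighborhood used in place of a circular radius, the index-based tie-break between equally populated cells, the use of cell centers when measuring distance, and the names find_peaks, assign_to_peaks, cell_size, radius and min_count are assumptions of the sketch.

```python
import math
from collections import Counter


def find_peaks(points, cell_size=0.25, radius=2, min_count=3):
    """Return sorted (i, j) indices of peak cells for ground-plane points (x, z)."""
    counts = Counter(
        (math.floor(x / cell_size), math.floor(z / cell_size)) for x, z in points
    )
    peaks = []
    for (i, j), n in counts.items():
        if n < min_count:
            continue  # too sparse to represent a person or object
        neighbours = [
            (i + di, j + dj)
            for di in range(-radius, radius + 1)
            for dj in range(-radius, radius + 1)
            if (di, dj) != (0, 0)
        ]
        # A peak holds at least as many pixels as every neighbouring cell;
        # exact ties are broken by cell index so only one cell wins.
        if all((n, (i, j)) >= (counts.get(o, 0), o) for o in neighbours):
            peaks.append((i, j))
    return sorted(peaks)


def assign_to_peaks(points, peaks, cell_size=0.25):
    """Return, for each point, the index of the closest peak cell."""
    centres = [((i + 0.5) * cell_size, (j + 0.5) * cell_size) for i, j in peaks]
    return [
        min(range(len(peaks)), key=lambda k: math.dist(p, centres[k]))
        for p in points
    ]
```

Applied to two well-separated clusters of projected pixels, find_peaks yields one peak cell per cluster and assign_to_peaks labels each pixel with the nearer of the two.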
As mentioned previously, the optional identification procedure is preferably performed at this point whenever the ground plane segmentation is employed to redefine the segmented regions. It is believed a more accurate identification of the people and objects represented by the redefined regions can be obtained by waiting until the ground plane segmentation is complete. Essentially, the identification process is the same as described above, except that the segmented ground plane regions are first projected back into the image plane of the foreground image, via conventional methods, before being used for identification purposes.
Once the location of each person or object of interest has been established, they can be tracked by analyzing subsequently generated range images. In simple terms, a range image generated subsequent to the previously selected range image is selected and designated as the currently selected range image. The foregoing location technique is then repeated beginning with the background subtraction procedure. Preferably, the newly selected range image is the image generated immediately following the previously selected range image. The tracking process continues by selecting the next range image generated, and the next, and so on for as long as it is desired to monitor the location of a person or object in the scene.
It is noted that the above tracking procedure uses the same background model to analyze each subsequently generated range image. However, a new background model can also be computed for each new range image analyzed if desired. This is accomplished by re-computing the background model from a block of range images made up of a prescribed number of the images generated immediately preceding the currently selected range image. The rest of the process remains the same. Alternatively, other conventional background-image adaptation schemes may be used to update the background model in an ongoing fashion.
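The variant in which the background model is re-computed for each new range image can be pictured as a sliding window over the image sequence. In the sketch below, build_model and locate are hypothetical placeholders for the background model computation and the locating procedure described above, and block_size stands for the prescribed number of preceding images; none of these names come from the description.

```python
from collections import deque


def track(frames, block_size, build_model, locate):
    """Locate objects in each frame using a model built from the preceding block.

    build_model(block) and locate(frame, model) are caller-supplied callables.
    """
    history = deque(maxlen=block_size)  # the most recent block_size frames
    results = []
    for frame in frames:
        if len(history) == block_size:
            # Re-compute the model from the images immediately preceding
            # the currently selected image, then run the locating step.
            model = build_model(list(history))
            results.append(locate(frame, model))
        history.append(frame)
    return results
```

To use a single fixed background model instead, the model would be built once from the initial block and passed unchanged to every subsequent locating step.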