The presently disclosed embodiments are related to image processing and more particularly to video image processing to track the movement of objects. Embodiments provide improved tracking, especially where the tracked object or objects of interest pause in their movements.
In video analytics applications, it is often desirable to track individual objects as they move through the field of view of the camera. Initially, it is necessary to identify new objects of interest for tracking. A number of methods exist for performing this type of object detection on video frames.
For example, some approaches rely on the use of a model of an image background to help identify new objects in a scene by a process of elimination. However, approaches that use background models (e.g., Gaussian mixture models) of the scene require an initialization period, for instance, to capture the scene without objects of interest and under various lighting and other conditions. The learned background models must also be updated after the initial training period in order to incorporate slow changes in the background scene over time. Additionally, background model-based approaches can require more computational resources than simpler methods.
Frame-to-frame (or temporal) differencing methods use motion information to detect moving objects. Here, large changes in pixel values from one frame to the next are indicative of moving foreground objects of interest.
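The frame-differencing idea above can be sketched in a few lines. This is a minimal illustration, not the disclosed embodiment: frames are assumed to be grayscale 2-D lists of pixel values, and the threshold of 25 is an arbitrary illustrative choice.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark a pixel as foreground (1) when its grayscale value changes
    by more than `threshold` between consecutive frames; else 0."""
    return [
        [1 if abs(curr - prev) > threshold else 0
         for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]
```

Note that a stationary object produces no frame-to-frame change and therefore yields an all-zero mask, which is precisely the stop/start weakness discussed below.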
Both model-based and frame-differencing methods for object detection can struggle in scenarios in which the objects of interest undergo frequent stop/start actions within the field of view. These techniques rely at least in part on movement to differentiate between background objects and objects of interest (foreground objects). During pauses in movement, the now non-moving objects tend to fade into the "background" from the perspective of these adaptive classifiers. Accordingly, stop and restart events can fool object detection procedures into generating erroneous detections. This can lead to increased computational load through unnecessary processing of additional tracking points. Moreover, in applications where tracking individual objects is critical to automated measurements, these false detections can lead to erroneous measurements. For instance, in a side-by-side drive-thru application where video-based analytics might be used for automated vehicle sequencing to keep track of the relative position of vehicles to aid the delivery of the appropriate drive-thru order, these false detections can lead to reductions in sequencing accuracy.
For each image frame in the input video stream, a foreground pixel mask (or map identifying regions of interest in the image as compared to background portions) is generated. As indicated above, this can be accomplished using a number of methods including, for example, Gaussian mixture model (GMM) based background modeling or frame-to-frame differencing. Background model-based methods such as GMMs are popular because they tend to give robust performance despite extraneous motion within the scene (e.g., tree leaves shaking in the wind). However, they tend to be computationally more expensive than frame-differencing (FD) methods and require an initialization period.
For applications in which the video analysis system might have to function from a “cold start,” this initialization requirement can prove difficult. Foreground pixel detection methods based on FD look for moving objects by thresholding differences in pixel values between successive frames in the video sequence. Although they require less computation, FD methods are more susceptible to extraneous motion in the scene and cannot detect stationary objects as foreground.
Both model-based and FD approaches to pixel level foreground detection can struggle with objects that undergo numerous stop/start events within the scene. For FD methods this is readily apparent since the approach relies on detecting change/motion. For model-based methods, stop/start events can also prove troublesome since a stationary object will actually be slowly learned into the background model. As such, one of the difficult design decisions for a model-based object detection method is the choice of the learning rate—too fast and objects blend quickly into the background model; too slow and changes in illumination and slow background movement (e.g., tree leaves) are not well handled by the model.
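The learning-rate trade-off described above can be made concrete with a simplified stand-in for a GMM: an exponential running-average background model. This is an illustrative sketch only; the function names and parameter values are hypothetical, and a real GMM maintains multiple per-pixel distributions rather than a single mean.

```python
def update_background(background, frame, alpha=0.01):
    """Exponential running-average background update; `alpha` is the
    learning rate. A large alpha absorbs stationary objects quickly."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]

def foreground_mask(background, frame, threshold=25):
    """Pixels that differ from the background model by more than
    `threshold` are classified as foreground (1)."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]
```

Running the update repeatedly on an unchanging frame shows the failure mode: a vehicle that pauses is "learned into" the background and its foreground pixels vanish, and the larger alpha is, the faster this happens.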
Given these issues, one challenge for both approaches is that the foreground pixel mask or blob associated with an object undergoing a stop/start cycle can split into several pieces, such as a first segment 110 and a second segment 114, as illustrated in FIG. 1.
Pixels or regions within an image are classified as "foreground" when they are, or are expected to be, components of objects or areas of interest. Foreground pixels can be identified in a number of ways, including background subtraction, wherein foreground regions are defined as those which differ significantly from the background model, and motion detection. Motion detection approaches typically use either temporal (frame-to-frame) differencing or optical flow methods to identify moving objects in the image frame. The moving objects are considered foreground.
As indicated above, background pixels or regions are, by definition, not foreground. Background may also be defined as typical image content for the scene which remains relatively static (changes slowly) over time. One method for determining background is to build a background model of the scene when there are no objects of interest present. This model then accounts for what is typical of non-foreground content. Note that this model can still account for variability such as the motion of trees in the wind.
A blob as mentioned above and used herein can be, for example, a region of connected components. Connected components are sets of pixels, or regions, in an image all having the same or similar (depending on the application) pixel values and wherein the pixels in the set are connected to one another. Here connection refers to the ability to traverse from any pixel in the set to any other pixel in the set via neighbor pixels that are also contained in the set. Neighbor pixels are adjacent pixels to a given pixel.
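The connected-component definition above can be illustrated with a standard breadth-first flood fill over a binary mask, using 4-connectivity (traversal through up/down/left/right neighbors). This is a generic sketch of the well-known technique, not the specific implementation of any embodiment.

```python
from collections import deque

def connected_components(mask):
    """Group 4-connected foreground pixels (value 1) of a binary mask
    into blobs; returns a list of pixel-coordinate sets, one per blob."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and (r, c) not in seen:
                blob, queue = set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    blob.add((y, x))
                    # Visit the four adjacent neighbor pixels.
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs
```

A single vehicle whose mask has split into two disconnected pixel groups would be returned here as two blobs, which is exactly the over-segmentation problem at issue.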
The over-segmentation of image objects caused by the reaction of object tracking video processing techniques to pauses in object motion can affect the subsequent detection of new objects based on the foreground pixel mask. More specifically, with reference to FIG. 1, a first set 118 and a second set 122 of marked tracking features and tracker centroids (large squares 126, 130) are identified for a vehicle 134. These tracking features were grouped as separate trackers 118, 122 due to a segmentation of an object blob (see 210 of FIG. 2) associated with the vehicle 134 into the first 110 and second 114 segments, and the resulting classification of vehicle portions as separate vehicles or objects of interest. Such a misclassification can affect the subsequent measurement of vehicle sequence as the cars merge into a single lane, since the vehicle 134 will be interpreted as multiple merge events due to its two associated trackers 118, 122.
Tracking features or attributes are aspects of an image object that are used to follow the object as it moves through the scene. Many different kinds of features are used in video tracking, and the types of features chosen depend on the method used for tracking them from frame to frame in the video stream. Examples of common features include: scale invariant features such as scale invariant feature transform (SIFT) features or speeded up robust features (SURF); interest points such as Harris corner features; maximally stable extremal region (MSER) features; color or grayscale histograms; shape features of the target object blob such as area, orientation, aspect ratio, and contour; and template images of the objects of interest or one or more sub-regions of the object of interest.
An object tracker is a set of tracking features associated with an object of interest, together with a corresponding method for following those features from frame to frame in a video stream. Based on the location of the features for the current frame, the tracker typically also maintains an overall estimate of the pixel location of the object in the image frame. This can be accomplished, for instance, with a centroid of the positions of the individual feature points. Tracking points are pixel locations of individual tracked features for a given tracker.
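The centroid-based location estimate mentioned above is simply the mean of the tracking point coordinates. A minimal sketch, assuming tracking points are given as (x, y) pixel tuples:

```python
def tracker_centroid(points):
    """Estimate a tracker's object position as the centroid (mean x,
    mean y) of its tracked feature point locations."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)
```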
A method to address the "over-segmentation" of moving foreground blobs in object detection is to apply a threshold on the number of sightings before a new object is considered "real." Typically, this thresholding would be performed within a region of interest in the scene where new objects tend to appear. For instance, in an application for monitoring highway traffic, the lanes of interest tend to be unidirectional; objects enter the scene near one edge of the frame (e.g., the bottom) and exit near the opposite edge. Thus, the region of interest for detecting new objects is fixed. Likewise, in a drive-thru monitoring scenario, the order entry points can be fixed regions of the scene where new vehicle detections can be expected. By looking for the occurrence of foreground blobs in these regions, new trackers can be assigned only when blobs are detected in the region of interest a minimum number of times.
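The sighting-count threshold just described can be sketched as follows. The function name, the rectangular ROI representation, and the default of three sightings are all hypothetical choices for illustration; detections are assumed to arrive as per-frame lists of blob centroids.

```python
def confirm_new_object(detections_per_frame, roi, min_sightings=3):
    """Count frames in which a foreground blob centroid falls inside the
    region of interest `roi` = (x0, y0, x1, y1). Return the index of the
    frame at which the count reaches `min_sightings` (i.e., when a new
    tracker would be assigned), or None if the threshold is never met."""
    x0, y0, x1, y1 = roi
    count = 0
    for frame_idx, centroids in enumerate(detections_per_frame):
        if any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in centroids):
            count += 1
            if count == min_sightings:
                return frame_idx  # confirmed: assign a new tracker here
    return None  # transient detections never confirmed
```

Requiring multiple sightings suppresses one-off spurious blobs, but as the following paragraph notes, it is not by itself sufficient when stop/start events repeatedly re-trigger detections in the region of interest.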
However, for scenarios like a tightly packed drive-thru, the objects of interest (e.g., people, vehicles) can undergo many stop/start events and can move at highly variable speeds through the scene. This presents a challenge to existing methods for object detection due to the occurrence of many pause-in-motion generated "over-segmentations." Existing threshold-of-occurrence-based methods do not provide sufficient accuracy for such high-stress object tracking applications.