The tracking of an object may provide information for a variety of different applications. One particular environment in which object tracking is performed is a multi-target environment. For example, a sports event may utilize tracking in which the objects may be people (e.g., athletes, referees, etc.), a ball, etc. In another example, a town center region may utilize tracking in which the objects may be people. However, tracking multiple targets interacting in close proximity is a difficult problem due to each object not always being visible or isolated. That is, the objects may be occluded or occlude one another from object detectors.
A variety of approaches have been used for tracking these objects in the multi-target environment. A first type of tracking includes using only one type of information. Specifically, this first type (tracking-by-detection) uses complete or full object detectors. Those skilled in the art will understand that full-object detectors have an increased number of criteria to determine whether an object has been detected. Accordingly, full-object detectors may result in an increased number of missed detections, particularly in a multi-target environment. Furthermore, although the entirety of an object may not be occluded, the parameters used to determine an object may prevent the full object detectors from tracking the object, thereby further increasing the number of missed detections.
To address this drawback, a second type of tracking also uses only one type of information. Specifically, this second type tracks objects by detecting (and possibly tracking) portions of an object to overcome the above-noted difficulties associated with full or complete object detections. Parts of the object (e.g., the head of a person) may remain visible more often than the full body. With less restrictive parameters for determining a portion of an object, a detection may still result. However, because parts are smaller and often less discriminative, the number of false detections increases. A part-object detector may have settings such that a generally spherical object having a size within a target range corresponds to a head of a person. However, other objects may also satisfy these criteria. For example, a basketball is spherical and may be within the target range.
Further approaches have been developed in which one type of information is used to support another type of information in tracking an object. A third type of tracking thus includes using part-based information. Specifically, tracking-by-detection approaches use complete or full object detectors and attempt to interpolate through missing detections using motion priors. With part-based information (e.g., head detections), a tracking algorithm may ensure that any hypothesized association between complete object detections (which may span implied missed detections) is at least supported by partial object information. Alternatively, if multiple parts are detected in each frame, the tracking algorithm may determine whether there is enough partial evidence to justify a full object detection. For example, the object may still be sufficiently occluded to prevent the complete object detector from returning a positive determination, but the part-based information may corroborate that the complete object is indeed present. It should be noted that the partial information may also be used in the opposite direction. For example, false detections from part-based detection information may be removed as positive detections based upon full body detection information.
Despite the use of a further type of information potentially overcoming some drawbacks, in both of the above described situations, partial object information may only be evaluated on a per-frame basis to justify either creating a missed complete object detection or supporting the association of two temporally separated detections which imply one or more missed detections therebetween. As a result, even though multiple types or modes of information may be used in the tracking process, greedy decisions are made early so that the data association step only needs to deal with a single homogeneous type of information, most typically full object detections. That is, even with a second type of information being used, ultimately only a single type of information is utilized in tracking an object.
FIG. 2A shows a first set of exemplary body detections 200. Using this manner of determining body detections, the body detections 200 include positive detections 205, missed detections 210, and false detections 215. Specifically, the body detections 200 are in the ground plane in which a projection is made from a position of the feet of a person and a hallucinated head at a predetermined height, such as 1.5 meters above the feet in the image plane. In this particular scene, almost all desired objects have been detected, resulting in positive detections 205. However, at least two objects are not detected, resulting in the missed detections 210. Although unlikely when detecting the body due to the stricter criteria that must be met, an uninvolved person may also be detected, resulting in a false detection 215. As such, it is clear that even under more optimal circumstances, the single homogeneous type of information, utilizing body detections only, has its drawbacks.
FIG. 2B shows a second set of exemplary body detections 225. The same manner used in determining the body detections 200 in FIG. 2A may be used in determining the body detections 225. In this scene, the body detections 225 again may include the positive detections 205, the missed detections 210, and the false detections 215. However, this scene may represent sub-optimal circumstances to utilize the body detections 225. As can be seen in FIG. 2B, only one positive detection 205 is determined. All other objects are not detected, resulting in a high number of missed detections 210. As such, it is further evident that the single homogeneous type of information, utilizing body detections only, has further drawbacks dependent on the circumstances in which the detections are to be determined.
The above scenes using the body detections 200, 225 illustrate that this method struggles with occlusions and that approaches based on visual features have difficulty, particularly with complex body poses. The criteria in determining a body detection may be stricter than when determining, for example, a head detection. The criteria may include the body detection resulting only when the body is within a bounding box. The bounding box may be a substantially rectangular region that is predetermined based upon a person's body when standing upright. The first scene including the body detections 200 in FIG. 2A illustrates that the people are generally in an upright posture and separate from each other. Thus, there is a relatively high number of positive detections 205. However, the second scene including the body detections 225 in FIG. 2B illustrates that most of the people are not in the upright posture and are not separated from each other, thereby occluding one another. Thus, there is a relatively high number of missed detections 210.
In a more specific embodiment, the detection results that employ three-dimensional geometric primitives to find human-like foreground regions may be used. The bodies may be detected by finding cylinders with plausible width and height for a single person that when projected into the image may match the foreground silhouette. This detector may have high precision with reasonable recall. However, again, it is vulnerable to occlusion and different body postures. For example, if multiple athletes are in proximity of each other and make a single large foreground region, the algorithm may be unable to detect each body of the athletes. Similarly, if a person bends over such that the foreground region is not the same size as the standing person, the detector may again fail. These situations make achieving a high recall difficult.
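By way of a non-limiting illustration, the precision and recall referred to above may be computed from counts of positive, missed, and false detections as sketched below. The function name and the example counts are assumptions made for illustration only and are not taken from the figures.

```python
def precision_recall(num_positive, num_missed, num_false):
    """Return (precision, recall) from detection counts.

    precision = positive / (positive + false)   -> how many reported detections are correct
    recall    = positive / (positive + missed)  -> how many real objects were detected
    """
    reported = num_positive + num_false
    actual = num_positive + num_missed
    precision = num_positive / reported if reported else 0.0
    recall = num_positive / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 8 positive detections, 2 missed detections, 1 false detection.
precision, recall = precision_recall(8, 2, 1)
```

A detector with few false detections but many missed detections, as described for the occluded scene, would show high precision and low recall under this calculation.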
In view of the drawbacks associated with using only body detections and the probability of having an increased number of missed detections 210, further approaches focused on tracking particular body parts may also be used for tracking purposes. One example of another body part may be the head. The tracking of the other body parts may be based on visual features such as Histograms of Oriented Gradients (HOGs) or edgelets, which may be considered a series of connected pixels that form an edge of an object or portion of an object. Those skilled in the art will understand that part detectors tend to have increased detections but that these detections may have increased false detections.
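As a simplified sketch of the kind of visual feature mentioned above, the following computes a histogram of unsigned gradient orientations for a single grayscale cell. It is an assumption-laden illustration: it omits the block normalization and bin interpolation of a full HOG descriptor, and the function name, the nine-bin setting, and the example cell are hypothetical.

```python
import math


def cell_orientation_histogram(cell, num_bins=9):
    """Accumulate gradient magnitude into unsigned orientation bins (0 to 180 degrees).

    cell is a list of rows of grayscale intensities. Only interior pixels are
    used so that central differences are defined.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The modulo guards against an angle that rounds to exactly 180.0.
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist


# Hypothetical 4x4 cell containing a vertical edge (dark left, bright right).
edge_cell = [[0, 0, 10, 10] for _ in range(4)]
histogram = cell_orientation_histogram(edge_cell)
```

For the vertical edge above, all gradient energy falls into the first orientation bin, which is the kind of signature a part detector may match against a learned filter.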
FIG. 2C shows a first set of exemplary head detections 250. Using this manner of determining head detections, the head detections 250 include positive detections 255, missed detections 260, and false detections 265. Specifically, the head detections 250 may be determined based upon a bounding box. In a first example, the bounding box may be for only the head. In a second example, the bounding box may be for the head and the shoulders. This scene may be the same scene discussed above for the body detections 200 of FIG. 2A. However, when detecting heads, this scene may have an increased number of missed detections 260 due to the heads being cluttered by background (e.g., colors, shapes, etc.). Thus, this illustrates how, in some circumstances, using body detections may still be preferable over head detections. Furthermore, the increased false detections 265 may result simply from a reflection on the floor of an overhanging light source. As such, it is clear that the single homogeneous type of information, utilizing head detections only, has its drawbacks.
FIG. 2D shows a second set of exemplary head detections 275. Again, the head detections 275 may include positive detections 255, missed detections 260, and false detections 265. The manner discussed above with the head detections 250 using the bounding box may again be used. This scene may be the same scene discussed above for the body detections 225 of FIG. 2B. However, when detecting heads in this instance, there is an increased number of detections and positive detections 255. In contrast to the scene of FIG. 2C, there is little to no interference by the background such that the number of positive detections 255 increases. Furthermore, there are no missed detections 260 or false detections 265 in this scene. Therefore, this illustrates how using head detections, in other circumstances, may be far more preferable over body detections. As such, it is evident that the single homogeneous type of information is inconsistent and depends on a variety of factors to improve the number of detections.
In view of the above discussion, the selection of using the single homogeneous type of detection information as body detections or head detections is highly dependent upon the circumstances in which the detections are to be determined. In comparing a single scene with the body detections 200 of FIG. 2A and the head detections 250 of FIG. 2C, the type of detections may be substantially similar. However, in comparing a single scene with the body detections 225 of FIG. 2B and the head detections 275 of FIG. 2D, it is clear that the head detections are preferable when occlusion rates are high. Although the bodies may be occluded, the heads are still visible such that the number of positive detections is also increased. Nevertheless, there is still a greater likelihood of having missed and false detections when relying only on this single type of information throughout.
The body detection information and the head detection information that is generated from the methods described above may be used to generate trajectory information. More specifically, trellis graphs may be formulated from the body detection information and the head detection information. Generally, trellis graphs are graphs including nodes that are ordered into vertical slices. Each node at each time is connected to at least one node at an earlier time and at least one node at a later time. The first or start (S) time and the last or terminating (T) time in the trellis graph may have only one node. For simplicity, the first and last time frames are omitted from a sequential numbering scheme as will be used herein.
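A minimal sketch of how such a trellis graph might be represented is given below. It assumes that each detection is a ground-plane (x, y) position and that the cost of a transition is the Euclidean distance between detections; both are illustrative assumptions, as are the function name and the node labels "S" and "T".

```python
import math


def build_trellis(frames):
    """Build a trellis as a dictionary {node: {successor: cost}}.

    frames is a list of per-frame detection lists, each detection an (x, y)
    tuple. Interior nodes are (frame_index, detection_index) tuples; "S" and
    "T" are the single start and terminating nodes.
    """
    graph = {"S": {}, "T": {}}
    for t, detections in enumerate(frames):
        for i in range(len(detections)):
            graph[(t, i)] = {}
    # S connects to every node of the first slice at zero cost.
    for i in range(len(frames[0])):
        graph["S"][(0, i)] = 0.0
    # Every node of the last slice connects to T at zero cost.
    for i in range(len(frames[-1])):
        graph[(len(frames) - 1, i)]["T"] = 0.0
    # Each node at time t-1 connects to every node at time t.
    for t in range(1, len(frames)):
        for i, p in enumerate(frames[t - 1]):
            for j, q in enumerate(frames[t]):
                graph[(t - 1, i)][(t, j)] = math.dist(p, q)
    return graph


# Hypothetical detections: the number of detections varies from slice to slice.
frames = [[(0, 0), (5, 0)], [(1, 0)], [(1, 1), (6, 0)]]
trellis = build_trellis(frames)
```

The dictionary-of-dictionaries layout is one convenient choice because it keeps the per-slice structure visible in the node labels while remaining usable by generic shortest-path routines.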
FIG. 3A shows a first trellis graph 300 incorporating a single homogeneous type of information. The trellis graph 300 shows a plurality of detections (i.e., vertical slices) along the y-axis for each unit of time represented along the x-axis. The trellis graph 300 represents how the body detection information or the head detection information may be formulated into a trellis graph. As shown in FIG. 3A, there may be a varying number of detections for all frames (e.g., time). As such, there are discrete time steps and a finite, fixed number of steps. Given this basis, a search may be performed to determine a tracking of an object based on a set of body detections or head detections. The track may be the shortest path through the trellis graph 300 that only considers transitions from a first time t−1 to a second time t. The trellis graph 300 shows one such shortest path. It should be noted that further transitions may be considered such as from t−2 to t which implicitly includes a cost of a missed detection at t−1.
FIG. 3B shows a second trellis graph 305 incorporating only body detection information. That is, the body detection nodes are represented as circles. FIG. 3C shows a third trellis graph 350 incorporating only head detection information. That is, the nodes are represented as squares. As discussed above, the trellis graph may include interconnections to nodes prior and subsequent to the respective time unit. Since the trellis graphs 305, 350 relate to using only body detection information or only head detection information, the discrete and finite characteristics may remain. Therefore, each interior node (i.e., non-edge node) at time t is shown as connecting to each subsequent node of the time t+1. Furthermore, Dijkstra's algorithm may be applied to each trellis graph 305, 350 to determine a shortest path 310, 355, respectively, to track a desired person either through the head or the body, respectively.
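The shortest-path search described above may be sketched as follows using Dijkstra's algorithm over a small trellis. The node names and edge costs are hypothetical values chosen for illustration and do not correspond to the nodes shown in FIGS. 3B and 3C.

```python
import heapq


def dijkstra_path(graph, source, target):
    """Return (cost, path) of the cheapest path, or (inf, []) if none exists.

    graph is a dictionary {node: {successor: non-negative cost}}.
    """
    best = {source: 0.0}
    previous = {}
    counter = 0  # tie-breaker so the heap never has to compare node labels
    queue = [(0.0, counter, source)]
    visited = set()
    while queue:
        cost, _, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                counter += 1
                heapq.heappush(queue, (new_cost, counter, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical trellis with three time slices between the S and T nodes.
example_trellis = {
    "S": {"a1": 0.0, "a2": 0.0},
    "a1": {"b1": 1.0, "b2": 4.0},
    "a2": {"b1": 3.0, "b2": 0.5},
    "b1": {"c1": 2.0},
    "b2": {"c1": 3.0},
    "c1": {"T": 0.0},
    "T": {},
}
track_cost, track = dijkstra_path(example_trellis, "S", "T")
```

Because every edge in a trellis points forward in time, a layer-by-layer dynamic program would also find the same path; Dijkstra's algorithm is shown here because it is the method named above.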
As discussed above, the tracking of an object may provide information for subsequent processes such as determining a trajectory of the object via the trajectory engine 135. FIG. 4A shows a first set of determined trajectories 400 based upon homogeneous object tracking information. FIG. 4B shows a second set of determined trajectories 450 based upon homogeneous object tracking information. Thus, the head or body detection information may provide tracking information such that trajectory information may be generated. In FIG. 4A, a trajectory 415 may be determined from a start position 405 to an end position 410. In FIG. 4B, a trajectory 465 may be determined from a start position 455 to an end position 460; a trajectory 480 may be determined from a start position 470 to an end position 475; and a trajectory 495 may be determined from a start position 485 to an end position 490.
FIG. 5A shows trajectory results 500 based upon head-only homogeneous object tracking information. FIG. 5B shows trajectory results 550 based upon body-only homogeneous object tracking information. As discussed above, the head detection information may provide improved precision and recall. The results 500 and 550 illustrate how the head detection information indeed provides this improved precision and recall. Specifically, the output trajectories in the trajectory results 500 are longer, which leads to the greater precision and recall. However, again, this still may not be within a desired range of precision and recall. Thus, by improving the tracking information that forms the basis of generating the trajectory results, the trajectory results may also be improved.
A part-based detector may be trained for head and shoulder detection. Each of the head, left shoulder, and right shoulder is modeled by a HOG filter, and its binary displacements are modeled by a quadratic function of displacement in the x and y directions. An implementation of articulated pose estimation with flexible mixtures of parts may be used and modified for training the detectors. Each part may contain a mixture model corresponding to different postures of the part. The weights of the HOG filters and displacements may be learned by a structured Support Vector Machine (SVM). A conventional dataset such as the “Leeds Sports” pose dataset may be used to train the detectors since a similar variety of head and shoulder postures exist in the illustrated scenes discussed above.
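To illustrate how a quadratic displacement term may trade off against a filter response, the sketch below scores candidate placements of one part relative to an expected anchor position. The weights, responses, and function names are hypothetical; in practice the weights would be learned as described above rather than set by hand.

```python
def quadratic_displacement_cost(dx, dy, weights):
    """Cost wx*dx + wy*dy + wxx*dx^2 + wyy*dy^2 for a displacement (dx, dy)."""
    wx, wy, wxx, wyy = weights
    return wx * dx + wy * dy + wxx * dx * dx + wyy * dy * dy


def best_part_placement(responses, anchor, weights):
    """Pick the location maximizing filter response minus displacement cost.

    responses maps candidate (x, y) locations to a filter score (for example,
    a HOG filter response); anchor is the expected (x, y) location of the part.
    """
    best_location, best_score = None, float("-inf")
    for (x, y), response in responses.items():
        cost = quadratic_displacement_cost(x - anchor[0], y - anchor[1], weights)
        score = response - cost
        if score > best_score:
            best_location, best_score = (x, y), score
    return best_location, best_score


# Hypothetical filter responses for a shoulder part expected at (10, 10).
responses = {(10, 10): 1.0, (12, 10): 2.0, (20, 10): 3.0}
weights = (0.0, 0.0, 0.125, 0.125)  # assumed values, not learned
location, score = best_part_placement(responses, (10, 10), weights)
```

In this example the strongest raw response at (20, 10) is rejected because it lies far from the anchor, which is the intended effect of the quadratic term.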
When the information from body detectors has been received, further processing of this information may be performed. Specifically, the entire objects are directly detected. Those skilled in the art may understand this approach as a “root” filter in deformable part models (DPMs). Frames may be processed either independently by extracting visual features such as HOG or sequentially through background subtraction based measurements.
One manner of potentially taking advantage of body detection information and head detection information is based on DPM in which the two approaches are combined such that a root filter searches for the whole body and its confidence is combined with multiple part filters connected to the root filter using spring-like potentials. Although the DPM may have provided an improvement over root-only and part-only approaches, there is still difficulty with occlusion and vastly different body poses.
These methods use a variety of detection techniques to produce a homogeneous set of detections typically of the full body which may be determined either directly or by fusing multiple parts. Therefore, the data association tracking algorithm is unable to rectify any incorrect fusion of parts when estimating the existence of a full body detection. Another approach included tracking both parts and complete object detections. However, online tracking is performed and greedy assignment algorithms are used to associate parts to objects as well as tracking objects across consecutive frames. Still further, part and full body detections in an offline data association tracking framework may be used. Specifically, an approximate inference algorithm may be employed that first generates full body tracklets and then the inferred parts for each full body detection. If the parts trackers do not give significant support for a full body tracklet, the tracklet is split. The full body detections are merged using network flow to ensure each full body tracklet has a consistent set of part tracklets.
Once the body detection information and the head detection information have been generated, normalized cross correlation may be used to track each head patch for a short period of time (e.g., one second). In this way, short temporal gaps may be completed if, for example, a head is not detected for a few frames. For the body detections, the Hungarian algorithm may be used to make short-term associations between frames for a short batch of frames. After computing short trajectories of both heads and bodies, the velocity of each target may be estimated using a constant velocity model for a predetermined number of consecutive frames (e.g., five consecutive frames).
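Two of the operations described above, normalized cross correlation between patches and a constant velocity estimate over consecutive frames, may be sketched as follows. The patch representation (flat lists of intensities), the least-squares fit used for the velocity, and all names are assumptions for illustration; the Hungarian association step is not shown.

```python
import math


def normalized_cross_correlation(patch_a, patch_b):
    """NCC of two equally sized grayscale patches given as flat lists.

    Returns a value in [-1, 1]; 0.0 is returned if either patch is constant.
    """
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    numerator = sum((a - mean_a) * (b - mean_b) for a, b in zip(patch_a, patch_b))
    energy_a = sum((a - mean_a) ** 2 for a in patch_a)
    energy_b = sum((b - mean_b) ** 2 for b in patch_b)
    denominator = math.sqrt(energy_a * energy_b)
    return numerator / denominator if denominator else 0.0


def constant_velocity(positions):
    """Least-squares velocity per frame for (x, y) positions on consecutive frames.

    Requires at least two positions.
    """
    n = len(positions)
    t_mean = (n - 1) / 2
    spread = sum((t - t_mean) ** 2 for t in range(n))
    vx = sum((t - t_mean) * x for t, (x, _) in enumerate(positions)) / spread
    vy = sum((t - t_mean) * y for t, (_, y) in enumerate(positions)) / spread
    return vx, vy


# Hypothetical head patch seen again under brighter lighting (intensities doubled).
similarity = normalized_cross_correlation([1, 2, 3, 4], [2, 4, 6, 8])

# Hypothetical short trajectory over five consecutive frames.
velocity = constant_velocity([(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)])
```

Normalized cross correlation is insensitive to uniform brightness and contrast changes, which is why the doubled patch above still matches strongly.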
Subsequently, greedy consecutive shortest path methods may be used to estimate the long term associations of heads and bodies in isolation. Such isolation of heads and bodies was described above with regard to the trellis graph 305 for the bodies and the trellis graph 350 for the heads. This method first finds a shortest path through a network. In a subsequent iteration, the nodes in the previous path are removed from the network and the next shortest path is found in the remaining graph. This iterative method is greedy and assumes that previous paths are all correct and optimum. Thus, if the shortest path found in one iteration contains a mistake, that path cannot be corrected in future iterations.
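The iterative procedure described above may be sketched as follows. For brevity, this illustration finds each shortest path with a layer-by-layer dynamic program rather than Dijkstra's algorithm, uses Euclidean distance as the transition cost, and does not model missed detections; these choices, along with the function names and example positions, are assumptions.

```python
import math


def shortest_track(frames, used):
    """Cheapest path visiting one unused detection per frame.

    Returns (cost, [detection index per frame]) or None when no complete
    path remains. used is a set of (frame_index, detection_index) pairs.
    """
    cost, back = {}, {}
    for t, detections in enumerate(frames):
        for i, p in enumerate(detections):
            if (t, i) in used:
                continue
            if t == 0:
                cost[(t, i)] = 0.0
                continue
            candidates = [
                (cost[(t - 1, j)] + math.dist(q, p), j)
                for j, q in enumerate(frames[t - 1])
                if (t - 1, j) in cost
            ]
            if candidates:
                cost[(t, i)], back[(t, i)] = min(candidates)
    last = len(frames) - 1
    ends = [(c, i) for (t, i), c in cost.items() if t == last]
    if not ends:
        return None
    total, i = min(ends)
    path = [i]
    for t in range(last, 0, -1):
        i = back[(t, i)]
        path.append(i)
    path.reverse()
    return total, path


def greedy_tracks(frames):
    """Repeatedly extract the shortest path and remove its nodes from the graph."""
    used, tracks = set(), []
    while True:
        result = shortest_track(frames, used)
        if result is None:
            return tracks
        _, path = result
        tracks.append(path)
        # Earlier paths are never revisited, which is what makes the method greedy.
        used.update((t, i) for t, i in enumerate(path))


# Hypothetical scene: two well-separated targets moving right over three frames.
frames = [[(0, 0), (10, 0)], [(1, 0), (11, 0)], [(2, 0), (12, 0)]]
tracks = greedy_tracks(frames)
```

With well-separated targets, as in this example, the greedy procedure recovers both tracks. When targets are close together, an early path may claim a detection belonging to another target, and because that detection is then removed, later iterations cannot undo the error.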