Video-based human tracking is an important task for many applications such as video surveillance, human-computer interaction and video content retrieval. Two-dimensional (2D) tracking techniques have been developed in which tracking is based on a single video and provides only trajectories of 2D image coordinates. One of the inherent difficulties of such systems is their inability to handle large occlusions in crowded scenes. In addition, 2D approaches are not suitable for applications such as human behavior analysis and event detection, because these applications generally require knowledge of the physical attributes of the tracked person in the 3D world (such as 3D location, velocity and orientation).
Intuitively, these shortcomings can be overcome by using additional videos from different views (3D human tracking). FIG. 1 shows one exemplary setup for 3D human tracking. In FIG. 1, two video cameras, Video 1 and Video 2, capture views of the same region from different positions. Video 1 captures images along trajectories 1 and 2. Correspondingly, Video 2 captures images along trajectories 21 and 22.
As illustrated in FIG. 1, if the same person is detected in multiple views at any single frame, the rays that connect each camera's optical center to the person's image location in that view should, ideally, intersect in 3D space. This not only gives the 3D location of the person but also imposes a strong constraint on the legitimacy of the 2D locations (and thus provides feedback to the human detection result), because a ray through a wrong location cannot intersect the others correctly. The constraints for matching 2D tracking trajectories are even stronger, because each additional frame adds a further constraint. It is possible that, at a single frame, a human detection from one view has a wrong match in other views, but the probability of such a mistake drops significantly once a trajectory becomes long enough.
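The ray-intersection test described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes each camera's optical center and the ray direction through the detected image location are already known in world coordinates (obtaining them from calibration is not shown), and it treats each ray as an infinite line. The function name `closest_point_between_rays` and its return convention are hypothetical. Because measured rays rarely meet exactly, the sketch returns the midpoint of the shortest segment joining the two rays together with the length of that segment; a large gap suggests that the two detections do not belong to the same person.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def closest_point_between_rays(c1, d1, c2, d2):
    """Estimate a 3D location from two viewing rays.

    c1, c2: camera optical centers (assumed known from calibration).
    d1, d2: ray directions through the detected image locations.
    Returns (midpoint, gap): the midpoint of the shortest segment joining
    the two lines, and that segment's length. Rays are treated as
    infinite lines, so points behind a camera are not rejected here.
    """
    w = _sub(c1, c2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p1 = _add(c1, _scale(d1, s))
    p2 = _add(c2, _scale(d2, t))
    diff = _sub(p1, p2)
    gap = math.sqrt(_dot(diff, diff))
    return _scale(_add(p1, p2), 0.5), gap
```

A matching pair of detections should yield a gap close to zero, while a mismatched pair generally yields a noticeably larger one; any threshold on the gap would have to be chosen for the specific camera setup.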
Despite the simplicity of the idea, 3D tracking has received comparatively little attention in the research community, largely due to the extra complexity it adds to the already complex tracking problem. One problem is establishing correspondence between the features in multiple views. Although simple geometric constraints such as planar homography have been exploited, these constraints cannot provide the actual 3D location of the tracked person. Another issue that follows naturally is the choice of features used for establishing the correspondence. A common approach uses extracted foreground blobs and assumes that the bottom of a blob corresponds to the foot position of a person. With a calibrated camera and the further assumption that the person is standing on the ground plane (or somewhere with a known altitude), a transformation between an image and the 3D world can be determined even from a single view. These approaches rely heavily on the results of background subtraction, which is itself a well-known difficult problem. In many cases, an extracted blob may not correspond to any real person, or a single blob may contain multiple contiguous persons. Worse still, in a crowded scene such as the one illustrated in FIG. 2, a person's feet may not be visible at all due to occlusion. Alternatively, the system can detect human heads and use their locations as the feature, because in a typical surveillance camera setup human heads are usually visible even in a crowded scene such as the one shown in FIG. 2. FIG. 2 shows a crowded scene in which a person's feet may be severely occluded or even invisible, while his or her head is usually visible.
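The single-view transformation mentioned above, in which a foot position is located by assuming the person stands at a known altitude, amounts to intersecting one viewing ray with a horizontal plane. The sketch below illustrates this under stated assumptions: the world z axis points up, the camera center and the ray direction through the image point are already expressed in world coordinates (derived from calibration, which is not shown), and the function name `intersect_ray_with_horizontal_plane` is hypothetical.

```python
def intersect_ray_with_horizontal_plane(center, direction, height=0.0):
    """Intersect a viewing ray with the plane z = height.

    center: camera optical center in world coordinates (assumed known).
    direction: ray direction through the image point, e.g. the bottom of
        a foreground blob taken as the foot position.
    height: assumed altitude of the feature; 0.0 means the ground plane.
    Returns the 3D intersection point as a tuple (x, y, z).
    """
    cz, dz = center[2], direction[2]
    if abs(dz) < 1e-12:
        raise ValueError("ray is parallel to the plane")
    t = (height - cz) / dz
    if t <= 0:
        raise ValueError("plane lies behind the camera along this ray")
    return tuple(c + t * d for c, d in zip(center, direction))
```

For example, a camera mounted 3 units above the ground and looking down at 45 degrees along x, `intersect_ray_with_horizontal_plane((0.0, 0.0, 3.0), (1.0, 0.0, -1.0))`, places the foot at approximately (3, 0, 0). The result is only as reliable as the assumed altitude and the image point itself, which is why occluded feet undermine this approach.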
A 2D tracking technique called Multi Hypothesis Tracking (MHT) can be applied, but it is computationally expensive: the MHT system has to maintain a sufficient number of hypotheses, i.e., possible temporal correspondences between observations across different frames. This number may grow exponentially over time when the number of targets in the scene is large, resulting in intractable complexity. The situation worsens when MHT is applied to the 3D tracking problem. In real-world cases, due to image noise and observation error, the rays mentioned earlier may never perfectly converge to a single 3D point. As a result, 2D points from different views are likely to be associated incorrectly, and this ambiguity in spatial correspondences adds another level of complexity to the problem.
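The growth in the number of hypotheses can be illustrated with a deliberately simplified count. The sketch below is not an MHT implementation; it assumes that every target is detected exactly once in every frame, with no false alarms, missed detections, or targets entering or leaving, so that linking two consecutive frames is a one-to-one assignment. The function name `joint_association_count` is hypothetical. Under these assumptions there are n! possible assignments per pair of consecutive frames and (n!)^(T-1) joint hypotheses over T frames.

```python
import math


def joint_association_count(num_targets, num_frames):
    """Count joint temporal-association hypotheses in a simplified model.

    Assumes each of num_targets is observed exactly once per frame, so
    each pair of consecutive frames admits num_targets! one-to-one
    assignments, and the choices multiply across frame transitions.
    """
    if num_targets < 0 or num_frames < 1:
        raise ValueError("need num_targets >= 0 and num_frames >= 1")
    per_transition = math.factorial(num_targets)
    return per_transition ** (num_frames - 1)


if __name__ == "__main__":
    for frames in (2, 5, 10):
        print(frames, joint_association_count(5, frames))
```

Even this idealized count grows exponentially with the number of frames. A real tracker must additionally consider missed detections and false alarms, and the 3D case adds spatial correspondences between views on top of the temporal ones.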