Video systems are well known in the field of security systems. In a typical security system one or more video cameras are placed to provide a field of view of the area under surveillance. These video cameras convert a visual image into electronic form suitable for transmission, recording or analysis. When the security system includes a network of cameras, tracking across cameras with non-overlapping views is a challenging problem. Firstly, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Secondly, the appearance of an object in one camera view might be very different from its appearance in another camera view due to the differences in illumination, pose and camera properties.
There has been a major effort underway in the vision community to develop fully automated surveillance and monitoring systems. Such systems have the advantage of providing continuous active warning capabilities and are especially useful in the areas of law enforcement, national defense, border control and airport security.
One important requirement for an automated surveillance system is the ability to determine the location of each object in the environment at each time instant. This problem of estimating the trajectory of an object as the object moves around a scene is known as tracking and it is one of the major topics of research in computer vision. In most cases, it is not possible for a single camera to observe the complete area of interest because sensor resolution is finite, and the structures in the scene limit the visible areas.
Therefore, surveillance of wide areas requires a system with the ability to track objects while observing them through multiple cameras. Moreover, it is usually not feasible to completely cover large areas with cameras having overlapping views due to economic and/or computational reasons. Thus, in realistic scenarios, the system should be able to handle multiple cameras with non-overlapping fields of view. Also, it is preferable that the tracking system does not require camera calibration or complete site modeling, since the luxury of fully calibrated cameras or site models is not available in most situations.
In general, multi-camera tracking methods differ from each other in whether they assume overlapping or non-overlapping views, whether they rely on explicit calibration or learn the inter-camera relationship, the type of calibration, the use of the 3D position of objects, and/or the features used for establishing correspondences. The multi-camera tracking art is divided into two major categories based on whether overlapping or non-overlapping views are required.
Multi-Camera Tracking Methods Requiring Overlapping Views:
A large amount of work on multi-camera surveillance assumes overlapping views. R. Jain and K. Wakimoto, “Multiple perspective interactive video” (1995), IEEE International Conference on Multimedia Computing and Systems, used calibrated cameras and an environmental model to obtain the 3D location of a person. The fact that multiple views of the same person map to the same 3D location was used for establishing correspondence. Q. Cai and J. K. Aggarwal, “Tracking human motion in structured environments using a distributed camera system” (1999), IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(11): 1241-1247, used multiple calibrated cameras for surveillance. Geometric and intensity features were used to match objects for tracking. These features were modeled as multivariate Gaussians, and the Mahalanobis distance measure was used for matching.
Ting-Hsun Chang and Shaogang Gong, “Tracking multiple people with a multi-camera system” (2001), IEEE Workshop on Multi-Object Tracking, discloses use of the topmost point on an object detected in one camera to compute its associated epipolar line in other cameras. The distance between the epipolar line and the object detected in the other camera was used to constrain correspondence. In addition, height and color were also used as features for tracking. The correspondences were obtained by combining these features using a Bayesian network. S. L. Dockstader and A. M. Tekalp, “Multiple camera fusion for multi-object tracking” (2001), IEEE Workshop on Multi-Object Tracking, also used Bayesian networks for tracking and occlusion reasoning across calibrated cameras with overlapping views. Sparse motion estimation and appearance were used as features.
A. Mittal and L. S. Davis, “M2 tracker: a multi-view approach to segmenting and tracking people in a cluttered scene” (2003), Int. Journal of Computer Vision, 51(3): 189-203, used a region-based stereo algorithm to estimate the depth of points potentially lying on foreground objects and projected them onto the ground plane. The objects were located by examining the clusters of the projected points. In Kang et al., “Continuous tracking within and across camera streams” (2003), IEEE Conf. on Computer Vision and Pattern Recognition, a method is disclosed for tracking in stationary and pan-tilt-zoom cameras. The ground planes in the moving and stationary cameras were registered. The moving camera sequences were stabilized by using affine transformations. The location of each object was then projected into a global coordinate frame, which was used for tracking.
An approach for tracking in cameras with overlapping fields of view (FOVs) that did not require explicit calibration is disclosed in L. Lee, R. Romano, and G. Stein, “Monitoring activities from multiple video streams: Establishing a common coordinate frame” (August 2000), IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8): 758-768. The camera calibration information was recovered by matching motion trajectories obtained from different views, and plane homographies were computed from the most frequent matches. Explicit calibration was also avoided in S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view” (2003), IEEE Trans. on Pattern Analysis and Machine Intelligence, 25, by using the FOV line constraints to hand off labels from one camera to another. The FOV information was learned during a training phase. Using this information, when an object was viewed in one camera, all the other cameras in which the object was visible could be predicted. Tracking in individual cameras had to be resolved before handoff could occur.
Most of the above-mentioned tracking methods require a large overlap in the FOVs of the cameras. This requirement is usually prohibitive in terms of cost and computational resources for surveillance of wide areas.
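As an illustration of the Mahalanobis-distance matching attributed above to Cai and Aggarwal, the following is a minimal sketch only. It assumes a two-dimensional feature vector per object and one Gaussian model (mean and 2x2 covariance) per tracked object; the function names, the feature choice, and the sample values are hypothetical and are not taken from the cited work.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from a Gaussian (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Quadratic form dx^T * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

def best_match(observation, models):
    """Return the label whose Gaussian model is closest to the observation.

    models maps a label to a (mean, cov) pair (hypothetical layout).
    """
    return min(models, key=lambda label: mahalanobis_2d(observation, *models[label]))

# Hypothetical feature models, e.g. (height, mean intensity) in arbitrary units.
models = {
    "person_a": ((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),
    "person_b": ((10.0, 10.0), ((4.0, 0.0), (0.0, 4.0))),
}
print(best_match((1.0, 1.0), models))
```

Unlike a plain Euclidean distance, the covariance scales each feature by its variability, so a model with a wide spread tolerates larger deviations than a tightly clustered one.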
Multi-Camera Tracking Methods for Non-Overlapping Views:
To track people in an environment not fully covered by the camera fields of view, Collins et al. developed a system consisting of multiple calibrated cameras and a site model. See R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade, “Algorithms for cooperative multi-sensor surveillance” (2001), Proceedings of the IEEE, 89(10): 1456-1477. Normalized cross correlation of detected objects and their location on the 3D site model were used for tracking.
T. Huang and S. Russell, “Object identification in a Bayesian context” (1997), Proceedings of IJCAI, presents a probabilistic approach for tracking vehicles across two cameras on a highway. The solution presented was application specific: vehicles traveled in one direction, each vehicle was in one of three lanes, and the solution was formulated for only two calibrated cameras. The appearance was modeled by the mean color of the whole object, which is not enough to distinguish between multi-colored objects such as people. Transition times were modeled as Gaussian distributions, and the initial transition probabilities were assumed to be known. The problem was transformed into a weighted assignment problem for establishing correspondence. The approach of Huang and Russell trades off correspondence accuracy against solution-space coverage, which forces it to commit early and possibly make erroneous correspondences.
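To make the notion of a weighted assignment problem concrete, the following minimal sketch picks the one-to-one pairing of observations in two cameras that maximizes the product of pairwise match probabilities (equivalently, minimizes the sum of negative log probabilities). The brute-force search over permutations, the function name, and the probability values are assumptions for illustration; this is not the solver used by Huang and Russell.

```python
import math
from itertools import permutations

def best_assignment(prob):
    """Return a tuple m where observation i in camera A is paired with m[i] in camera B.

    prob[i][j] is an assumed probability that observation i in camera A and
    observation j in camera B are the same object. The matrix must be square.
    Exhaustive search is only practical for a handful of objects.
    """
    n = len(prob)
    best, best_cost = None, math.inf
    for perm in permutations(range(n)):
        cost = 0.0
        for i, j in enumerate(perm):
            p = prob[i][j]
            # Zero probability makes the whole pairing impossible.
            cost += math.inf if p <= 0 else -math.log(p)
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

# Hypothetical match probabilities for two vehicles seen in each camera.
print(best_assignment([[0.9, 0.1], [0.2, 0.8]]))
```

For larger problems a polynomial-time method such as the Hungarian algorithm would replace the exhaustive search.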
V. Kettnaker and R. Zabih, “Bayesian multi-camera surveillance” (1999), IEEE Conf. on Computer Vision and Pattern Recognition, pages 1117-123, discloses use of a Bayesian formulation of the problem of reconstructing the paths of objects across multiple cameras. Their system requires manual input of the topology of allowable paths of movement and the transition probabilities. The appearances of objects were represented by using histograms. In Kettnaker and Zabih's formulation, the positions, velocities and transition times of objects across cameras were not jointly modeled. Treating these features separately implies that they are independent, an assumption that does not hold in practice because the features are usually highly correlated.
Ellis et al. determined the topology of a camera network by using a two-stage algorithm. First, the entry and exit zones of each camera were determined; then the links between these zones across seven cameras were found using the co-occurrence of entry and exit events. The system and method of the present invention assume that correct correspondences cluster in the feature space (location and time) while the wrong correspondences are generally scattered across the feature space. The method also assumes that all objects moving across a particular camera pair have similar speed. See T. J. Ellis, D. Makris, and J. K. Black, “Learning a multi-camera topology” (2003), Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.
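The idea of linking zones through the co-occurrence of entry and exit events can be sketched as a histogram of time delays between exits from one zone and entries into another: if the zones are linked and objects move at similar speeds, true pairings pile up in one bin while spurious pairings spread out. The function name, parameters, and timestamps below are hypothetical, and the sketch is not the algorithm of the cited paper.

```python
from collections import Counter

def transit_histogram(exit_times, entry_times, max_delay, bin_width):
    """Count exit-to-entry delays, binned by bin_width, for all candidate pairs.

    Every exit is paired with every later entry within max_delay, because the
    true correspondences are unknown at this stage.
    """
    counts = Counter()
    for t_exit in exit_times:
        for t_entry in entry_times:
            delay = t_entry - t_exit
            if 0 <= delay < max_delay:
                counts[int(delay // bin_width)] += 1
    return counts

# Hypothetical timestamps (seconds): each object reappears about 5 s after exiting.
hist = transit_histogram([0, 10, 20], [5, 15, 25], max_delay=30, bin_width=5)
peak_bin, peak_count = hist.most_common(1)[0]
print(peak_bin, peak_count)
```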
Recently, a method was disclosed by A. Rahimi and T. Darrell, “Simultaneous calibration and tracking with a network of non-overlapping sensors” (2004), IEEE Conf. on Computer Vision and Pattern Recognition, to reconstruct the complete path of an object as it moved in a scene observed by non-overlapping cameras and to recover the ground plane calibration of the cameras. They modeled the dynamics of the moving object as a Markovian process. Given the location and velocity of the object from the multiple cameras, they estimated the most compatible trajectory with the object dynamics using a non-linear minimization scheme. Their scheme assumes that the correspondence of the trajectories in different cameras is already known. In contrast, establishing correspondence is the very problem to be solved.
The present invention contributes a system and method to determine correspondences between objects tracked by plural cameras when the tracks are separated in space and time, using space-time features and appearance features of the object. The space-time probability between cameras is learned using Parzen windows, and the appearance probabilities are learned from the distribution of Bhattacharyya distances between appearance models; both are used in establishing correspondences between camera tracks. Through the method of the present invention, object tracks from plural cameras are automatically evaluated to determine correspondences between tracks, thus tracking an object moving around the area covered by the cameras.
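The two estimators named above can be illustrated with a minimal sketch. It assumes a one-dimensional feature (an inter-camera transit time) with a Gaussian kernel for the Parzen window, and uses the sqrt(1 - BC) form of the Bhattacharyya distance between histograms, where BC is the Bhattacharyya coefficient. The kernel, the bandwidth, the distance variant, and all names and values are assumptions for illustration, since this section does not specify them.

```python
import math

def parzen_density(x, samples, bandwidth):
    """Gaussian-kernel Parzen window estimate of the density at scalar x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def bhattacharyya_distance(h1, h2):
    """Distance between two histograms with the same bins (normalized internally)."""
    s1, s2 = sum(h1), sum(h2)
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones.
    bc = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - bc))

# Hypothetical training transit times (seconds) between two cameras.
transit_times = [4.8, 5.1, 5.0, 5.3, 4.9]
print(parzen_density(5.0, transit_times, bandwidth=0.5))
print(parzen_density(12.0, transit_times, bandwidth=0.5))

# Hypothetical 4-bin color histograms of the same object in two cameras.
print(bhattacharyya_distance([10, 20, 30, 40], [12, 18, 33, 37]))
```

A candidate pairing whose transit time falls in a dense region of the learned distribution and whose appearance distance is small would score as a likely correspondence.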
Further objects and advantages of this invention will be apparent from the following detailed description of the presently preferred embodiments which are illustrated schematically in the accompanying drawings.