Tracking an object in a sequence of images of a video is performed in many computer vision applications. Tracking locates a region in each image that matches an appearance of a target object. Object tracking is most frequently performed with a single camera. However, fundamental limitations of using one camera are dealing with occlusions and accurately determining depths. With single-camera methods, occlusion can be detected on a per-pixel basis, or the position of the object can be predicted.
The problem of occlusion is addressed in several different ways. In the case of tracking with a single camera, one can treat the problem implicitly or explicitly. Implicit methods use filtering methods, such as Kalman filtering or particle filtering, to predict the position of the occluded object. Explicit methods often use a generative model, such as video layers, or incorporate an extra hidden process for occlusion into a dynamic Bayesian network, to interpret the image and to explicitly model occlusions.
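As an illustration of the implicit approach, the following is a minimal sketch of a constant-velocity Kalman filter for one image coordinate. While the object is visible, each measurement corrects the estimate; while it is occluded, only the prediction step runs. The class name `Kalman1D`, the helper `track`, the constant-velocity motion model, and the noise values `q` and `r` are assumptions made for this example and are not taken from any particular conventional system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (hypothetical helper)."""
    x: float            # position estimate (pixels)
    v: float            # velocity estimate (pixels per frame)
    p_xx: float = 1.0   # position variance
    p_xv: float = 0.0   # position/velocity covariance
    p_vv: float = 1.0   # velocity variance
    q: float = 0.01     # process noise added per frame (assumed value)
    r: float = 1.0      # measurement noise variance (assumed value)

    def predict(self) -> float:
        # State: x' = x + v, v' = v. Covariance: P' = F P F^T + Q, F = [[1, 1], [0, 1]].
        self.x += self.v
        p_xx = self.p_xx + 2.0 * self.p_xv + self.p_vv + self.q
        p_xv = self.p_xv + self.p_vv
        self.p_xx, self.p_xv = p_xx, p_xv
        self.p_vv += self.q
        return self.x

    def update(self, z: float) -> None:
        # Measurement model H = [1, 0]: only the position is observed.
        s = self.p_xx + self.r
        k_x, k_v = self.p_xx / s, self.p_xv / s
        residual = z - self.x
        self.x += k_x * residual
        self.v += k_v * residual
        # P = (I - K H) P
        p_xx = (1.0 - k_x) * self.p_xx
        p_xv = (1.0 - k_x) * self.p_xv
        p_vv = self.p_vv - k_v * self.p_xv
        self.p_xx, self.p_xv, self.p_vv = p_xx, p_xv, p_vv


def track(kf: Kalman1D, measurements: List[Optional[float]]) -> List[float]:
    """Return one position estimate per frame; None marks an occluded frame."""
    estimates = []
    for z in measurements:
        kf.predict()
        if z is not None:   # object visible: correct the prediction
            kf.update(z)
        estimates.append(kf.x)
    return estimates


kf = Kalman1D(x=0.0, v=1.0)
positions = track(kf, [1.0, 2.0, 3.0, None, None])
```

In the last two frames the filter extrapolates along the estimated velocity, and its position variance `p_xx` grows with every frame for which no measurement arrives, which reflects the increasing uncertainty of a purely predicted position.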
With multiple cameras, one can solve the occlusion problem at the cost of introducing correspondence and assignment problems. That is, most conventional multi-camera systems represent the scene as a collection of ‘blobs’ in 3D space, which are tracked over time. This requires finding the corresponding blobs across multiple images, i.e., the correspondence problem, as well as assigning 2D blobs to the current 3D blobs maintained by the system, i.e., the assignment problem.
However, arranging a multi-camera system in a geometrically complex outdoor scene may be difficult. Multiple cameras can increase the field of view of tracking systems, as well as enable triangulation of 3D positions. However, the presence of significant occlusions is still an issue.
A stereo camera can also be used for object tracking. In that case, depth is typically used as another channel in the images, and tracking is performed on a four-channel image including the R, G, and B colors and depth.
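The four-channel representation can be sketched as follows. The example assumes a rectified stereo pair, for which depth follows from disparity as Z = f·B/d, where f is the focal length in pixels, B is the baseline, and d is the disparity. The function names `disparity_to_depth` and `make_rgbd`, and the nested-list image layout, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBD = Tuple[int, int, int, float]


def disparity_to_depth(disparity: float, focal_px: float, baseline_m: float) -> float:
    """Depth for a rectified stereo pair, Z = f * B / d (assumed camera model)."""
    if disparity <= 0.0:
        # Zero or invalid disparity: treat the point as infinitely far away.
        return float("inf")
    return focal_px * baseline_m / disparity


def make_rgbd(rgb: List[List[RGB]],
              disparity: List[List[float]],
              focal_px: float,
              baseline_m: float) -> List[List[RGBD]]:
    """Append depth as a fourth channel to every pixel of an RGB image."""
    return [
        [(r, g, b, disparity_to_depth(d, focal_px, baseline_m))
         for (r, g, b), d in zip(rgb_row, disp_row)]
        for rgb_row, disp_row in zip(rgb, disparity)
    ]


# A 1x2 image: the first pixel has a valid disparity, the second has none.
rgb_image = [[(255, 0, 0), (0, 255, 0)]]
disparity_map = [[20.0, 0.0]]
rgbd_image = make_rgbd(rgb_image, disparity_map, focal_px=400.0, baseline_m=0.5)
```

A tracker can then compare four-channel pixels or histograms in the same way as it would compare three-channel color values, although the depth channel is only as reliable as the underlying disparity estimate.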
However, conventional stereo methods may have difficulty obtaining useful and reliable depth estimates in occluded regions; see Vaish et al., “Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures,” CVPR 06, pages 2331-2338, 2006. Vaish et al. use an array of 128 cameras, which is only suitable for studio settings. Their results showed that stereo reconstruction performance falls off as the amount of occlusion increases, with generally poor results at greater than 50% occlusion. It is desired to track objects in scenes with greater than 50% occlusion.