Research has produced a large body of work concerning the use of video cameras for detecting and tracking people. Vision-based person tracking remains such an active field of research because it is still poorly solved and is beset by many difficult challenges, among the most significant of which are:
Foreground segmentation: Many person tracking methods rely on a preliminary step that separates novel or dynamic objects in the video (“foreground”) from the rest of the scene (“background”), so that further, more expensive analysis may be focused on them. It is very difficult to perform this segmentation robustly in the presence of changing lighting conditions, dynamic background objects (such as moving foliage), shadows and inter-reflections, similar coloring between the foreground objects and the background, and occasional changes to “static” background objects (such as the moving of a chair).
Person discrimination: Typically, not all novel or dynamic objects segmented as foreground are people. They might also be cars, animals, shopping carts, or curtains blowing in the wind, among other things. Person tracking systems must distinguish people from other types of objects, and cannot simply rely on motion cues.
Occlusions: When people are temporarily blocked from the camera's view by static objects in the scene or by other people, tracking systems frequently err. For example, they often swap the identities of tracked people who occlude each other while crossing paths. In addition, when a person passes behind a large object and then re-emerges, tracking systems often fail to associate the emerging person with the one who disappeared.
Track confusion: When tracking several people simultaneously, systems often struggle to associate a constant identity with each tracked individual, even when occlusions are somewhat minimal. For instance, people can approach each other closely, perhaps holding a book together or embracing. They can quickly change appearance, perhaps by removing a hat or a bag, or by simply turning around. They can also change their velocity rapidly. All of these factors create great difficulties for algorithms that rely heavily on appearance feature matching or trajectory prediction.
While most vision-based person tracking methods operate primarily on color or grayscale video, interest in augmenting this input space with depth (or disparity) imagery has grown as hardware and software for computing this data from stereo cameras have recently become much faster and cheaper. Depth data has great potential for improving the performance of person tracking systems because it:
Provides shape and metric size information that can be used to distinguish people from other foreground objects;
Allows occlusions of people by each other or by background objects to be detected and handled more explicitly;
Permits the quick computation of new types of features for matching person descriptions across time; and
Provides a third, disambiguating dimension of prediction in tracking.
Several person detection and tracking methods that make use of real-time, per-pixel depth data have been described in recent years. Most of these methods analyze and track features, statistics, and patterns directly in the depth images themselves. This methodology is not as fruitful as one might hope, however, because today's stereo cameras produce depth images whose statistics are far less clean than those of standard color or monochrome video. For multi-camera stereo implementations, which compute depth by finding small area correspondences between image pairs, unreliable measurements often occur in image regions of little visual texture, as is often the case for walls, floors, or people wearing uniformly-colored clothing, so that much of the depth image is unusable. Also, it is not possible to find the correct correspondences in regions, usually near depth discontinuities in the scene, which are visible in one stereo input image but not the other. This results in additional regions of unreliable data, and causes the edges of an object in a depth image to be noisy and poorly aligned with the object's color image edges.
Even at pixels where depth measurements typically are informative, the sensitivity of the stereo correspondence computation to very low levels of imager noise, lighting fluctuation, and scene motion leads to substantial depth noise. For apparently static scenes, the standard deviation of the depth value at a pixel over time is commonly on the order of 10% of the mean—much greater than for color values produced by standard imaging hardware.
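The noise figure quoted above can be checked on any recorded sequence of a static scene by computing, at each pixel, the ratio of the temporal standard deviation of depth to its temporal mean. The following is a minimal sketch of that measurement, not part of any described system; the function name relative_depth_noise and the convention that a depth of zero marks an invalid pixel are assumptions made for illustration.

```python
from statistics import mean, pstdev


def relative_depth_noise(frames):
    """Per-pixel temporal noise of a static-scene depth sequence.

    frames: list of depth images, each a list of rows of depth values,
            all with the same shape (hypothetical input layout).
    Returns an image of stddev/mean ratios; None where the mean depth
    is not positive (assumed here to mean "no valid measurement").
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    noise = []
    for r in range(rows):
        noise_row = []
        for c in range(cols):
            samples = [frame[r][c] for frame in frames]
            m = mean(samples)
            # Population standard deviation over time, relative to the mean.
            noise_row.append(pstdev(samples) / m if m > 0 else None)
        noise.append(noise_row)
    return noise
```

A ratio near 0.1 at a pixel corresponds to the "10% of the mean" level of depth noise described above.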
To combat these problems, some very recent person tracking methods do not analyze the raw depth images directly. Instead, they use the metric shape and location information inherent in the original “camera-view” depth images to compute occupancy maps of the scene as if it were observed by an overhead, orthographic camera.
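The idea of an overhead occupancy map can be illustrated with a short sketch. It assumes, purely for illustration, a pinhole camera with horizontal focal length fx and principal point column cx (both in pixels), an optical axis parallel to the ground, and depth values given as metric distance Z along that axis. None of these parameters or names come from the methods referred to above, and a real system would first apply a calibrated rotation and translation into world coordinates. Each valid depth pixel is back-projected to a ground-plane position (X, Z) and counted in the grid cell it falls into.

```python
def occupancy_map(depth, fx, cx, cell_size, x_range, z_range):
    """Bin back-projected depth pixels into an overhead (plan-view) grid.

    depth:     list of rows of metric depth Z; 0 or None marks invalid pixels
    fx, cx:    assumed pinhole intrinsics (focal length, principal column)
    cell_size: side length of one ground-plane cell, same unit as depth
    x_range, z_range: (min, max) ground-plane extent covered by the grid
    Returns grid[z_bin][x_bin] holding the number of pixels per cell.
    """
    x_min, x_max = x_range
    z_min, z_max = z_range
    n_x = round((x_max - x_min) / cell_size)
    n_z = round((z_max - z_min) / cell_size)
    grid = [[0] * n_x for _ in range(n_z)]
    for row in depth:
        for u, z in enumerate(row):
            if not z or z <= 0:
                continue  # skip pixels with no usable depth
            x = (u - cx) * z / fx  # lateral position from the pinhole model
            xi = int((x - x_min) // cell_size)
            zi = int((z - z_min) // cell_size)
            if 0 <= xi < n_x and 0 <= zi < n_z:
                grid[zi][xi] += 1
    return grid
```

Raw pixel counts are a simplification: an object close to the camera covers more pixels than the same object farther away, so a practical implementation would likely weight each pixel by the surface area it represents.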