Video surveillance is a fast-growing market and is becoming increasingly widespread across a broad range of applications. It is used today in numerous areas, such as crime prevention, security of private and public areas, abnormal event detection, traffic monitoring, customer behaviour analysis, and general data gathering.
Despite ever-increasing usage, mainstream video surveillance has strong inherent limitations, arising from the way it is used, which lead to poor performance, in particular for solving crimes and offenses. Basically, video surveillance consists in streaming camera footage to be recorded and displayed in real time to human operators. Unfortunately, only a very limited fraction of camera images can be viewed in real time by humans; the remaining recorded footage is used after the fact for batch or forensic activities. However, such forensic after-action viewing is rarely used in practice, both because it is often too late to be useful at that point and because retrieving and tracking people such as offenders across images from several cameras is a time-consuming task.
To cope with such difficulties, Video Content Analysis (VCA) software modules have been developed to perform automatic video analysis so as to trigger alarms, to make video surveillance far more responsive in real time, and to make it easier to exploit recorded footage after the fact, for example for forensic activities or for batch analysis tasks.
Tracking VCAs are used in many video surveillance applications, in particular for security applications. A main purpose of tracking VCAs is to detect and track the displacements of individual targets (such as humans or vehicles).
Tracking VCAs can be implemented in different system architectures such as mono-camera tracking, multi-camera tracking, and re-identification.
Mono-camera tracking basically consists in tracking the displacements of individual targets in the field of view of a single camera. Multi-camera tracking (also known as overlapping fusion) aims at tracking the displacements of individual targets when they are in the field of view of several different cameras at the same time (the cameras share a partly common field of view). Re-identification (also known as non-overlapping fusion, or sparse camera tracking) is directed to tracking the displacements of individual targets across several remote cameras which do not share a common field of view.
Mono-camera tracking technology has seen impressive progress in the last couple of years due to the introduction of innovative machine-learning-based methods. In particular, very efficient mono-camera tracking algorithms for human detection are based on these methods and make it possible to perform robust, real-time detection and tracking.
Most of the current mono-camera tracking algorithms are able to use positions, trajectories, and advanced appearance cues to solve tracking issues.
Although mono-camera tracking provides reliable results, tracking errors are unavoidable.
Overlapping of the fields of view (FoV) of cameras in a video-surveillance system used for tracking objects, for example for tracking people in streets, makes it possible to increase tracking accuracy and to solve occlusion problems that may occur in a scene when a tracked object is hidden by another object.
More precisely, a main goal of using cameras having overlapping fields of view is to track objects by combining data from a set of overlapping cameras (viewing at least partly the same scene, i.e., possibly with only a partial overlap of their FoV) and to establish a correspondence across multiple views (track assignment).
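As an illustration of track assignment, the following minimal sketch pairs the targets seen by two overlapping cameras using position cues only. It assumes each camera already reports target positions in a shared ground-plane coordinate system (in metres); the function name `assign_tracks`, the greedy nearest-neighbour strategy, and the `max_dist` gate are illustrative assumptions, not a method described in this document.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def assign_tracks(
    cam_a: Dict[str, Point],
    cam_b: Dict[str, Point],
    max_dist: float = 1.0,
) -> List[Tuple[str, str]]:
    """Greedily pair track ids from two cameras by ground-plane distance.

    cam_a / cam_b map a per-camera track id to an (x, y) position expressed
    in a common ground-plane frame (assumed to come from calibration).
    Pairs farther apart than max_dist (hypothetical gate) are never matched.
    """
    # All candidate pairs, closest first.
    candidates = sorted(
        (math.dist(pa, pb), id_a, id_b)
        for id_a, pa in cam_a.items()
        for id_b, pb in cam_b.items()
    )
    used_a, used_b = set(), set()
    pairs: List[Tuple[str, str]] = []
    for dist, id_a, id_b in candidates:
        if dist > max_dist:
            break  # remaining candidates are even farther apart
        if id_a in used_a or id_b in used_b:
            continue  # each track is matched at most once
        used_a.add(id_a)
        used_b.add(id_b)
        pairs.append((id_a, id_b))
    return pairs


if __name__ == "__main__":
    camera_a = {"a1": (0.0, 0.0), "a2": (5.0, 5.0)}
    camera_b = {"b1": (5.2, 5.1), "b2": (0.1, -0.1), "b3": (20.0, 20.0)}
    print(assign_tracks(camera_a, camera_b))
```

In this sketch a target seen by only one camera (such as `b3`) is simply left unmatched. A greedy pairing is the simplest choice; it can pick the wrong pairs when several targets lie within the gate of one another.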
There exist solutions derived from the ones implemented in networks of radars used for tracking planes. According to these solutions, the tracking results obtained by individual radars are combined in a data fusion algorithm. Such techniques can be used within networks of cameras to track targets based on fusion of visual features obtained from different cameras.
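One common way to combine position estimates from several sensors is inverse-variance weighting, sketched below. It is shown only as a generic example of data fusion: the source does not state which fusion algorithm is used, and the function name `fuse_positions` and the per-estimate scalar variance are assumptions made for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fuse_positions(estimates: List[Tuple[Point, float]]) -> Tuple[Point, float]:
    """Fuse independent (x, y) estimates by inverse-variance weighting.

    Each entry is ((x, y), variance), where variance > 0 is an assumed scalar
    uncertainty for that sensor's estimate. Returns the fused position and
    the variance of the fused estimate.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    if any(var <= 0 for _, var in estimates):
        raise ValueError("variances must be strictly positive")
    # More certain sensors (smaller variance) get larger weights.
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, estimates)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, estimates)) / total
    return (x, y), 1.0 / total


if __name__ == "__main__":
    # Two cameras observe the same target; the second is twice as certain.
    fused, variance = fuse_positions([((0.0, 0.0), 1.0), ((3.0, 0.0), 0.5)])
    print(fused, variance)
```

The fused variance is never larger than the smallest input variance, which is the sense in which combining sensors can improve accuracy under the independence assumption.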
However, a problem with such an approach is that it is mainly based on location and trajectory cues. Therefore, it requires a very fine calibration of the cameras, so that the positions of the pixels in the images obtained from these cameras are associated very accurately with real-world positions. There are also high risks of confusion when targets are close to one another or when there are occlusions. Although many occlusion risks are removed through the use of several overlapping cameras with different points of view, there is still a high risk of confusion between targets when they are close to one another. Finally, such a solution requires multiple cameras sharing a common field of view from different points of view to decrease the risk of occlusions, which is expensive.
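To make the calibration requirement concrete, the sketch below maps an image pixel to a ground-plane position using a planar homography, a standard model for points lying on a flat ground. The matrix `H_EXAMPLE` and the function name `pixel_to_ground` are hypothetical values chosen for illustration, not calibration data from any real camera.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def pixel_to_ground(H: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Map pixel (u, v) to ground-plane coordinates with a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    # Divide by the homogeneous coordinate to obtain metric coordinates.
    return x / w, y / w


# Hypothetical calibration: 1 pixel is about 0.01 m, with a perspective term.
H_EXAMPLE = [
    [0.01, 0.0, 0.0],
    [0.0, 0.01, 0.0],
    [0.0, 0.001, 1.0],
]

if __name__ == "__main__":
    print(pixel_to_ground(H_EXAMPLE, 100.0, 200.0))
```

Because the result depends on every entry of the matrix, an error in the estimated homography translates directly into an error in the real-world position, which is why location-based fusion is sensitive to calibration quality.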
To increase the reliability of such methods, tracking of objects through several cameras can be further based on data correspondences between images acquired from different cameras. To that end, features are extracted from images acquired by the cameras of a video-surveillance system and then compared. Such features can be, for example, color histograms. Accordingly, tracking of objects is determined as a function of data correspondences between images and of the relative positions of the cameras from which the images are obtained.
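A color-histogram comparison of this kind can be sketched as follows. The example assumes each detected target is available as a list of RGB pixels and compares two normalized histograms by histogram intersection; the function names, the bin count, and the choice of intersection as the similarity measure are illustrative assumptions rather than details given in the source.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_histogram(pixels: Iterable[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram with bins_per_channel**3 bins."""
    step = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantize each channel, clamping to the last bin for safety.
        ri = min(int(r / step), bins_per_channel - 1)
        gi = min(int(g / step), bins_per_channel - 1)
        bi = min(int(b / step), bins_per_channel - 1)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    if count == 0:
        raise ValueError("no pixels given")
    return [h / count for h in hist]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms of equal size."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    red_patch = [(200, 10, 10)] * 50
    blue_patch = [(10, 10, 200)] * 50
    h_red = color_histogram(red_patch)
    print(histogram_intersection(h_red, h_red))
    print(histogram_intersection(h_red, color_histogram(blue_patch)))
```

With such coarse bins, the same surface imaged at a noticeably different brightness by another camera can fall into different bins and score as dissimilar, so the similarity value depends on the cameras as well as on the target.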
Unfortunately, the results obtained using solutions based on such a method are of poor quality, since the cameras have different fields of view, poses, image properties, optics, and so on, which makes appearance-based features largely unreliable for comparison across cameras.
Consequently, there is a need for improving target tracking accuracy using images from several cameras.