With the increased focus on the surveillance of both private and public areas, there has been a substantial increase in the number and sophistication of the cameras and sensors used to monitor these extended areas. An example of such a surveillance task is the monitoring of an airport or train station, which may involve many hundreds of cameras, each potentially providing a live feed to a centralized monitoring station where it is viewed by human operators. The operators may have a number of goals, such as observing customer behaviour or identifying threats to the public or to infrastructure.
This ability to employ multiple cameras has been facilitated by real-time digital video cameras that transfer their live image information via standard network protocols such as the Internet Protocol (IP), making the addition of further cameras to an existing network as easy as connecting an IP camera to a central hub, whether wirelessly or directly by cable. The IP camera is then assigned either a dynamically or a statically allocated IP address and can commence streaming live video data almost immediately.
However, while this ease of increasing the number of cameras in a network surveillance system means that more extensive areas may be monitored at higher resolution, the large amount of video information streamed to a centralized monitoring station quickly results in information overload for the human operators viewing it. As a result, security personnel who are tasked with monitoring this information are not able to monitor these extended areas effectively.
To address these shortcomings of large-scale network surveillance systems, data analysis methods have been developed that attempt to analyse the incoming video information and determine whether the behaviour of the objects or people being viewed departs from “normal”, with a view to presenting monitoring personnel only with video of those behaviours initially classified as abnormal. To this end, these systems, which may be a combination of hardware and software, attempt to build an understanding of the paths or tracks that “targets” may take between the fields of view of each of the cameras.
This “activity topology” information is accordingly the foundation of many fundamental tasks in networked surveillance, such as tracking an object across the network. Deriving the activity topology of a network of cameras requires not only estimating the relative positions of surveillance cameras with overlapping fields of view, but also characterising the motion of targets between non-overlapping pairs of cameras. Although in principle the activity topology could be derived manually for a small set of cameras, this approach clearly does not scale to large network surveillance systems, where individual cameras may frequently be added, moved or malfunction.
A number of prior art approaches attempt to estimate the activity topology of a network of cameras. Typically, these approaches either require training data, such as correspondences between paths or tracks in different images or camera views, to be supplied a priori, or rely on observing the motion of targets for extended periods of time as they move through the area viewed by the network of cameras, accumulating appearance and disappearance correlation information in an attempt to estimate the path that a target will take.
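The appearance/disappearance correlation idea can be sketched as follows. This is a minimal illustration, not any particular prior-art system: the event timestamps, the function name and the 30-second search window are all hypothetical. Targets disappearing from one camera and appearing at another a consistent interval later produce a peak in a lag histogram, suggesting a path between the two views.

```python
from collections import Counter

def transit_time_histogram(disappearances, appearances, max_lag=30):
    """Count, for each lag (in seconds), how often a disappearance at
    camera A is followed by an appearance at camera B within max_lag.
    A sharply peaked histogram suggests a direct path between the cameras."""
    hist = Counter()
    for t_out in disappearances:
        for t_in in appearances:
            lag = t_in - t_out
            if 0 < lag <= max_lag:
                hist[lag] += 1
    return hist

# Hypothetical timestamps (seconds): targets leaving camera A, entering camera B.
leaving_a = [10, 52, 97]
entering_b = [15, 57, 102, 200]
hist = transit_time_histogram(leaving_a, entering_b)
print(hist.most_common(1))  # dominant lag hints at a consistent A -> B transit time
```

Note that even this toy version compares every disappearance against every appearance for a single camera pair, which foreshadows the scaling problem discussed next.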
These methods all rely either on human intervention or on observing and analysing large amounts of video data in order to determine the activity topology. The problem is compounded by the fact that comparisons must be made between every pair of cameras in the network. As the number of pairs grows with the square of the number of cameras in the network, techniques based on exhaustive pairwise comparisons of large volumes of data soon become infeasible.
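The quadratic growth in pairwise comparisons can be made concrete with a short calculation (a sketch; the camera counts shown are arbitrary):

```python
def num_camera_pairs(n):
    # Number of unordered camera pairs among n cameras: n * (n - 1) / 2,
    # which grows as O(n^2).
    return n * (n - 1) // 2

# Doubling the network size roughly quadruples the comparison workload.
for n in (100, 200, 400):
    print(n, "cameras ->", num_camera_pairs(n), "pairs")
```

For a network of 400 cameras there are already nearly 80,000 pairs to compare, which is why exhaustive pairwise analysis of video data does not scale.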
Another class of methods estimates the calibration and orientation information relating each camera in the network to a common frame of reference on the basis of commonly viewed features. These methods do not characterise the activity topology, are susceptible to failing to find the required number of common image features, and rely on large overlaps and complete connectivity between fields of view.