Effortless navigation across cameras: In video surveillance, following a suspect who moves across the fields of view of multiple cameras in a large indoor or outdoor environment can be very challenging.
Using a traditional surveillance solution, an operator must first memorize the layout of the facility (i.e. its map) and the location of each camera. Surveillance cameras often have a pan-tilt-zoom (“PTZ”) capability, meaning that they can rotate arbitrarily and zoom in to see far-away details. Remembering what each camera can see therefore takes significant effort.
Tracking a suspect along a hallway with multiple branches leaves very little margin for error and takes most of a security operator's concentration, making it hard to consider high-level questions such as “does this suspect present a real threat and, if so, what is this suspect trying to achieve?” or to prepare a response (e.g. calling a security guard on site).
PTZ cameras have a limited field of view (e.g. 30 degrees when fully zoomed out) and therefore, at any given moment, are often pointing in the wrong direction. When switching to a PTZ camera in a traditional surveillance solution, the operator must then redirect it manually (e.g. by selecting a preset from a list or by using a joystick), wasting precious seconds during which the suspect may turn a corner and get out of sight.
Using traditional solutions, recovering a suspect who has gone out of sight is highly problematic. The operator must mentally identify the cameras that the suspect is likely to pass eventually, then cycle rapidly through these cameras, hoping to catch the suspect as he passes by. It may take minutes for the suspect to reappear. PTZ cameras can be redirected, but because a human can at best watch a handful of cameras at once, in practice operators have a very hard time recovering suspects who have gone out of sight.
All of these problems are compounded by factors like stress and fatigue. The margin for error when following an armed and dangerous suspect is extremely thin, and security guards often go through hours of monotony, which makes them prone to mistakes and inattention. The ideal solution must require neither memorization nor sustained concentration, and must allow fast and easy recovery from mistakes.
An ideal solution should go beyond simply tracking a suspect and enable the operator to navigate freely in the environment, e.g. to move around and stay generally aware of the current situation in a specific area. Another common need is to let virtual visitors navigate inside, and participate in, a virtual reproduction of a building, tradeshow, shopping center or city.
The ideal solution must also work equally well in real-time situations and during investigations of archived events and video sequences. It must work reliably across a wide range of environments, including facilities with multiple floors (with transitions between floors and between indoor and outdoor areas) and city blocks where cameras may be mounted on roofs, on walls or on moving vehicles (unmanned aerial vehicles, cars, elevators, buses, trains), and so on.
Most video surveillance solutions offer a 2D map to users. A map can help identify possible cameras of interest, but constantly switching attention between the videos and the map distracts operators and increases the chance of missing suspicious activity. When switching cameras, humans also tend to oversimplify the problem and rely on simple cues like geographic proximity, i.e. cameras that appear close to the suspect's last known position on the 2D map. Such simple criteria are not optimal for identifying relevant cameras. For instance, high-end PTZ cameras can zoom 30×, so a far-away camera that can point in the right direction often offers a superior view of the action.
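The 30× zoom argument can be checked with simple pinhole-camera arithmetic. The sketch below is only an illustration under stated assumptions: a 1920-pixel-wide image, the 30° wide-angle field of view mentioned earlier, zoom modeled as a pure focal-length multiplier, and a target facing the camera. The function names and the two camera distances are hypothetical.

```python
import math


def pixels_per_meter(h_resolution_px: int, hfov_deg: float, distance_m: float) -> float:
    """Horizontal pixel density on a target facing the camera (pinhole model)."""
    # Width of the scene covered by the image at the given distance.
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)
    return h_resolution_px / scene_width_m


def zoomed_hfov_deg(wide_hfov_deg: float, zoom: float) -> float:
    """Field of view after optical zoom, assuming zoom multiplies the focal length."""
    # tan(FOV / 2) is inversely proportional to focal length.
    half_rad = math.atan(math.tan(math.radians(wide_hfov_deg) / 2.0) / zoom)
    return math.degrees(2.0 * half_rad)


# Hypothetical scenario: a nearby camera at 10 m, fully zoomed out,
# versus a far-away PTZ camera at 100 m, zoomed in 30x.
near = pixels_per_meter(1920, 30.0, 10.0)
far = pixels_per_meter(1920, zoomed_hfov_deg(30.0, 30.0), 100.0)
print(f"near: {near:.0f} px/m, far: {far:.0f} px/m, ratio: {far / near:.1f}")
# near: 358 px/m, far: 1075 px/m, ratio: 3.0
```

Under these assumptions pixel density scales with zoom divided by distance, so a camera ten times farther away still resolves the target three times more finely than the nearby one.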
A large body of prior art addresses the general problem of navigating in a 2D or 3D environment. Relatively few approaches constrain the problem to images and videos taken from nearby locations. Techniques such as Microsoft Photosynth and those published by Noah Snavely et al. (e.g. [Finding Paths through the World's Photos, SIGGRAPH 2008]) assume a high density of nearby cameras and gradual, small changes in camera position and orientation. They do not work reliably when camera coverage is very sparse and orientations differ significantly, which is the common scenario in video surveillance applications. They also do not explicitly handle occluders like walls.
Some commercial solutions advertise capabilities to simplify the tracking of suspects. They use overly simplistic approaches, such as presenting up/down/left/right buttons to the operator which, once clicked, switch to other cameras according to hard-coded rules. In practice, these techniques are of limited use: for instance, they do not work when multiple hallway branches are visible, they do not take full advantage of PTZ camera capabilities, they do not work with panoramic cameras, they require extensive setup time, and they do not handle cameras that translate (e.g. in an elevator).
There are automated techniques to track suspects across one or more cameras, but they all suffer from significant drawbacks. For instance, high-end PTZ cameras often include a so-called auto-tracking feature. This feature typically relies on simple background subtraction [https://computation.llnl.gov/casc/sapphire/background/background.html] to identify movement in the scene, and moves the camera to keep that movement in frame. While occasionally reliable in simple scenarios, such as a single person moving without occlusion in front of a camera, this solution does not handle transitions across multiple cameras, crowds, natural motion in the scene (e.g. water, trees moving in the wind), etc.
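A naive auto-tracker of the kind described above can be sketched in a few lines. The version below is an assumption for illustration, not any vendor's implementation: it keeps a running-average background model, flags pixels whose grey level differs from it by more than a threshold, and converts the centroid of those pixels into a pan/tilt correction using a linear pixel-to-degree approximation. The class, function and parameter names (`alpha`, `threshold`) are hypothetical. Because every changing pixel contributes to the centroid, swaying trees, water or a crowd pull the camera just as much as the suspect does.

```python
import numpy as np


class RunningAverageTracker:
    """Naive motion detector: running-average background subtraction (illustrative)."""

    def __init__(self, first_frame: np.ndarray, alpha: float = 0.05, threshold: float = 25.0):
        self.background = first_frame.astype(np.float64)
        self.alpha = alpha          # background learning rate (assumed value)
        self.threshold = threshold  # grey-level difference counted as motion (assumed value)

    def update(self, frame: np.ndarray):
        """Return the (row, col) centroid of moving pixels, or None if nothing moved."""
        frame = frame.astype(np.float64)
        foreground = np.abs(frame - self.background) > self.threshold
        # Slowly blend the current frame into the background model.
        self.background = (1.0 - self.alpha) * self.background + self.alpha * frame
        if not foreground.any():
            return None
        rows, cols = np.nonzero(foreground)
        return float(rows.mean()), float(cols.mean())


def recenter_command(centroid, frame_shape, hfov_deg: float, vfov_deg: float):
    """Pan/tilt offsets in degrees that bring the centroid to the image center.

    Linear pixel-to-angle approximation; positive tilt means "look up".
    """
    height, width = frame_shape
    dx = centroid[1] - width / 2.0
    dy = centroid[0] - height / 2.0
    return dx * hfov_deg / width, -dy * vfov_deg / height


# Synthetic example: a bright 10x10 "person" appears in an empty 100x200 scene.
empty = np.zeros((100, 200), dtype=np.uint8)
tracker = RunningAverageTracker(empty)
frame = empty.copy()
frame[20:30, 150:160] = 255
centroid = tracker.update(frame)
pan, tilt = recenter_command(centroid, frame.shape, hfov_deg=60.0, vfov_deg=40.0)
print(centroid, round(pan, 2), round(tilt, 2))  # (24.5, 154.5) 16.35 10.2
```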
More complex video analytics methods try to separate the suspicious person or object from the rest of the movement, but all known techniques are unreliable in complex real-life scenarios, e.g. large numbers of people walking in multiple directions in a possibly dynamic environment (snow, rain, smoke). For the time being at least, only humans can make the intelligent decisions needed to follow a specific individual in a crowded scene.
Tracking can also be performed by identifying individuals, e.g. through biometrics such as facial recognition, or through RFID tags and GPS sensors. These techniques all have limitations. Facial recognition requires a good view of the face, and no known method is perfect, so false negatives and false positives are very frequent even in ideal scenarios. RFID and GPS require extra hardware and often the cooperation of the individual being tracked. None of these solutions gives the operator much control when he wants to navigate without tracking a specific individual, simply to stay aware of nearby activity.
There is thus a need for a more effective method for navigating across multiple geographically related images or videos, especially when following a suspect using video surveillance cameras.
Cooperative Control of Cameras: A related challenge is the effective and intuitive monitoring of a large outdoor area. Monitoring a large outdoor area (e.g. with dozens or hundreds of cameras surrounding a facility) is challenging because each camera provides only a limited point of view. Operators often suffer from a “tunnel effect” because they see only a small amount of information at a time.
Most 2D video surveillance solutions used in practice do not provide sufficient spatial context, i.e. it is not clear how each camera relates to the others. For instance, if an individual in a crowd is pointing at a person visible in another camera, it is very hard for a human to immediately grasp whom he is pointing at, because the two cameras are presented separately and traditional solutions offer no intuitive mapping between them.
The Omnipresence 3D software application includes a 3D video fusion capability that displays many real-time or archived videos realistically on top of a 3D map. (This is sometimes also referred to as 3D video draping or 3D video projection.) For each pixel in a 3D map viewport, a calculation identifies which fixed, panoramic or PTZ camera has the best view of that pixel, and the 3D fusion is performed according to the precise, continuously updated position, direction and field of view (“FOV”) of each camera. This provides spatial context, since it is immediately clear how two cameras visible in the 3D viewport are spatially interrelated, and it reduces the “tunnel effect” problem, since cameras that point close to each other are automatically “stitched” in 3D to provide a panoramic view.
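The per-pixel camera selection step can be illustrated with a deliberately simplified scoring rule. The sketch below is not the Omnipresence 3D algorithm; it assumes each camera's field of view is a circular cone, ignores occlusion (a real renderer would typically also test visibility, e.g. with depth maps), and scores a camera by the pixel density it would devote to the 3D surface point under a viewport pixel. The `Camera` fields, function names and example cameras are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    name: str
    position: tuple      # (x, y, z) in meters
    direction: tuple     # unit vector along the optical axis
    fov_deg: float       # current (possibly zoomed) field of view, modeled as a cone
    resolution_px: int   # pixels across the field of view


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def view_score(cam: Camera, point) -> float:
    """Pixels per meter the camera would devote to `point`, or 0.0 if it cannot see it."""
    v = _sub(point, cam.position)
    dist = math.sqrt(_dot(v, v))
    if dist == 0.0:
        return 0.0
    cos_angle = _dot(v, cam.direction) / dist
    if cos_angle < math.cos(math.radians(cam.fov_deg) / 2.0):
        return 0.0  # point lies outside the viewing cone
    footprint_m = 2.0 * dist * math.tan(math.radians(cam.fov_deg) / 2.0)
    return cam.resolution_px / footprint_m


def best_camera(cameras, point):
    """Return the camera with the highest score for this surface point, or None."""
    score, cam = max(((view_score(c, point), c) for c in cameras),
                     key=lambda sc: sc[0], default=(0.0, None))
    return cam if score > 0.0 else None


cams = [
    Camera("A", (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 60.0, 1920),   # close, wide angle
    Camera("B", (50.0, 0.0, 0.0), (-1.0, 0.0, 0.0), 2.0, 1920),  # far, zoomed in
    Camera("C", (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), 60.0, 1920),   # looking elsewhere
]
print(best_camera(cams, (10.0, 0.0, 0.0)).name)  # B
```

With this scoring, a point visible to both A and B is assigned to the distant, zoomed-in camera B, consistent with the earlier observation that geographic proximity alone is a poor selection criterion.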
One limitation of this 3D fusion approach is that PTZ cameras may not be pointing at the optimal locations. A simple solution consists of providing basic user control, e.g. letting the user click on a location in the 3D map and having the system identify and redirect one or a few PTZ cameras that can see that location.
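The simple click-to-redirect control can be sketched as follows, under simplifying assumptions: each PTZ camera is described by a hypothetical position, maximum useful range and tilt limits; pan is measured counter-clockwise from the +x axis of the map; and occlusion by walls or buildings is ignored. The closest cameras that can reach the clicked point receive pan/tilt commands. Note that each camera is chosen independently of the others. All names and values are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class PTZCamera:
    name: str
    x: float
    y: float
    z: float                    # mounting height in meters
    max_range_m: float          # farthest distance with useful detail (assumed parameter)
    tilt_min_deg: float = -90.0
    tilt_max_deg: float = 0.0   # assume the camera cannot look above the horizon


def aim_at(cam: PTZCamera, target):
    """Return (pan, tilt, distance) needed to center `target`, or None if unreachable."""
    dx, dy, dz = target[0] - cam.x, target[1] - cam.y, target[2] - cam.z
    ground = math.hypot(dx, dy)
    dist = math.hypot(ground, dz)
    if dist > cam.max_range_m:
        return None
    pan = math.degrees(math.atan2(dy, dx)) % 360.0  # 0 deg = +x axis, counter-clockwise
    tilt = math.degrees(math.atan2(dz, ground))     # negative = looking down
    if not (cam.tilt_min_deg <= tilt <= cam.tilt_max_deg):
        return None
    return pan, tilt, dist


def cameras_for_click(cameras, target, k: int = 2):
    """Return up to k (name, pan, tilt) commands for the closest cameras that can reach target."""
    candidates = []
    for cam in cameras:
        aim = aim_at(cam, target)
        if aim is not None:
            pan, tilt, dist = aim
            candidates.append((dist, cam.name, pan, tilt))
    candidates.sort()
    return [(name, pan, tilt) for _, name, pan, tilt in candidates[:k]]


ptzs = [
    PTZCamera("A", 0.0, -10.0, 10.0, max_range_m=100.0),
    PTZCamera("B", 30.0, 0.0, 5.0, max_range_m=100.0),
    PTZCamera("C", 500.0, 0.0, 10.0, max_range_m=200.0),  # too far away
]
for name, pan, tilt in cameras_for_click(ptzs, (0.0, 0.0, 0.0)):
    print(f"{name}: pan {pan:.1f} deg, tilt {tilt:.1f} deg")
# A: pan 90.0 deg, tilt -45.0 deg
# B: pan 180.0 deg, tilt -9.5 deg
```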
This approach is limited because each PTZ camera is handled independently. An ideal system would focus each PTZ camera on a different area of the 3D map to provide optimal coverage, referred to hereinafter as Cooperative Camera Control for an Optimal 3D Map Coverage (C3DM). The PTZ cameras would complement one another to provide the operator with an optimal representation of the action occurring in the 3D map, as if he were surveying that 3D space from an aerial viewpoint. This would occur automatically and in real time as the operator moves around the 3D map in the 3D map viewport.
Self-Healing Perimeter: Large critical-security facilities often have a long perimeter defined by fences, natural barriers (e.g. cliffs) and waterfront. One popular security design approach consists of dividing the perimeter into segments and assigning one camera, on or near the perimeter, to monitor each segment constantly.
Occasionally, one or more of these cameras are broken, obscured (e.g. by a cargo ship, rain or sun glare), disconnected or otherwise unusable. When this happens, there is a gap in the perimeter that can be exploited by burglars, illegal immigrants or drug traffickers.
Critical facilities often use PTZ cameras, or fixed cameras mounted on pan-tilt heads, as a solution to this problem: a human can choose a PTZ camera and redirect it to cover the gap. The problem is that, at best, it may take minutes for a human to identify the problem and address it. In practice, in facilities that have many cameras and laxer procedures, it is more likely to take days or even weeks for the problem to be identified and addressed.
The ideal solution would monitor all cameras and, within seconds, identify when a camera has been tampered with or is unusable. The system would then automatically identify one or more PTZ cameras that cover the gap(s) optimally.
There is thus a need for more effective methods of camera control, especially for cooperatively controlling multiple PTZ cameras.