The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
A large-area video sensor network requires efficient coordination of data collected and analyzed by multiple sets of analysis and detection algorithms. The analysis algorithm collects data and builds one or more behavior models that the detection algorithm uses to perform scoring and fusion functions. These algorithms are often CPU intensive, embedded in the device, and specifically modeled and configured for a fixed camera view.
Low CPU and data processing efficiency results in several limitations. For example, video analysis algorithms are CPU intensive. Also, processing all of the video streams for multiple camera systems is prohibitively expensive. Furthermore, abnormal behavior is hard to define and often hides within the normal data. Yet further, the video analysis algorithms yield very low detection accuracy, especially when the motion and background patterns change in the field of view (FOV) of the camera.
Additional limitations result from a fixed local context. For example, most surveillance systems focus on fixed camera views and/or fixed sensor positions. But when contexts change, the video analysis algorithm needs to adapt to the change quickly. The context has multiple levels of detail. At the image level, lighting and focus need to be adapted. At the image analysis level, the background analysis model needs to be adapted. Adapting to a new background can be accomplished using image recognition techniques or using preset models corresponding to the camera views. At the behavior analysis model level, the object motion and background models need to be adapted simultaneously. The problem of adapting to a new behavior model for abnormal behavior assessment is more complicated because a behavior model of long motion trajectories and human behavior may take a long time to establish and is application and context dependent.
Limitations also result from a lack of context awareness. For example, the surveillance algorithm often runs in a place that is close to the camera to reduce the communication cost. But for a large-scale video sensor network, sensor data needs to be fused based on temporal and spatial relationships to obtain a global understanding of the activities. Therefore, having the analytic engine close to the camera does not necessarily solve the problem of dynamically loading and executing the models and fusion engines that are required to adapt to changing contexts and applications across multiple cameras.
Further limitations result from implementation as a closed system. For example, an open source computer vision algorithm library is typically compiled and bound to a fixed application and runs on a dedicated machine for a fixed set of cameras. In other words, there is no open interface to support the exchange of binary libraries, models, and proprietary algorithms for tracking events that cross multiple geographically distributed areas and administrative entities.
Yet further limitations result from a lack of interoperability between analytic systems. For example, surveillance systems work in isolation. Therefore, the models built by surveillance systems for different contexts are different and cannot be exchanged quickly to achieve higher efficiency and productivity in model building.
Limitations of today's surveillance systems are readily appreciated with reference to the following example, in which the problem is to accomplish model re-use and sharing within one camera and between cameras. For example, a security guard may change the settings of a pan-tilt-zoom (PTZ) camera during an investigation and may be unable to return the camera to exactly the same configuration. In this case, the FOV of the camera changes. The system may prefer not to generate alarms until the online model builder builds a mature model (i.e., to prevent spurious alarm generation). Today's systems are limited because they cannot re-use any part of the previously built models, such as models previously produced by the same camera or by a second camera that has a field of view overlapping that of the first camera. As a result, today's systems are not able to evaluate alarm conditions by using partial model data retrieved from the second camera.
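The kind of partial model re-use that is missing in the example above can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that each camera's motion model is a table of observation counts indexed by cells of a shared ground-plane grid; the names MotionModel, seed_from_neighbor, and coverage, as well as the min_count threshold, are hypothetical and are not taken from any existing system.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# Hypothetical index of a cell in a ground-plane grid shared by both cameras.
Cell = Tuple[int, int]


@dataclass
class MotionModel:
    """Hypothetical per-camera model: observation counts per grid cell."""
    counts: Dict[Cell, int] = field(default_factory=dict)

    def observe(self, cell: Cell) -> None:
        self.counts[cell] = self.counts.get(cell, 0) + 1


def seed_from_neighbor(new_fov: Set[Cell], neighbor: MotionModel) -> MotionModel:
    """Copy only the neighbor's cells that fall inside the new field of view."""
    seeded = MotionModel()
    for cell, n in neighbor.counts.items():
        if cell in new_fov:
            seeded.counts[cell] = n
    return seeded


def coverage(model: MotionModel, fov: Set[Cell], min_count: int = 5) -> float:
    """Fraction of FOV cells with at least min_count observations (assumed maturity test)."""
    if not fov:
        return 0.0
    mature = sum(1 for cell in fov if model.counts.get(cell, 0) >= min_count)
    return mature / len(fov)


# Second camera has already observed two of the four cells in the new FOV.
neighbor = MotionModel()
for _ in range(10):
    neighbor.observe((0, 0))
    neighbor.observe((0, 1))
neighbor.observe((5, 5))  # outside the new FOV, so it is not copied

new_fov = {(0, 0), (0, 1), (1, 0), (1, 1)}
seeded = seed_from_neighbor(new_fov, neighbor)
print(coverage(seeded, new_fov))  # 0.5
```

Under these assumptions, alarm conditions could be evaluated immediately in the cells covered by the seeded data, while alarms remain suppressed only in the cells that still have to be learned.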
Problems caused by these limitations are particularly evident with reference to an example involving multi-purpose rooms (such as big lobbies). In such rooms, the site can have different usages, such as for special events. For example, when a temporary reception desk is installed in the area, the usage of the site changes, which causes false alarms even though the field of view of the camera has not changed. The false alarms are generated because the site usage has changed and the current model does not yet capture the change. Today's surveillance systems are capable of suppressing the alarms by detecting that alarm generation has increased and that the alarms are not being acknowledged. However, these systems still initially generate false alarms and later fail to issue genuine alarms. This limitation results from the inability of these systems to allow users to share common behavior models and to develop multiple, different types of assessment criteria using multiple user-defined scores and score fusion functions such as aggregation and normalization. In other words, today's surveillance systems are not able to summarize a behavior model, query based on the summarization, and efficiently receive in response one or more closely matching previously generated models. As a result, these systems are not able to use such a model for behavior assessment and evaluation, employ an online model builder to share and start revising shared models collaboratively, or develop useful behavior scoring and fusion engines that can also be shared, instead of trying to learn from scratch and work in isolation.
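The summarize, query, and match capability that these systems lack can be sketched as follows. The sketch assumes, for illustration only, that a behavior model can be summarized as a fixed-length vector (for example, a normalized occupancy histogram over regions of the site) and that closeness is measured by cosine similarity; the repository contents, the model identifiers, and the min_similarity threshold are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical fixed-length summary of a behavior model.
Summary = List[float]


def cosine_similarity(a: Summary, b: Summary) -> float:
    """Cosine similarity of two equal-length summaries; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_closest_model(
    query: Summary,
    repository: Dict[str, Summary],
    min_similarity: float = 0.8,  # assumed acceptance threshold
) -> Optional[Tuple[str, float]]:
    """Return (model_id, similarity) of the best match, or None if none is close enough."""
    best: Optional[Tuple[str, float]] = None
    for model_id, summary in repository.items():
        sim = cosine_similarity(query, summary)
        if sim >= min_similarity and (best is None or sim > best[1]):
            best = (model_id, sim)
    return best


# Hypothetical shared repository of previously generated model summaries.
repository = {
    "lobby_normal": [0.7, 0.2, 0.1, 0.0],
    "lobby_reception_event": [0.2, 0.1, 0.6, 0.1],
}

# Summary of recent activity after a temporary reception desk is installed.
query = [0.25, 0.1, 0.55, 0.1]
match = find_closest_model(query, repository)
if match is not None:
    print(match[0])  # lobby_reception_event
```

In the multi-purpose room example, a match of this kind would allow a previously built model for a similar site usage to be used for assessment from the outset, rather than generating false alarms while a new model is learned from scratch.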
Still further limitations result from the inability of today's models to associate motion behavior with multiple different object types based on features extracted from real-time observations without knowing what the objects actually are (e.g., a person, a car, or a truck). For the purpose of detecting abnormal behavior and generating alerts, it is not necessary to identify the objects. However, if the system does not distinguish moving object types when building models, the behavior detection generates false alarms and misses genuine alarms. For instance, in a traffic scenario, a camera observes an intersection at which persons, cars, and trucks are expected to exhibit different behaviors. For example, pedestrians are expected to use the crosswalk and sidewalks, and their speed may be slower than that of a car along the road but faster than that of a car in the pedestrian crossing. In a warehouse environment, forklift motion is different from worker motion. For safety reasons, an alert may be generated when a person and a forklift get too close. When the warehouse layout changes, or when goods fill up the docking area, the behavior of the forklifts and workers changes. Therefore, to detect abnormal behavior, different models learned in the past may be tested and used to control the number of false alarms. When the moving object type is not considered together with the motion model, the system cannot automatically detect: (a) a person walking or running on the road instead of on the crosswalk and/or the sidewalk; (b) a vehicle driving on the sidewalk; (c) a truck driving in a no-truck lane, etc. The inability of today's surveillance systems to handle different object types is caused in part by a failure to associate the motion model with an appearance model. In other words, the systems cannot build object type specific models and execute behavior analyses based on these object type models.
It should be noted that building object type specific models based on appearance models does not require recognition of the specific object type or classification.
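The point that object type specific models do not require object recognition can be illustrated with a minimal sketch. The sketch assumes that each observation carries a simple appearance feature (here a hypothetical pair of normalized size and aspect ratio) and a grid cell, and that appearance prototypes are already available, for example from unsupervised clustering. Observations are grouped by nearest prototype, and a separate occupancy model is kept per group; the groups are anonymous indices and are never labeled as person, car, or truck. The class TypeSpecificModels and the scoring rule are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Feature = Tuple[float, float]  # hypothetical (normalized size, aspect ratio)
Cell = Tuple[int, int]         # hypothetical grid cell in the camera view


class TypeSpecificModels:
    """One occupancy model per anonymous appearance cluster."""

    def __init__(self, prototypes: List[Feature]) -> None:
        self.prototypes = prototypes
        self.counts: Dict[int, Dict[Cell, int]] = {}
        self.totals: Dict[int, int] = {}

    def cluster_of(self, feature: Feature) -> int:
        """Index of the nearest prototype by squared Euclidean distance."""
        def sq_dist(proto: Feature) -> float:
            return sum((p - f) ** 2 for p, f in zip(proto, feature))
        return min(range(len(self.prototypes)),
                   key=lambda i: sq_dist(self.prototypes[i]))

    def observe(self, feature: Feature, cell: Cell) -> None:
        k = self.cluster_of(feature)
        cluster_counts = self.counts.setdefault(k, {})
        cluster_counts[cell] = cluster_counts.get(cell, 0) + 1
        self.totals[k] = self.totals.get(k, 0) + 1

    def abnormality(self, feature: Feature, cell: Cell) -> float:
        """1.0 minus the relative frequency of this cell for the object's cluster."""
        k = self.cluster_of(feature)
        total = self.totals.get(k, 0)
        if total == 0:
            return 1.0  # nothing learned yet for this cluster
        return 1.0 - self.counts.get(k, {}).get(cell, 0) / total


models = TypeSpecificModels(prototypes=[(0.2, 0.4), (0.8, 2.0)])
sidewalk, road = (0, 0), (1, 0)
for _ in range(20):
    models.observe((0.25, 0.5), sidewalk)  # small, tall objects seen on the sidewalk
    models.observe((0.75, 1.9), road)      # large, wide objects seen on the road

print(models.abnormality((0.25, 0.5), road))  # 1.0
print(models.abnormality((0.75, 1.9), road))  # 0.0
```

A single pooled model trained on the same observations would assign the road cell the same score for both objects, which is how ignoring the moving object type leads to false alarms and missed alarms.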
Yet more limitations of today's surveillance systems exist. For example, many of today's surveillance systems do not store models in a decentralized manner (e.g., locally at each camera). Also, today's models are not searchable (e.g., through peer-to-peer search). Further, today's models are not predictive (e.g., in the case of a panning camera controlled by a script). As a result, today's surveillance systems lack scalability.
For the reasons detailed above, most of today's video analytic systems build statistical models for the background and motion flow patterns. There are many such models, built by different algorithms tailored for different applications and environments. Current state-of-the-art systems do teach how to adapt to changes in preset camera PTZ positions and to background image changes. However, for the end user, it takes time and effort to select, fix, and preset the camera views, select algorithms, build the statistical model, and tune the configuration parameters to achieve acceptable performance. It is unlikely that a user can remember which model and algorithm to use and be able to learn and switch the behavior model in real time.
Therefore, processes and architectures to support behavior model level sharing and fast switching are an open area of surveillance system design that needs further work.