Video surveillance systems are used to capture video content of a number of public, private, and government locations. For example, video surveillance systems are commonly used in airports, train stations, stores and shopping centers, factories, and other locations where people, vehicles, etc. are present. The cameras can capture extensive amounts of video content, and the content may be recorded and stored by the surveillance system for a period of time so that the past presence of people, vehicles, etc. can be identified. Manually searching recorded video content captured by a video surveillance system can be extremely labor intensive and time consuming. Video analytics algorithms have been developed that can extract high-level information from the video content captured by the video cameras. The video analytics algorithms can be used to identify objects in the captured video content. Objects in the video content must be identified and classified if a user is to be able to conduct searches on the video content. For example, a user may wish to search for video content that shows vehicles entering or exiting a facility during a predetermined period of time. If the objects, such as vehicles and people, have been identified in the captured video content, search algorithms can be used to identify potentially relevant content without requiring that the user manually review all of the video content captured during the period of interest.
Video analytics algorithms can help to automate object classification. Object classification can include two aspects: (1) feature calculation and (2) feature-based classification. In general, various object features can be used for object classification. An example of a conventional approach to object classification can be found in U.S. Pat. No. 7,391,907, titled “Spurious Object Detection in a Video Surveillance System,” to Venetianer et al., which discusses a system that uses a comprehensive set of metrics related to object features, including: a shape consistency metric, a size consistency metric, a size metric, a texture consistency metric, a color consistency metric, a speed consistency metric, a direction of motion consistency metric, a salient motion metric, an absolute motion metric, and a persistent motion metric. Without camera calibration information, it is difficult to effectively take all of these metrics into consideration for object classification. As a result, only a few selected features are usually used in practical applications. The features commonly used in conventional systems are object size, object aspect ratio (height vs. width), and object shape. For example, the object aspect ratios of a person and a car are usually very different and can serve as a discriminative feature to distinguish between a person and a car in video content. Object aspect ratio can be viewed as a simplified shape feature if it is treated as an approximation of the ratio of the major axis length to the minor axis length of an ellipse fitted to an object.
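As an illustration of how aspect ratio alone can act as a discriminative feature, the following sketch thresholds the bounding-box height-to-width ratio. The function name and threshold values are illustrative assumptions, not values taken from any system cited here:

```python
def classify_by_aspect_ratio(width, height, person_min_ratio=1.5, vehicle_max_ratio=0.8):
    """Classify a detected object by its bounding-box aspect ratio (height / width).

    Upright people tend to be taller than wide, vehicles wider than tall.
    The two thresholds are illustrative assumptions.
    """
    ratio = height / width
    if ratio >= person_min_ratio:
        return "person"
    if ratio <= vehicle_max_ratio:
        return "vehicle"
    return "unknown"
```

In practice such thresholds depend heavily on camera angle and calibration, which is one reason aspect ratio is usually combined with other features such as size.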
Sophisticated features are of growing interest in the computer vision community for object detection and classification, including wavelets (e.g., Haar features), bags of visual words, scale-invariant feature transform (SIFT) features (or the simplified variant SURF), HoF (histogram of optical flow), and HoG (histogram of oriented gradients). These features have been proven theoretically effective in a broad range of applications, including video surveillance. However, so far very few practical systems in the video surveillance domain employ these features, which could be due to their complexity, inefficiency, or unsuitability for the domain.
Published U.S. Patent Application No. US2010/0054535A1, titled “Video Object Classification,” to Brown et al., discusses computing the difference of histograms of oriented gradients (HoG) for each tracked object over a video sequence, monitoring the resulting deformation level (the level of deformation of a person is considered higher than that of a vehicle), and classifying tracked objects through a Maximum A Posteriori (MAP) approach. This approach requires that objects being tracked and classified have a reasonably large size to allow the calculation of histograms, which makes it unsuitable for applications where objects are small or far from the cameras. This method also requires calibration information. In addition, due to the use of MAP, likelihood and prior probabilities over the scene are required at the outset for each object type, which is impractical for many surveillance applications involving a large number of cameras. Furthermore, this method does not classify an object until the track of the object is finished, i.e., when the object disappears. As a result, this method does not work for applications that require real-time alerts on object types.
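The deformation idea can be sketched in simplified form: a single global orientation histogram stands in for a full HoG descriptor, and the mean frame-to-frame histogram difference serves as the deformation level. This is an assumption-laden simplification of the approach described above (no cells, blocks, or MAP classification), intended only to convey the intuition that articulated objects produce larger histogram changes than rigid ones:

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Histogram of gradient orientations over a grayscale image patch.

    A simplified stand-in for a full HoG descriptor: one global,
    magnitude-weighted, normalized orientation histogram.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

def deformation_level(patches):
    """Mean L1 difference between orientation histograms of consecutive
    observations of one tracked object. Higher values suggest an articulated
    object such as a person; lower values suggest a rigid object such as a
    vehicle."""
    hists = [orientation_histogram(p) for p in patches]
    diffs = [np.abs(h1 - h0).sum() for h0, h1 in zip(hists, hists[1:])]
    return float(np.mean(diffs)) if diffs else 0.0
```

Note that the histogram still needs a patch large enough to contain meaningful gradients, which is the small-object limitation noted above.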
Among object type classifiers, there are two main categories of approaches used for video surveillance applications: non-learning based and learning based, both of which are applied to a set of selected object features. Non-learning based classifiers assume that reference (prototype) values of the selected features are available for each object type of interest, calculate the distance between observed feature values and the reference values, and make a classification decision accordingly. Non-learning based classifiers tend to be sensitive to changes in camera setup, lighting, and image noise, and may impose constraints when applied to video surveillance applications.
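A minimal sketch of such a non-learning, prototype-distance classifier follows; the prototype feature values in the usage example are illustrative assumptions (e.g., aspect ratio and footprint area), not values from any cited system:

```python
def classify_by_prototype(features, prototypes):
    """Non-learning classifier: assign the object type whose reference
    (prototype) feature vector is closest in Euclidean distance.

    `prototypes` maps an object type name to its assumed reference
    feature vector; no training is performed.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(prototypes, key=lambda t: dist(features, prototypes[t]))

# Illustrative, hand-picked prototypes: (aspect ratio, apparent size).
prototypes = {"person": (2.0, 0.9), "vehicle": (0.5, 6.0)}
```

Because the prototypes are fixed by hand, any change in camera setup or lighting that shifts the observed feature values degrades the classifier, which is the sensitivity noted above.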
Learning based classifiers include supervised methods and unsupervised methods. Supervised methods (e.g., nearest neighbors, neural networks, and support vector machines) require training data for each class. The training process can be time-consuming and typically must be performed offline for each surveillance camera. To enable an existing classifier to cope with changes in a system, such as camera movements, changed illumination conditions, video noise, or the addition of a new object feature, a new or additional training process is required for supervised approaches. This can limit the application of learning based classifiers to edge-device based video applications, which usually have limited resources in terms of processing power and memory.
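By contrast with fixed prototypes, a supervised method learns its reference values from labeled data. The following nearest-centroid sketch (with illustrative, assumed feature values) shows why labeled examples are needed for every class, and why the training step must be rerun whenever the camera setup or the feature set changes:

```python
def train_nearest_centroid(samples):
    """Supervised training step: compute a per-class centroid from labeled
    feature vectors. `samples` maps a class label to a list of feature
    vectors; every class of interest must supply labeled data."""
    centroids = {}
    for label, vecs in samples.items():
        dim = len(vecs[0])
        centroids[label] = tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))
    return centroids

def predict(centroids, features):
    """Assign the class whose learned centroid is nearest."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(features, centroids[c]))

# Illustrative labeled training data: (aspect ratio, apparent size) per object.
samples = {
    "person": [(2.0, 1.0), (2.2, 0.8)],
    "vehicle": [(0.5, 6.0), (0.4, 5.0)],
}
centroids = train_nearest_centroid(samples)
```

Adding a new feature changes the vector dimensionality, so `train_nearest_centroid` must run again over relabeled data, which illustrates the retraining burden described above.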
Unsupervised methods, e.g., self-organizing map (SOM) and adaptive resonance theory (ART) networks, do not require training data. These methods can build classifiers on the fly, and this type of classifier offers better adaptability than both supervised and non-learning approaches. However, unsupervised methods can suffer from drift in object classification, and special care is required to prevent such drift from occurring.
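The on-the-fly behavior can be sketched with a simple online clustering routine loosely inspired by an ART-style vigilance test; the parameter values are illustrative assumptions, and the small update rate is one simple way to limit the cluster drift noted above:

```python
def assign_cluster(clusters, features, vigilance=1.0, rate=0.1):
    """Online unsupervised clustering sketch: if the nearest cluster center
    lies within the `vigilance` distance, the sample joins that cluster and
    the center moves slightly toward it (a small `rate` limits drift);
    otherwise a new cluster is created. Returns the cluster index.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    if clusters:
        i = min(range(len(clusters)), key=lambda k: dist(features, clusters[k]))
        if dist(features, clusters[i]) <= vigilance:
            c = clusters[i]
            clusters[i] = tuple(ci + rate * (f - ci) for ci, f in zip(c, features))
            return i
    clusters.append(tuple(features))
    return len(clusters) - 1
```

No labels are involved: the routine only discovers clusters, so a separate mapping from cluster indices to meaningful object types is still needed, as discussed for the SOM-ART approach below.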
Published U.S. Patent Application No. 2011/0050897A1, titled “Visualizing and Updating Classification in a Video Surveillance System,” to Cobb et al., discusses a method for object classification that applies an Adaptive Resonance Theory (ART) network to the resulting nodes of a self-organizing map (SOM) neural network. The SOM-ART network processes pixel-level micro-features to adaptively learn and organize the micro-features into object-type clusters. This is an unsupervised learning approach and requires no training data. However, in addition to its high demands on processing power and memory, this method provides no effective way to make use of an important property: a tracked object has the same object type throughout the scene. Moreover, it requires a manual assignment to map the resulting clusters to meaningful object types.