The advent of digital video, network cameras, and networked video recorders has resulted in a new generation of smart surveillance systems. These systems utilize analytic modules in connection with computer vision techniques to automatically extract useful information from surveillance videos. Smart surveillance systems may provide users with real-time surveillance alerts, in addition to enabling users to easily search surveillance data.
Visual object classification is a key component of smart surveillance systems. The ability to automatically recognize objects in images is essential for a variety of surveillance applications, such as the recognition of products in retail stores for loss prevention, automatic identification of vehicles and vehicle license plates, and recognition of one or more persons of interest. However, object classification using conventional techniques continues to be very challenging.
Over the past several decades, many different approaches have been proposed to automatically classify objects in images and videos. For example, bag of words and scale-invariant feature transform (SIFT) features have been popular methods for large-scale classification problems involving multiple object classes. However, these techniques are designed to handle high-resolution still images and are not appropriate for classifying moving objects in low-resolution surveillance videos. See, e.g., D. Lowe, “Distinctive Image Features From Scale-Invariant Keypoints,” IJCV, Vol. 60, No. 2, pp. 91-110, 2004; and S. Lazebnik et al., “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” In CVPR, 2006. Other techniques involve scanning entire video frames by applying specialized detectors, such as pedestrian or car detectors, at each image location. See, e.g., P. Viola et al., “Detecting Pedestrians Using Patterns of Motion and Appearance,” In ICCV, 2003; N. Dalal et al., “Histograms of Oriented Gradients for Human Detection,” In CVPR, 2005; and H. Schneiderman et al., “A Statistical Method for 3D Object Detection Applied to Faces and Cars,” In CVPR, 2000. However, these approaches often require excessive amounts of training data to learn robust classifiers and suffer from object pose variability.
In general, conventional object classification systems are inefficient at real-time processing and consume large amounts of memory. Further, conventional systems cannot handle arbitrary camera views, such as different view angles and zooms, which may cause variations in object appearance, shape, and speed. Conventional classification techniques require a static camera view, which allows for easy differentiation between a background image and moving objects. Conventional classification techniques also have difficulty discerning objects under varying illumination conditions and handling strong shadow effects, which may distort the apparent size of objects. Furthermore, conventional techniques have difficulty distinguishing groups of people from vehicles, which may have similar shapes and sizes in the same camera view.