1. Technical Field
The present invention generally relates to object detection and, more particularly, to object detection in crowded scenes.
2. Description of the Related Art
Security incidents in urban environments span a wide range, from property crimes to violent crimes and terrorist events. Many large urban centers are currently developing security infrastructures geared mainly toward counterterrorism, with secondary applications for police and emergency management purposes. In this context, the ability to automatically search for objects of interest, particularly vehicles, is extremely important. Recently, detection systems have been afforded the capability to answer queries such as “Show me all the two-door red vehicles in camera X from time Y to Z”. A prerequisite for this capability is accurately locating vehicles in the video images, so that attribute extraction and indexing can be performed.
Consider the goal of detecting vehicles in each frame captured by a static surveillance camera monitoring an urban environment. Urban scenarios pose unique challenges for vehicle detection. High volumes of activity data, varying weather conditions, crowded scenes, partial occlusions, lighting effects such as shadows and reflections, and many other factors cause serious problems in real system deployments. Traditional methods based on background modeling generally fail under these difficult conditions, as illustrated in FIGS. 1 and 2. FIG. 1 shows a typical crowded urban scene 100 to which event detection may be applied. FIG. 2 shows the corresponding foreground blobs 200 obtained through background subtraction according to conventional event detection. Note that the prior art approach clusters groups of vehicles into the same blob.
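The blob-merging behavior noted above can be illustrated with a minimal sketch. It assumes a simple per-pixel difference against a fixed background image followed by 4-connected component labeling, which is only one possible form of background subtraction and is not taken from any particular prior art system. The function names, the threshold value, and the toy grayscale frames are all hypothetical.

```python
from collections import deque


def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


def count_blobs(mask):
    """Count 4-connected foreground components using a breadth-first flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


def make_frame(vehicles, rows=6, cols=12, road=100, car=200):
    """Build a toy grayscale frame; each vehicle is a (top, left, height, width) rectangle."""
    frame = [[road] * cols for _ in range(rows)]
    for top, left, height, width in vehicles:
        for y in range(top, top + height):
            for x in range(left, left + width):
                frame[y][x] = car
    return frame


background = make_frame([])                          # empty road
sparse = make_frame([(1, 1, 2, 3), (1, 7, 2, 3)])    # two vehicles with a gap between them
crowded = make_frame([(1, 1, 2, 3), (1, 4, 2, 3)])   # two vehicles touching bumper to bumper

sparse_blobs = count_blobs(foreground_mask(sparse, background))
crowded_blobs = count_blobs(foreground_mask(crowded, background))
print(sparse_blobs, crowded_blobs)
```

In this toy setting the two separated vehicles yield two blobs, while the two touching vehicles yield a single blob, so a blob count alone cannot recover the number of vehicles in the crowded frame.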
Various models and methods have been proposed for appearance-based object detection, in particular vehicle detection. Although appearance-based detectors have achieved very good performance in challenging scenarios, they usually require tedious labeling of thousands of training samples to work well. In addition, most methods run below 15 frames per second on conventional machines, which is undesirable for large-scale surveillance systems that require many video channels to be processed by a single machine.
Co-training and online learning methods alleviate the manual labeling issue while constantly adapting the detector as new data comes in. However, a common limitation of these techniques is inaccuracy in capturing the online data needed to update the classifier correctly.
Several datasets have been proposed for learning and evaluation of vehicle detection algorithms. However, these datasets mostly include vehicle images restricted to frontal/rear and side poses, and the number of vehicle images is on the order of 1000, which, in our opinion, is insufficient to capture the full range of variation in the appearance of cars due to changes in pose, viewpoint, illumination and scale.
Methods for occlusion handling in object detection generally rely on object part decomposition and modeling. In our application, however, these methods are not well suited because the vehicle images are of low resolution. Video-based occlusion handling from the tracking perspective has been addressed, but it assumes that objects are initially far apart before the occlusion occurs.