The exemplary embodiment relates to the automatic analysis of video data and finds particular application in connection with Multi-Object Tracking (MOT), which entails automatically detecting and tracking objects of a known category, such as cars, in video streams.
Generic object detection methods have been used for predicting a limited set of candidate object locations that are likely to contain an object, whatever category it might belong to, by relying on general properties of objects (e.g., contours). For many applications, the ability to detect and locate specific objects in images provides useful information. In Multi-Object Tracking, given a video stream and a semantic class, e.g., “car” or “pedestrian,” the goal is to track individual objects in the class in the frames of the video stream as they move over time. The image regions where the objects are likely to be present are most commonly predicted by rectangles referred to as bounding boxes or windows. The windows can vary in size and aspect ratio, depending on the anticipated size and shape of the object. Object detection is a challenging task, due in part to the variety of instances of objects of the same class, to the variety of imaging conditions (viewpoints, environments, lighting), and to the scale of the search space (typically millions of candidate regions for a single frame).
Existing object detection algorithms cast detection as a binary classification problem: given a candidate window and a candidate class, the goal is to determine whether or not the window contains an object of the considered class. This generally includes computing a feature vector describing the window and classifying the feature vector with a detector, e.g., a binary classifier, such as a linear SVM. To scan a large set of possible candidate windows, the detector may be applied in a sliding window fashion across the frame. In this approach, a window is moved stepwise across the image in fixed increments so that a decision is computed for multiple overlapping windows, and the location with the maximal score identifies the possible new location of the target object. For example, a HOG detector combined with a boosted cascade has been used to link person detections into tracks. See, Breitenstein, et al., “Robust tracking-by-detection using a detector confidence particle filter,” ICCV, pp. 1515-1522, 2009.
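By way of illustration only, the sliding window scheme described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of any cited reference: the function names, the toy mean-intensity feature, and the weight and bias values are all hypothetical, and a practical detector would use richer features (e.g., HOG) and a trained classifier.

```python
from typing import Callable, List, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)
Frame = Sequence[Sequence[float]]   # rows of pixel intensities


def sliding_windows(frame_w: int, frame_h: int,
                    win_w: int, win_h: int, step: int) -> List[Window]:
    """Enumerate windows moved stepwise across the frame in fixed increments."""
    return [(x, y, win_w, win_h)
            for y in range(0, frame_h - win_h + 1, step)
            for x in range(0, frame_w - win_w + 1, step)]


def mean_intensity(frame: Frame, win: Window) -> List[float]:
    """Toy one-dimensional feature vector (assumption): mean pixel value."""
    x, y, w, h = win
    total = sum(frame[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return [total / (w * h)]


def linear_score(features: Sequence[float],
                 weights: Sequence[float], bias: float) -> float:
    """Score of a linear classifier: w . f + b."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def detect(frame: Frame, win_w: int, win_h: int, step: int,
           extract: Callable[[Frame, Window], List[float]],
           weights: Sequence[float], bias: float) -> Tuple[Window, float]:
    """Return the window with the maximal classifier score and that score."""
    frame_h, frame_w = len(frame), len(frame[0])
    best_win, best_score = None, float("-inf")
    for win in sliding_windows(frame_w, frame_h, win_w, win_h, step):
        score = linear_score(extract(frame, win), weights, bias)
        if score > best_score:
            best_win, best_score = win, score
    return best_win, best_score


# Toy 6x6 frame with a bright 2x2 "object" whose top-left corner is (x=3, y=2).
frame = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        frame[r][c] = 1.0

best_window, best_score = detect(frame, 2, 2, 1, mean_intensity, [1.0], -0.5)
print(best_window, best_score)
```

In this toy setting, every overlapping 2x2 window is scored and the single window that fully covers the bright region receives the maximal score.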
In practice, this approach uses windows of different sizes and aspect ratios to detect objects at multiple scales, with different shapes, and from different viewpoints. Consequently, millions of windows are tested per image. The computational cost is, therefore, one of the major impediments to practical implementation. There have recently been attempts to speed up the costly exhaustive search by combining fast-to-compute low-level features with cheap classifiers. For example, Hall et al., “Online, Real-Time Tracking Using a Category-to-Individual Detector,” ECCV 2014, relies on the Aggregated Channel Features of Dollár, et al., “Fast feature pyramids for object detection,” PAMI 2014 (hereinafter, Dollár 2014), and a cascade of boosted classifiers for learning individual-object detectors. The method aims to reduce the cost of each individual feature extraction/classification operation. However, the complexity of the search itself is the same as for the standard sliding window approach.
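The growth in the number of candidate windows can be made concrete with a short calculation. The following is a sketch only: the function names are hypothetical, and the frame size, window shapes, and step used in the example call are arbitrary assumptions chosen for illustration rather than values taken from any cited reference.

```python
from typing import Iterable, Tuple


def count_positions(frame_w: int, frame_h: int,
                    win_w: int, win_h: int, step: int) -> int:
    """Number of placements of one window shape at a fixed step."""
    if win_w > frame_w or win_h > frame_h:
        return 0  # the window does not fit in the frame
    nx = (frame_w - win_w) // step + 1
    ny = (frame_h - win_h) // step + 1
    return nx * ny


def count_all(frame_w: int, frame_h: int,
              shapes: Iterable[Tuple[int, int]], step: int) -> int:
    """Total placements summed over all window sizes and aspect ratios."""
    return sum(count_positions(frame_w, frame_h, w, h, step)
               for w, h in shapes)


# Hypothetical search: 640x480 frame, three base sizes, three aspect ratios.
shapes = [(int(s * a), s) for s in (64, 128, 256) for a in (0.5, 1.0, 2.0)]
print(count_all(640, 480, shapes, step=4))
```

The total scales with the number of window shapes and inversely with the square of the step, which is why reducing the per-window cost alone leaves the size of the search unchanged.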
More recent object detectors have been developed which avoid exhaustive sliding window searches. Instead, they use a limited set of category-agnostic object location proposals, which are generated using general properties of objects (e.g., contours) and which overlap most of the objects visible in an image. These proposals are then ranked using a category-specific classifier. See, for example, van de Sande, et al., “Segmentation as selective search for object recognition,” ICCV, pp. 1879-1886, 2011 (hereinafter, “van de Sande 2011”); Cinbis, et al., “Segmentation driven object detection with Fisher vectors,” ICCV, pp. 2968-2975, 2013 (hereinafter, “Cinbis 2013”); Girshick, et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014. However, such object proposals have not been adapted for tracking.
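The ranking step may be illustrated as follows. This is a minimal sketch, assuming that a small set of proposals and their feature vectors has already been produced by some proposal generator and feature extractor (neither is shown), and that the category-specific classifier is linear. The function name and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def rank_proposals(proposals: Sequence[Box],
                   features: Sequence[Sequence[float]],
                   weights: Sequence[float],
                   bias: float = 0.0) -> List[Tuple[float, Box]]:
    """Score each proposal with a linear classifier and sort, best first."""
    scored = [(sum(f * w for f, w in zip(feat, weights)) + bias, box)
              for box, feat in zip(proposals, features)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Three hypothetical category-agnostic proposals with toy 2-D features.
proposals = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 4, 4)]
features = [[0.2, 0.1], [0.9, 0.3], [0.5, 0.5]]
ranked = rank_proposals(proposals, features, weights=[1.0, -1.0])
for score, box in ranked:
    print(box, score)
```

Only the proposals are scored, so the classifier is evaluated a handful of times rather than once per sliding window position.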
Existing MOT algorithms rely on recent improvements in the field of object detection. See, for example, Breitenstein, et al., “Online Multi-Person Tracking-by-Detection from a Single, Uncalibrated Camera,” IEEE PAMI, 33:9, pp. 1820-1833 (2011), hereinafter, “Breitenstein 2011”; Pirsiavash, et al., “Globally-optimal greedy algorithms for tracking a variable number of objects,” CVPR, pp. 1201-1208, 2011, hereinafter, “Pirsiavash 2011”; Milan, et al., “Continuous Energy Minimization for Multi-Target Tracking,” PAMI, 36:1, pp. 58-72, 2014; Geiger, et al., “3D Traffic Scene Understanding from Movable Platforms,” PAMI, 36:5, pp. 1012-1025, 2014, hereinafter, “Geiger 2014”; Hall, et al., “Online, Real-Time Tracking Using a Category-to-Individual Detector,” ECCV, 2014; Collins, et al., “Hybrid Stochastic/Deterministic Optimization for Tracking Sports Players and Pedestrians,” ECCV, 2014. Tracking-by-detection (TBD) is a standard method for object tracking in monocular video streams. It relies on the observation that an accurate appearance model is enough to reliably track an object in a video. Therefore, most MOT approaches look for the best way to link detections into tracks, and thus rely directly on object detection performance.
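To illustrate what linking detections into tracks involves, the following sketch associates per-frame detections greedily by bounding box overlap (intersection over union). This greedy overlap rule is a simple stand-in chosen for illustration and is not the method of any of the references cited above. The function names, the box format (x1, y1, x2, y2), and the 0.3 overlap threshold are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = List[Tuple[int, Box]]            # list of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_detections(frames: Sequence[Sequence[Box]],
                    min_iou: float = 0.3) -> List[Track]:
    """Greedily extend tracks with the best-overlapping detection per frame."""
    tracks: List[Track] = []
    active: List[int] = []  # tracks that received a detection in the last frame
    for t, dets in enumerate(frames):
        # All (overlap, track, detection) candidates, best overlap first.
        pairs = sorted(((iou(tracks[k][-1][1], dets[j]), k, j)
                        for k in active for j in range(len(dets))),
                       reverse=True)
        used_tracks, used_dets, next_active = set(), set(), []
        for score, k, j in pairs:
            if score < min_iou:
                break
            if k in used_tracks or j in used_dets:
                continue
            tracks[k].append((t, dets[j]))
            used_tracks.add(k)
            used_dets.add(j)
            next_active.append(k)
        # Every unmatched detection starts a new track.
        for j, box in enumerate(dets):
            if j not in used_dets:
                tracks.append([(t, box)])
                next_active.append(len(tracks) - 1)
        active = next_active
    return tracks


# Two hypothetical objects; the second one is not detected in the last frame.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(52, 50, 62, 60), (2, 0, 12, 10)],
    [(4, 0, 14, 10)],
]
tracks = link_detections(frames)
print([len(track) for track in tracks])
```

Because a track is extended only when a detection is available, a missed or spurious detection directly shortens, breaks, or corrupts a track, which reflects the dependence on detection performance noted above.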