Object tracking, also called visual tracking, is the process that detects, extracts, identifies and locates the target in a sequence of images or a video. It is a fundamental computer vision task with a wide range of real-world applications, including traffic flow monitoring, medical diagnostic, visual surveillance and human-computer interaction.
Most of the existing appearance-based tracking methods have been posed as a tracking-by-detection problem. According to the model-construction mechanism, statistical modeling is classified into three categories, including generative, discriminative and hybrid generative-discriminative. One major drawback is that they rely on low-level hand-crafted features which are incapable to capture semantic information of targets, not robust to significant appearance changes, and only have limited discriminative power. While much breakthrough has been made within several decades, the problem can be very challenging in many practical applications due to factors such as partial occlusion, cluttered background, fast and abrupt motion, dramatic illumination changes, and large variations in viewpoint and pose.
Deep learning has dramatically improved the state-of-the-art in processing text, images, video, speech and many other domains such as drug discovery and genomics, since proposed in 2006.
Especially, Convolutional Neural Networks (CNNs) have recently been applied to various computer vision tasks such as image classification, semantic segmentation, object detection, and many others.
Such great success of CNNs is mostly attributed to their outstanding performance in representing visual data. Object tracking, however, has been less affected by these popular trends since it is difficult to collect a large amount of training data for video processing applications and training algorithms specialized for object tracking are not available yet, while the approaches based on low-level handcraft features still work well in practice. Several recent tracking algorithms have addressed the data deficiency issue by transferring pretrained CNNs on a large-scale classification dataset such as ImageNet.
Although these methods may be sufficient to obtain generic feature representations, its effectiveness in terms of tracking is limited due to the fundamental inconsistency between classification and tracking problems, i.e., predicting object class labels versus locating targets of arbitrary classes.