Visual tracking, one of the fundamental problems in computer vision, has found wide application. Although much progress has been made in the past decade, tremendous challenges remain in designing a robust tracker that can handle significant appearance changes, pose variations, severe occlusions, and background clutter.
To address these issues, existing appearance-based tracking methods adopt either generative or discriminative models to separate the foreground from the background and to distinguish co-occurring objects. One major drawback of these methods is that they rely on low-level hand-crafted features, which are incapable of capturing the semantic information of targets, are not robust to significant appearance changes, and have only limited discriminative power.
Driven by the emergence of large-scale visual data sets and the rapid growth of computational power, Deep Neural Networks (DNNs), especially convolutional neural networks (CNNs), with their strong capability for learning feature representations, have demonstrated record-breaking performance in image classification and object detection. Unlike hand-crafted features, features learned by CNNs from massive annotated visual data covering a large number of object classes (such as ImageNet) carry rich high-level semantic information and are strong at distinguishing objects of different categories. These features generalize well across data sets. Recent studies have also shown that such features are robust to data corruption. Their neuron responses are strongly selective for object identity, i.e., for a particular object only a subset of neurons respond, and different objects activate different neurons.