One objective of object tracking is to determine the size and location of a target object in each frame of a video sequence, given the initial state of the target. This capability is important for a variety of applications, such as efficiently and accurately tracking pedestrians in railway stations and airports, monitoring vehicles on the road, and tracking faces for human-computer interfaces.
An important yet difficult aspect of object tracking is the automatic analysis of video data. In particular, present Multi-Object Tracking (MOT) applications, which automatically detect and track multiple objects of a known category in videos, have inherent problems. The main paradigm for object tracking in monocular video streams is Tracking-By-Detection (TBD), which relies on an object detector specific to the target class and often reduces to optimally linking detections into tracks, a procedure known as Association-Based-Tracking (ABT). These methods depend directly on recent progress in object detection. However, an available pre-trained detector may not be optimal in practice.
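For illustration only, the linking of detections into tracks described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the association procedure of any particular tracker: it uses a simple greedy matching on bounding-box overlap (intersection over union), whereas an optimal linking would typically solve an assignment problem. The names `iou`, `associate`, and `min_iou`, the box format `(x1, y1, x2, y2)`, and the threshold value are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    min_iou: float = 0.3,  # hypothetical threshold, chosen for illustration
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily link each track to at most one detection in the new frame.

    Returns a mapping from track id to detection index, plus the indices
    of detections left unmatched (candidates for starting new tracks).
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        (
            (iou(box, det), track_id, det_idx)
            for track_id, box in tracks.items()
            for det_idx, det in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, track_id, det_idx in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap too little to be linked
        if track_id in matches or det_idx in used:
            continue  # each track and each detection is linked at most once
        matches[track_id] = det_idx
        used.add(det_idx)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched
```

Calling `associate` with two existing tracks and three detections, one of which overlaps neither track, links each track to its overlapping detection and reports the third detection as unmatched. Because every link is scored from detector output, a detector that performs poorly in the deployment setting degrades the resulting tracks directly.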
Existing causal TBD algorithms propagate the previously detected locations of a variable number of targets forward in time via target-specific appearance and motion models. Therefore, TBD depends first and foremost on an accurate object detector. An accurate appearance model might not, however, always be available in real-world applications, whether because of practical constraints (e.g., speed, hardware, or laws), a lack of related training data (e.g., prohibitive data collection costs), or the rarity of the target category. Essentially, this is a typical domain adaptation problem, in which a detector pre-trained in the source domain will most likely perform sub-optimally in the target domain.