Visual tracking is the problem of determining the position of a target object in each frame of a video sequence. There are two main approaches to encoding such information. The first and more common representation of an object location is a bounding box, defined by the position of its four corners (as shown in FIG. 1a). This simplifies many tasks such as user selection, window adaptation and background modeling, at the cost of assuming that the center of this window coincides with the center (of mass) of the object. Moreover, if scale and/or rotation are allowed to change, the span of the window should coincide with the span of the object. In essence, the bounding box provides a low-order model of the object support. In a more sophisticated representation, some algorithms decompose objects into interacting simple-shaped parts in order to relax rigidity constraints. Time is of course an additional dimension in such representations.
Ideally, the object changes its appearance and its shape more slowly than its location, and thus its most likely position is the one closest, in some feature or model space, to a template obtained from the first image and possibly updated along the sequence. In a real situation, even small perturbations, for example correlation noise, occlusions, or deformations, introduce drifting effects in determining the position of the bounding box in the next frame. Moreover, treating the current position of the bounding box as the valid center of mass of the object can be misleading, for example for non-symmetric objects, and can give a wrong initialization for the following frame, as exemplified in FIG. 1a. Such drifts, accumulated over time, can invalidate the assumptions described in the previous paragraph and leave the tracker unable to recover.
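The mismatch between the window center and the center of mass of a non-symmetric object can be illustrated with a small sketch. It is not part of the described method: the binary-mask representation, the L-shaped example and the helper names `box_center` and `center_of_mass` are assumptions introduced here purely for illustration.

```python
from typing import List, Tuple

Mask = List[List[int]]  # hypothetical binary object support: 1 = object pixel


def _object_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates of all object pixels in the mask."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no object pixels")
    return pts


def box_center(mask: Mask) -> Tuple[float, float]:
    """Center of the tightest axis-aligned bounding box around the object."""
    pts = _object_pixels(mask)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


def center_of_mass(mask: Mask) -> Tuple[float, float]:
    """Mean position of the object pixels (uniform mass per pixel assumed)."""
    pts = _object_pixels(mask)
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


# An L-shaped (non-symmetric) object: the two centers do not coincide.
l_shape = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
bx, by = box_center(l_shape)
cx, cy = center_of_mass(l_shape)
offset = (bx - cx, by - cy)  # non-zero: the box center is biased away from the mass
```

For this shape the bounding-box center lies in the empty part of the window, so a tracker that re-centers on it starts the next frame with a systematic offset.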
A second approach to visual tracking is to formulate the problem as the spatial segmentation of the object at each frame. While more precise in determining the object support, it requires additional complexity, which can be acceptable for some applications (for example, rotoscoping) but prohibitive in other contexts (for example, multi-object real-time surveillance). In terms of tracking performance, however, it has been shown that the top-performing trackers are those that assume a simple-shaped representation. These representations are more robust to deformations, remain stable for longer and can run at very high speeds.
In principle, a tracker searches for the optimal position according to some cost criterion, for example minimum position error, highest correlation, or best detection response. This is the case for different classes of trackers, such as optimal-filtering-based trackers, descriptor-based trackers or, more recently, tracking-by-detection approaches. As suggested in a recent analysis, the top trackers achieve, according to recent benchmarks, global figures of approximately 80% of correctly tracked frames. It is customarily assumed that the object is tracked in a frame if the ground-truth bounding box and the estimated bounding box overlap in at least some given proportion.
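One common way to make the overlap criterion concrete is the intersection-over-union ratio of the two boxes. The following is a minimal sketch under stated assumptions: the text does not specify the overlap measure or the required proportion, so the intersection-over-union measure, the `(x, y, width, height)` box convention, the default threshold of 0.5 and the function names `overlap_ratio` and `tracked_fraction` are all illustrative choices, not taken from the benchmarks referred to above.

```python
from typing import Sequence, Tuple

# Assumed convention: axis-aligned box as (x, y, width, height).
Box = Tuple[float, float, float, float]


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def tracked_fraction(
    ground_truth: Sequence[Box],
    estimates: Sequence[Box],
    threshold: float = 0.5,  # assumed value; the source only says "some proportion"
) -> float:
    """Fraction of frames whose estimated box overlaps the ground truth enough."""
    if len(ground_truth) != len(estimates):
        raise ValueError("one estimate is required per ground-truth frame")
    if not ground_truth:
        return 0.0
    hits = sum(
        overlap_ratio(g, e) >= threshold for g, e in zip(ground_truth, estimates)
    )
    return hits / len(ground_truth)
```

With such a measure, a reported figure of approximately 80% would correspond to `tracked_fraction` returning about 0.8 over a benchmark sequence, for whatever threshold that benchmark uses.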
For some applications or constrained setups this number can increase. For other application settings, it is barely enough. In the present case, one of the motivations of the proposed method described herein is its application to the problem of automatic object zooming and cropping in user-generated videos, which requires high robustness and long-term functioning but, in contrast, is not demanding in terms of location precision. While Region of Interest (ROI) and saliency-based approaches for video retargeting exist, object-based approaches that rely on existing generic trackers not adapted to this scenario are therefore not convincing.