Object tracking in video data is an important task with many applications, such as surveillance. The output of tracking is the state of the object in every frame. The state is usually defined by the object's position, i.e. its x, y coordinates, and its scale, i.e. its width and height. One conventional method is multi-resolution tracking, in which a scale space is created and then searched for the best location and scale. A scale space is a set of samples generated from a region of interest by rescaling and low-pass filtering. The tracking algorithm assigns a score to each of these samples, and the location and scale with the highest score are output.
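The scale-space search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited method: the mean filter used as the low-pass filter, the nearest-neighbour rescaling, the negative sum-of-squared-differences score, and all function names (`low_pass`, `make_sample`, `search_scale_space`) are hypothetical choices made for the example.

```python
import numpy as np

def low_pass(patch, k=3):
    """k x k mean filter (an assumed, simple low-pass filter) with edge padding."""
    pad = k // 2
    padded = np.pad(patch, pad, mode="edge")
    h, w = patch.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def make_sample(frame, cy, cx, base_h, base_w, scale):
    """Crop a (scale * base) window centred on (cy, cx), low-pass filter it,
    and rescale it to the base size by nearest-neighbour sampling."""
    h = max(1, int(round(base_h * scale)))
    w = max(1, int(round(base_w * scale)))
    y0, x0 = cy - h // 2, cx - w // 2
    if y0 < 0 or x0 < 0 or y0 + h > frame.shape[0] or x0 + w > frame.shape[1]:
        return None  # window falls outside the frame
    patch = low_pass(frame[y0:y0 + h, x0:x0 + w].astype(float))
    rows = (np.arange(base_h) * h / base_h).astype(int)
    cols = (np.arange(base_w) * w / base_w).astype(int)
    return patch[np.ix_(rows, cols)]

def search_scale_space(frame, template, prev_cy, prev_cx, radius, scales):
    """Score every (scale, location) sample and return the best one as
    (score, cy, cx, scale). The score is the negative SSD to the template."""
    base_h, base_w = template.shape
    best = (-np.inf, prev_cy, prev_cx, 1.0)
    for s in scales:
        for cy in range(prev_cy - radius, prev_cy + radius + 1):
            for cx in range(prev_cx - radius, prev_cx + radius + 1):
                sample = make_sample(frame, cy, cx, base_h, base_w, s)
                if sample is None:
                    continue
                score = -np.sum((sample - template) ** 2)
                if score > best[0]:
                    best = (score, cy, cx, s)
    return best
```

In this sketch the template would be built once with `make_sample` at scale 1.0 in the first frame, and `search_scale_space` would be called on each later frame. The number of samples scored per frame is the number of candidate locations multiplied by the number of scales.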
For robust tracking, both location and scale must be estimated so as to prevent drift. However, most applications require tracking to be done in real time, i.e. the time taken to process one input video frame must be small. This limits the number of samples that can be searched.
To tackle this problem, many methods estimate only the location and assume that the change in scale is not significant. This allows them to search at multiple locations but at only one scale, i.e. a fixed width and height. Real-time performance can thus be achieved because the search is restricted to a single scale.
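The trade-off between the number of samples and real-time operation can be illustrated with simple arithmetic. The search radius, the number of scales, the cost per sample, and the frame budget used below are hypothetical values chosen only for the example, and the function names are likewise assumptions.

```python
def num_samples(radius, num_scales):
    """Candidates in a (2 * radius + 1)^2 grid of locations, repeated per scale."""
    return (2 * radius + 1) ** 2 * num_scales

def fits_budget(samples, cost_per_sample_ms, frame_budget_ms):
    """True if scoring all samples fits within the per-frame time budget."""
    return samples * cost_per_sample_ms <= frame_budget_ms

# Hypothetical figures: 0.1 ms per sample, 40 ms per frame (25 frames per second).
single_scale = num_samples(radius=8, num_scales=1)   # 289 samples
multi_scale = num_samples(radius=8, num_scales=33)   # 9537 samples
print(fits_budget(single_scale, 0.1, 40.0))  # True
print(fits_budget(multi_scale, 0.1, 40.0))   # False
```

Under these assumed figures, an exhaustive search over many scales exceeds the frame budget, whereas a search at one scale does not.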
One method for scale estimation uses correlation filters (see NPL 1). In NPL 1, scale estimation is formulated as a regression problem in which the filters are learnt from the target appearance and updated every frame. The regression problem is solved using the Fast Fourier Transform.
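A correlation filter over the scale dimension can be sketched as a ridge regression solved element-wise in the Fourier domain. The formulation below is a generic one assumed for illustration, and whether NPL 1 uses exactly this form is not confirmed here. The Gaussian target, the regularisation weight `lam`, the learning rate `lr`, and all function names are hypothetical.

```python
import numpy as np

def gaussian_target(num_scales, sigma=1.0):
    """Desired regression output: a Gaussian peaked at the centre scale."""
    idx = np.arange(num_scales) - num_scales // 2
    return np.exp(-0.5 * (idx / sigma) ** 2)

def train(features, target):
    """features: (d, S) array holding one d-dimensional descriptor per scale
    sample. Returns the filter numerator A (d, S) and denominator B (S,)."""
    F = np.fft.fft(features, axis=1)
    G = np.fft.fft(target)
    A = np.conj(G)[None, :] * F
    B = np.sum(np.abs(F) ** 2, axis=0)
    return A, B

def update(A, B, features, target, lr=0.025):
    """Per-frame update: running average of numerator and denominator."""
    A_new, B_new = train(features, target)
    return (1 - lr) * A + lr * A_new, (1 - lr) * B + lr * B_new

def detect(A, B, features, lam=1e-2):
    """Correlate the learnt filter with new features. Returns the index of
    the scale with the highest response, and the response itself."""
    Z = np.fft.fft(features, axis=1)
    Y = np.sum(np.conj(A) * Z, axis=0) / (B + lam)
    response = np.real(np.fft.ifft(Y))
    return int(np.argmax(response)), response
```

In this sketch, a cyclic shift of the features along the scale axis moves the response peak by the same number of scale steps, which is how the change in scale would be read off.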
Another method for scale estimation uses a latent Support Vector Machine (SVM) (see NPL 2). In NPL 2, the object scale is treated as the latent, or hidden, variable and the problem is formulated as a latent SVM. The optimization is solved using an iterative coordinate ascent method.
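The alternating structure of such an optimization can be sketched as follows. This is a generic latent SVM loop assumed for illustration and is not the formulation of NPL 2: it alternates between choosing the latent scale with the weights fixed and refitting a linear SVM with the scales fixed. It is written as alternating minimisation of a hinge loss, which corresponds to coordinate ascent on the negated objective. The feature layout, the hyperparameters, and the function names are hypothetical.

```python
import numpy as np

def infer_scales(w, positives):
    """Step 1: with w fixed, pick the best-scoring scale for each positive.
    Each entry of positives is a (num_scales, d) array of candidate features."""
    return [int(np.argmax(cands @ w)) for cands in positives]

def fit_svm(w, X, y, lam=0.01, lr=0.1, steps=200):
    """Step 2: with the latent scales fixed, run batch subgradient descent on
    the L2-regularised hinge loss of a linear SVM (no bias term)."""
    for _ in range(steps):
        active = y * (X @ w) < 1  # samples that violate the margin
        hinge_grad = (y[active][:, None] * X[active]).sum(axis=0) / len(y)
        w = w - lr * (lam * w - hinge_grad)
    return w

def latent_svm(positives, negatives, dim, rounds=5):
    """Alternate between the two steps for a fixed number of rounds."""
    w = np.zeros(dim)
    scales = infer_scales(w, positives)
    for _ in range(rounds):
        pos_feats = np.array([c[s] for c, s in zip(positives, scales)])
        X = np.vstack([pos_feats, negatives])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(negatives))])
        w = fit_svm(w, X, y)
        scales = infer_scales(w, positives)
    return w, scales
```

A fixed number of rounds is used here for brevity. A practical loop would more likely stop when the inferred scales no longer change.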
In PTL 1, the object scale is estimated by calculating the 3D transformation parameters, i.e. the perspective projection matrix. In this method, a projection matrix is learnt that converts 3D points in the real world to 2D points in the camera image.
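The relation between a perspective projection matrix and the apparent object scale can be sketched as follows. The standard pinhole model is assumed, and the matrix is given rather than learnt. The focal length, principal point, and function name are hypothetical values for illustration and are not taken from PTL 1.

```python
import numpy as np

def project(P, points_3d):
    """Map N x 3 world points to N x 2 image points with a 3 x 4 matrix P."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    proj = homog @ P.T
    return proj[:, :2] / proj[:, 2:3]  # divide by the homogeneous coordinate

# Assumed intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
# Camera at the world origin looking along +Z (identity rotation, zero translation).
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

# Two ends of a 1 m tall object, placed at depths of 4 m and 8 m.
near = project(P, np.array([[0.0, 0.0, 4.0], [0.0, 1.0, 4.0]]))
far = project(P, np.array([[0.0, 0.0, 8.0], [0.0, 1.0, 8.0]]))
print(near[1, 1] - near[0, 1])  # 200.0 pixels
print(far[1, 1] - far[0, 1])    # 100.0 pixels
```

Under this model the image height of an object is inversely proportional to its depth, so a known projection matrix gives the expected scale at a given position.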
PTL 2 discloses scale estimation by calculating the contrast-to-variance ratio at each scale sample and selecting the sample with the maximum ratio as the best approximation.
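Selecting the scale that maximises a contrast-to-variance ratio can be sketched as follows. The exact definition of the ratio used in PTL 2 is not given here, so the one below is an assumed stand-in: the squared difference between the mean of the window and the mean of a surrounding ring, divided by the sum of their variances. The ring width `margin`, the constant `eps`, and the function names are hypothetical.

```python
import numpy as np

def contrast_to_variance(image, cy, cx, size, margin=2, eps=1e-6):
    """Assumed ratio for a square window of the given size centred on (cy, cx).
    The window and its ring are assumed to lie inside the image."""
    y0, x0 = cy - size // 2, cx - size // 2
    outer = image[y0 - margin:y0 + size + margin,
                  x0 - margin:x0 + size + margin].astype(float)
    ring_mask = np.ones(outer.shape, dtype=bool)
    ring_mask[margin:margin + size, margin:margin + size] = False
    inner, ring = outer[~ring_mask], outer[ring_mask]
    contrast = (inner.mean() - ring.mean()) ** 2
    return contrast / (inner.var() + ring.var() + eps)

def best_scale(image, cy, cx, sizes):
    """Return the candidate window size with the highest ratio."""
    return max(sizes, key=lambda s: contrast_to_variance(image, cy, cx, s))
```

With this assumed definition, a window smaller or larger than the object mixes object and background pixels in either the window or the ring, which lowers the contrast or raises the variance.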