Tracking an object with abrupt motion, or tracking a specific target in a low frame rate video, is an interesting problem.
Many practical applications, such as small embedded systems that must run in real time and some monitoring applications, have to process low frame rate video. The causes include the need to reduce hardware cost, a low frame rate at the video input source, and low online processing speed (in an online real-time system, the processing speed limits the frame rate of the input data). Low frame rate video is therefore common, but it is difficult to handle in tracking.
Tracking in a low frame rate video is essentially equivalent to tracking an object with abrupt motion. The majority of tracking algorithms depend on motion continuity. The particle filter (reference [1]) uses a motion model to predict the object motion and to direct sampling, so that the search range (the distribution range of the particles) is limited to a smaller subspace; however, it is difficult to predict a change in the position of the target accurately when the target moves abruptly. Other tracking algorithms based on iterative optimization, such as the mean shift algorithm (reference [2]) and the Lucas-Kanade feature point tracking algorithm (reference [3]), basically require that the feature areas to be tracked overlap in two adjacent frames or are very close to each other. These assumptions do not hold for a low frame rate video or for a target with abrupt motion.
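The effect of a motion model on the search range can be illustrated with a minimal sketch. It is not taken from any of the cited references; the constant-velocity model, the noise level `sigma`, the coverage radius, and the function names `propagate` and `fraction_near` are all assumptions chosen for illustration. Particles are propagated by the predicted velocity plus Gaussian noise, and the fraction of particles that land near the true target position is measured for a smooth motion and for an abrupt one.

```python
import random


def propagate(particles, velocity, sigma, rng):
    """Move each (x, y) particle by a constant velocity plus Gaussian noise."""
    vx, vy = velocity
    return [
        (x + vx + rng.gauss(0.0, sigma), y + vy + rng.gauss(0.0, sigma))
        for x, y in particles
    ]


def fraction_near(particles, target, radius):
    """Fraction of particles that fall within `radius` of the target position."""
    tx, ty = target
    hits = sum(
        1 for x, y in particles if (x - tx) ** 2 + (y - ty) ** 2 <= radius ** 2
    )
    return hits / len(particles)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
particles = [(100.0, 100.0)] * 500  # all particles start at the last known position

# Assumed motion model: the target moves about 2 pixels per frame along x.
predicted = propagate(particles, velocity=(2.0, 0.0), sigma=5.0, rng=rng)

# Smooth motion: the target is close to where the model predicted it.
smooth_cov = fraction_near(predicted, target=(103.0, 100.0), radius=10.0)
# Abrupt motion (e.g. many skipped frames): the target is far from the prediction.
abrupt_cov = fraction_near(predicted, target=(160.0, 130.0), radius=10.0)

print(smooth_cov, abrupt_cov)
```

Under these assumed values most particles cover the target in the smooth case, while essentially none of them reach it in the abrupt case, so no observation can be made at the true position.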
Some researchers have noticed this difficulty (although they may not have set out to deal with tracking in low frame rate video), and they have adopted similar solutions, i.e., they all use a detector. K. Okuma, et al. (reference [4]) use a detector trained by Boosting and combine its detection results with zero-order or first-order motion models to form the trial distribution of the particle filter, so as to compensate for inaccurate motion prediction. Such a mixed trial distribution is also adopted in other references (e.g. reference [5]), although it is not specifically intended to solve the problem of tracking in low frame rate video. F. Porikli and O. Tuzel (reference [6]) extend the basic mean shift algorithm to optimize multiple kernels, and the kernels are determined by a detector that finds moving regions through background differencing. With this algorithm they can track pedestrians in 1 fps video, but only on the premise that the video camera is fixed. The above ideas all come down to using an independent detector to direct the search process of an existing tracker when the target motion is difficult to predict.
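The idea of a mixed trial (proposal) distribution can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of reference [4] or [5]: the mixing weight `alpha`, the Gaussian spread `sigma`, the first-order motion prediction, and the function names `sample_mixed_proposal` and `count_near` are hypothetical. Each sample is drawn either around a detector output or around the motion-model prediction.

```python
import random


def sample_mixed_proposal(prev, velocity, detections, alpha, sigma, n, rng):
    """Draw n (x, y) samples from a detector / motion-model mixture.

    With probability alpha a sample is centered on a randomly chosen detection;
    otherwise it is centered on the first-order motion prediction prev + velocity.
    """
    samples = []
    for _ in range(n):
        if detections and rng.random() < alpha:
            cx, cy = rng.choice(detections)  # detector-driven component
        else:
            cx, cy = prev[0] + velocity[0], prev[1] + velocity[1]  # motion component
        samples.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
    return samples


def count_near(samples, center, radius):
    """Number of samples within `radius` of `center`."""
    cx, cy = center
    return sum(1 for x, y in samples if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
samples = sample_mixed_proposal(
    prev=(100.0, 100.0),
    velocity=(2.0, 0.0),
    detections=[(160.0, 130.0)],  # assumed detector output after an abrupt jump
    alpha=0.5,
    sigma=5.0,
    n=1000,
    rng=rng,
)
near_detection = count_near(samples, (160.0, 130.0), 15.0)
near_prediction = count_near(samples, (102.0, 100.0), 15.0)
print(near_detection, near_prediction)
```

In a complete particle filter the importance weights would also have to be divided by the density of this mixture; that step is omitted here. The sketch only shows that part of the samples follows the detector even when the motion prediction is wrong.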
Another kind of method is “first detection and second connection” (references [7] and [8]). This kind of method has the potential to deal with the problem of tracking in low frame rate video, because it first performs full detection over the video (sometimes with tracking over short intervals) and then connects the detected objects or tracked fragments into a complete motion track according to motion smoothness or appearance similarity. Thus, it avoids both motion prediction and the necessary assumption that an object occupies adjacent positions in adjacent frames. The method, however, has two defects. First, it generally runs offline, because it requires comprehensive consideration of the whole track. Second, its speed can hardly meet real-time requirements, because a large amount of time-consuming detection is required; consequently, background differencing is usually adopted as the detector for higher speed, which again requires the video camera to be fixed.
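The connection step can be sketched with a simple greedy linker. This is a hypothetical illustration and not the procedure of reference [7] or [8]: the detection format `(x, y, appearance)`, the cost function, the weight `100.0`, and the threshold `max_cost` are assumptions. Each track is extended with the cheapest unused detection of the next frame, where the cost combines spatial distance and appearance difference.

```python
import math


def cost(a, b, appearance_weight=100.0):
    """Linking cost: spatial distance plus a weighted appearance difference."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) + appearance_weight * abs(a[2] - b[2])


def link_detections(frames, max_cost):
    """Greedily chain per-frame detections (x, y, appearance) into tracks."""
    tracks = [[d] for d in frames[0]]
    for dets in frames[1:]:
        unused = list(dets)
        for track in tracks:
            if not unused:
                break
            last = track[-1]
            best = min(unused, key=lambda d: cost(last, d))
            if cost(last, best) <= max_cost:
                track.append(best)
                unused.remove(best)
        # Detections that match no existing track start new tracks.
        tracks.extend([d] for d in unused)
    return tracks


# Assumed detector output for three frames; the order within a frame is arbitrary.
frames = [
    [(10, 10, 0.2), (100, 50, 0.8)],
    [(105, 52, 0.8), (14, 11, 0.2)],
    [(18, 12, 0.2), (110, 54, 0.8)],
]
tracks = link_detections(frames, max_cost=30.0)
for track in tracks:
    print(track)
```

The function receives the detections of all frames at once, which reflects why this kind of method is generally run offline.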
The above two kinds of methods share a common characteristic: both need a detector that is fast enough to be applied over a large area (in most cases, the whole image space). This is because in these methods the detector is only loosely coupled with the tracking algorithm.
Some other researchers adopt a multi-scale tracking algorithm, the fundamental idea of which is to construct an image pyramid from the input images so as to perform observation in different scale spaces (references [9], [10]). A larger spatial range can be covered when searching in a larger scale space, so that a target with abrupt motion can be processed. To handle the relationships between the observed quantities of different scales, G. Hua, et al. adopt a Markov network to model the state quantities of different scales (reference [9]), S. Birchfield directly adopts the result of the previous scale as the initial sample for the search on the next scale (reference [10]), and J. Sullivan, et al. design a layered sampling algorithm to combine the observation results of different scales (reference [11]). However, the multi-scale tracking algorithm essentially uses the same observation mode on every scale.
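A coarse-to-fine search over an image pyramid can be sketched as follows. The sketch is an assumption-laden illustration rather than the algorithm of any cited reference: the observation is simply pixel brightness, each pyramid level is built by 2x2 averaging, and the function names are hypothetical. The coarsest level is searched exhaustively, and each finer level is searched only in a small window around the position handed down from the previous level.

```python
def downsample(img):
    """Halve the resolution of a 2-D list image by averaging 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def build_pyramid(img, levels):
    """Return [full resolution, half resolution, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def argmax_in_window(img, center, radius):
    """Position (row, col) of the largest value within `radius` of `center`."""
    r0, c0 = center
    best = None
    for r in range(max(0, r0 - radius), min(len(img), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(img[0]), c0 + radius + 1)):
            if best is None or img[r][c] > img[best[0]][best[1]]:
                best = (r, c)
    return best


def coarse_to_fine(img, levels, radius):
    """Search the coarsest level fully, then refine locally on each finer level."""
    pyramid = build_pyramid(img, levels)
    coarse = pyramid[-1]
    pos = argmax_in_window(coarse, (0, 0), max(len(coarse), len(coarse[0])))
    for level in reversed(pyramid[:-1]):
        # A coarse pixel (r, c) corresponds to the 2x2 block starting at (2r, 2c).
        pos = argmax_in_window(level, (2 * pos[0], 2 * pos[1]), radius)
    return pos


# Assumed test image: a 4x4 bright blob with a single brighter peak at (21, 9).
image = [[0.0] * 32 for _ in range(32)]
for r in range(20, 24):
    for c in range(8, 12):
        image[r][c] = 1.0
image[21][9] = 2.0

found = coarse_to_fine(image, levels=3, radius=2)
print(found)
```

The same brightness score is used on every level, which corresponds to the observation made above that such algorithms apply the same observation mode on each scale.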
In addition, a recent trend in tracking research is that researchers increasingly introduce learning methods into the tracking algorithm. Some researchers propose that the tracking problem can be treated as a classification problem, in which the purpose of classification is to separate the tracked object from the background or from other objects. Representative work in this field includes S. Avidan's Ensemble Tracking (reference [12]) and J. Wang's online construction of a Haar feature classifier by using a particle filter (reference [14]), etc. This work indicates that learning methods greatly enhance the distinguishing capability of the tracker and improve the tracking performance.
As stated above, although there are many references on tracking research, the majority of the existing methods cannot be applied well to real-time tracking at a low frame rate. The existing methods neither run fast enough nor handle the discontinuity of target position and appearance caused by the low frame rate.
Tracking methods and detection methods have long been two opposite extremes: tracking methods are built on the hypothesis that various time sequences (including target position, appearance, etc.) are continuous, whereas detection methods independently distinguish and locate targets of specific classes in any environment without considering the context.
In a low frame rate video, the continuity of the time sequences of a target is weaker, and therefore the conventional tracking methods are inadequate. At the same time, full detection over the whole image space takes a lot of time, and detection alone cannot distinguish between different targets because it does not consider the time sequences of the video.
FIGS. 1(a) and 1(b) show examples of face tracking in 5 fps video by the conventional standard particle filter tracking method and by the Lucas-Kanade optical flow field tracking method, respectively; four consecutive frames are shown in each case. It can clearly be seen from FIG. 1 that, because the continuity of the time sequences of the target face is weaker, neither the standard particle filter tracking method nor the Lucas-Kanade optical flow field tracking method can track the target face well.