In a sequence of frames, i.e., a video, an object can be tracked by determining correspondences of features of the object from frame to frame. However, accurately tracking a deforming, non-rigid and fast moving object continues to be a difficult computer vision problem.
Tracking can be performed with a mean-shift operator, Comaniciu et al., “Real-time tracking of non-rigid objects using mean-shift,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 142-149, 2000, and U.S. Pat. No. 6,590,999 to Comaniciu et al. on Jul. 8, 2003, “Real-time tracking of non-rigid objects using mean-shift.” A nonparametric density gradient estimator is used to track an object that is most similar to a given color histogram. That method provides accurate localization. However, that method requires some overlap of the location of the object in consecutive frames, which will not be the case for fast moving objects, where the object might appear at totally different locations in two consecutive frames. Also, because histograms are used to determine likelihood, the gradient estimation and convergence become inaccurate when the object and background color distributions are similar.
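The overlap requirement can be illustrated with a minimal sketch. It is not the cited method: it assumes a uniform kernel and a precomputed per-pixel likelihood map (the hypothetical `weights` grid), whereas the cited tracker derives its weights from color histograms. The function name `mean_shift` and all parameter values are illustrative assumptions.

```python
import math

def mean_shift(weights, center, half_size, max_iter=20, eps=0.5):
    """Move a square window to the weighted centroid of the pixels it covers."""
    h, w = len(weights), len(weights[0])
    cy, cx = center
    for _ in range(max_iter):
        y0 = max(0, int(round(cy)) - half_size)
        y1 = min(h, int(round(cy)) + half_size + 1)
        x0 = max(0, int(round(cx)) - half_size)
        x1 = min(w, int(round(cx)) + half_size + 1)
        total = sum_y = sum_x = 0.0
        for y in range(y0, y1):
            for x in range(x0, x1):
                wgt = weights[y][x]
                total += wgt
                sum_y += wgt * y
                sum_x += wgt * x
        if total == 0.0:
            # The window does not overlap the object, so there is no gradient to follow.
            break
        new_cy, new_cx = sum_y / total, sum_x / total
        shift = math.hypot(new_cy - cy, new_cx - cx)
        cy, cx = new_cy, new_cx
        if shift < eps:
            break
    return cy, cx

# Assumed toy likelihood map: a 5x5 object blob centered at row 22, column 27.
weights = [[0.0] * 40 for _ in range(40)]
for y in range(20, 25):
    for x in range(25, 30):
        weights[y][x] = 1.0

near = mean_shift(weights, (18, 22), half_size=6)  # window partly overlaps the blob
far = mean_shift(weights, (5, 5), half_size=6)     # window has no overlap
```

In this toy setup the window that starts near the blob moves onto it, while the window that starts far away never moves, which mirrors the failure described above for fast moving objects.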
To overcome the overlap requirement, a multi-kernel mean-shift approach can be used, Porikli et al., “Object tracking in low-frame-rate video,” Proc. of SPIE/EI-Image and Video Communication and Processing, San Jose, Calif., 2005, and U.S. Patent Application 20060262959 by Tuzel et al. on Nov. 23, 2006, “Modeling low frame rate videos with Bayesian estimation.” The additional kernels are obtained by background subtraction. To resolve the above convergence issue, another kernel, which ‘pushes’ the object away from the background regions, can be adopted.
Tracking can be considered as estimation of the state given all the measurements up to that moment, or equivalently constructing the probability density function of the object location. A simple tracking approach is predictive filtering. This method uses object color and location statistics while updating an object model by constant weights, Wren et al., “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780-785, 1997, and U.S. Pat. No. 6,911,995 to Ivanov et al. on Jun. 28, 2005, “Computer vision depth segmentation using virtual surface.” An optimal solution is provided by a recursive Bayesian filter, which solves the problem in successive prediction and update steps.
When the measurement noise is assumed to be Gaussian distributed, one solution is provided by a Kalman filter, which is often used for tracking rigid objects, Boykov et al., “Adaptive Bayesian recognition in tracking rigid objects,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 697-704, 2000, and Rosales et al., “A framework for heading-guided recognition of human activity,” Computer Vision and Image Understanding, volume 91, pages 335-367, 2003. The Kalman filter is confined to predefined state transition parameters that control a ‘viscosity’ of motion properties of the object.
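The prediction and update steps of a Kalman filter can be sketched as follows. This is a minimal illustration rather than the cited methods: it assumes a one-dimensional constant-velocity model, and the matrices `F`, `H`, `Q`, and `R` and their values are assumptions chosen for the example. The fixed transition matrix `F` and process noise `Q` play the role of the predefined state transition parameters mentioned above.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle for a linear Gaussian state-space model."""
    # Prediction: propagate the state and its uncertainty through the motion model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement z.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

dt = 1.0                                  # one frame per step (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])     # state is [position, velocity]
H = np.array([[1.0, 0.0]])                # only the position is measured
Q = 0.01 * np.eye(2)                      # small process noise: a 'stiff' motion model
R = np.array([[1.0]])                     # measurement noise variance

x = np.array([0.0, 0.0])
P = 10.0 * np.eye(2)
for t in range(1, 21):
    z = np.array([2.0 * t])               # object moving 2 pixels per frame
    x, P = kalman_step(x, P, z, F, H, Q, R)
```

A larger `Q` would let the estimate react faster to sudden motion changes at the cost of noisier tracks; a smaller `Q` makes the filter follow the assumed motion model more rigidly.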
When the state space is discrete and consists of a finite number of states, Markovian filters can be applied for object tracking. The most general class of filters is represented by particle filters, which are based on Monte Carlo integration methods. A current density of a particular state is represented by a set of random samples with associated weights. A new density is then based on the weighted samples.
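The weighted-sample representation can be sketched as a single filtering step. This is a minimal one-dimensional illustration under assumed Gaussian motion and measurement models; the function name `particle_filter_step` and all numeric values are hypothetical, and multinomial resampling is used only because it is the simplest choice.

```python
import math
import random

def particle_filter_step(particles, measurement, motion_std, meas_std, rng):
    """Predict, weight, and resample a set of 1-D position hypotheses."""
    # Predict: diffuse each sample with random motion noise.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each sample.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # The weighted samples approximate the current density; their mean is one estimate.
    estimate = sum(w * p for w, p in zip(weights, predicted))
    # Resample: draw the new sample set in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, weights, estimate

rng = random.Random(0)
particles = [rng.uniform(0.0, 100.0) for _ in range(500)]
particles, weights, estimate = particle_filter_step(
    particles, measurement=42.0, motion_std=2.0, meas_std=3.0, rng=rng
)
```

After resampling, most of the 500 samples are copies of the few that had high weight, which is the impoverishment effect that becomes severe in high dimensional state spaces.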
Particle filters can be used to recover conditional density propagation for visual tracking and verification. Generally, particle filtering is based on random sampling, which is a problematic issue due to sample degeneracy and impoverishment, especially for high dimensional problems. A kernel based Bayesian filter can be used for sampling a state space more effectively. A multiple hypothesis filter evaluates a probability that a moving object gave rise to a certain measurement sequence.
One problem is that all of the above filter based methods can easily ‘get stuck’ in a local optimum. Another concern is that most prior art methods lack a competent similarity criterion that expresses both statistical and spatial properties. Most prior art methods depend either only on color distributions or only on structural models.
Many different representations, from aggregated statistics to appearance models, have been used for tracking objects. Histograms are popular because normalized histograms closely resemble a probability density function of the modeled data. However, histograms do not consider spatial arrangement of the feature values. For instance, randomly rearranging pixels in an observation window yields the same histogram. Moreover, constructing higher dimensional histograms with a small number of pixels is a major problem.
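The loss of spatial arrangement can be checked directly. The sketch below uses an assumed 8x8 grayscale window, flattened row by row, and a hypothetical helper `normalized_histogram`; the bin count is arbitrary.

```python
import random

def normalized_histogram(pixels, bins=8, max_value=256):
    """Fraction of pixels that fall into each of `bins` equal-width intensity bins."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [c / len(pixels) for c in counts]

# Flattened 8x8 window: top four rows dark, bottom four rows bright.
window = [0] * 32 + [255] * 32

# Randomly rearrange the same pixels, destroying the spatial structure.
rng = random.Random(0)
shuffled = window[:]
rng.shuffle(shuffled)

hist_original = normalized_histogram(window)
hist_shuffled = normalized_histogram(shuffled)
same_histogram = hist_original == hist_shuffled
```

Here `same_histogram` is true even though the two windows look entirely different, so a purely histogram based likelihood cannot distinguish them.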
Appearance models map image features, such as shape and texture, onto a uniform sized window of tensors. Because of the exponential complexity, only a relatively small number of features can be used. Thus, each feature must be highly discriminant. The reliability of the features strictly depends on the object type. Appearance models tend to be highly sensitive to scale variations, and are also pose dependent.
Tracking, that is, finding regions corresponding to an object in a sequence of frames, faces similar challenges. Objects frequently change their appearance and pose. The objects can be occluded partially or completely, or objects can merge and split. Depending on the application, objects can exhibit erratic motion patterns, and often make sudden turns.
Tracking can also be considered as a classification problem and a classifier can be trained to distinguish the object from the background, Avidan, “Ensemble tracking,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, Calif., 2005, and U.S. Patent Application 20060165258 by Avidan, filed Jul. 27, 2006, “Tracking objects in videos with adaptive classifiers.” This is done by constructing a feature vector for every pixel in the reference image and training a classifier to separate pixels that belong to the object from pixels that belong to the background. Integrating classifiers over time improves the stability of the tracker in cases of illumination changes. As with mean-shift, an object can be tracked only if its motion is small. This method can confuse objects in case of an occlusion.
Object representation, which is how to convert color, motion, shape, and other properties into a compact and identifiable form such as a feature vector, plays a critical role in tracking. Conventional trackers depend either only on color histograms, which disregard the structural arrangement of pixels, or only on appearance models, which ignore the statistical properties. There are several shortcomings of these representations. Populating higher dimensional histograms with a small number of pixels results in an incomplete representation. Besides, histograms are easily distorted by noise. Appearance models are sensitive to scale changes and localization errors.
Covariance matrix representation embodies both spatial and statistical properties of objects, and provides an elegant solution to fusion of multiple features, Tuzel et al., “Region covariance: A fast descriptor for detection and classification,” Proc. 9th European Conf. on Computer Vision, Graz, Austria, 2006. Covariance is a measure of how much the deviations of two or more variables or processes match. In tracking, these variables correspond to point features such as coordinate, color, gradient, orientation, and filter responses. This representation has a much lower dimensionality than histograms. The representation is robust against noise and lighting changes. To track objects using the covariance descriptor, an eigenvector based distance metric is adopted to compare the matrices of object and candidate regions. A covariance tracker does not make any assumption on the motion. This means that the tracker can keep track of objects even if their motion is erratic and fast. It can compare any regions without being restricted to a constant window size. In spite of these advantages, the computation of the covariance matrix distance for all candidate regions is slow and requires exponential time.
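A covariance descriptor and an eigenvalue based dissimilarity can be sketched as follows. This is an illustration under stated assumptions, not the cited implementation: the feature set (pixel coordinates, intensity, and absolute gradients), the small regularization term, and the helper names are all chosen for the example. The dissimilarity is the square root of the sum of squared logarithms of the generalized eigenvalues of the two matrices, computed here by whitening one matrix with the Cholesky factor of the other.

```python
import numpy as np

def region_covariance(image, top, left, height, width):
    """Covariance of per-pixel features [x, y, I, |dI/dx|, |dI/dy|] inside a region."""
    dy, dx = np.gradient(image.astype(float))
    ys, xs = np.mgrid[top:top + height, left:left + width]
    patch = (slice(top, top + height), slice(left, left + width))
    features = np.stack([
        xs.ravel(), ys.ravel(),
        image[patch].ravel(),
        np.abs(dx[patch]).ravel(),
        np.abs(dy[patch]).ravel(),
    ], axis=1).astype(float)
    cov = np.cov(features, rowvar=False)
    # Small ridge (an assumption) keeps the matrix positive definite.
    return cov + 1e-6 * np.eye(cov.shape[0])

def covariance_distance(c1, c2):
    """sqrt(sum(ln^2 lambda_i)) over the generalized eigenvalues of (c1, c2)."""
    inv_chol = np.linalg.inv(np.linalg.cholesky(c2))
    whitened = inv_chol @ c1 @ inv_chol.T   # symmetric, same eigenvalues as inv(c2) @ c1
    eigenvalues = np.linalg.eigvalsh(whitened)
    return float(np.sqrt(np.sum(np.log(eigenvalues) ** 2)))

# Assumed toy image: low-contrast texture on the left, high-contrast on the right.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 50.0, (64, 64))
image[:, 32:] = rng.uniform(0.0, 255.0, (64, 32))

target = region_covariance(image, 8, 8, 16, 16)
similar = region_covariance(image, 40, 8, 16, 16)    # same texture, other location
different = region_covariance(image, 8, 40, 16, 16)  # other texture

d_similar = covariance_distance(target, similar)
d_different = covariance_distance(target, different)
```

Each region is summarized by a 5x5 matrix regardless of its size, which is why regions of different sizes can be compared, and the region with the same texture yields the smaller dissimilarity in this toy setup.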
An integral image based method, which requires constant time, can improve the speed, Porikli, “Integral histogram: A fast way to extract histograms in Cartesian spaces,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, Calif., vol. 1, pp. 829-836, 2005, and U.S. Patent Application 20060177131 by Porikli on Aug. 10, 2006, “Method of extracting and searching integral histograms of data samples.” This technique significantly accelerates the covariance matrix extraction process by taking advantage of the spatial arrangement of the points.
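The constant-time property can be illustrated with a plain integral image of a single channel. This is only the basic building block: the covariance extraction described above would need one such table per feature and per pair of features, which is not shown. The function names and the toy image are assumptions.

```python
def integral_image(img):
    """Table where ii[y][x] is the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def region_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 using four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

# Assumed toy image with 5 rows and 6 columns.
img = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]
ii = integral_image(img)

fast = region_sum(ii, 1, 2, 4, 5)
brute = sum(img[y][x] for y in range(1, 4) for x in range(2, 5))
```

The table is built once in a single pass over the image; afterwards the sum over any rectangular candidate region costs four lookups, independent of the region size.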
Like many vision tasks, object detection and tracking also benefit from specific hardware implementations. Such implementations contain various combinations of different subsystems such as conventional digital signal processors (DSP), graphic processor units (GPU), field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), and other reconfigurable cores. DSPs offer software programmability, which is cost-effective. With a programmable DSP architecture, it is possible to speed up fundamental low-level algorithms. On the other hand, ASICs offer a high performance, low power, and low cost option for implementing methods, but supporting different tracking methods requires an expanding number of ASICs, leading to larger devices, greater power consumption, and higher cost. GPUs also allow construction of economical and parallel architectures. Several computationally intensive processes, including contrast enhancement, color conversion, edge detection, and feature point tracking, can be offloaded to GPUs. FPGAs enable large-scale parallel processing and pipelining of data flow. FPGAs provide significant on-chip RAM and support high clock speeds. However, current on-chip RAMs are not sufficient to support a useful level of internal RAM frame buffering in object detection and tracking. Therefore, additional external memory is required to provide storage during processing of image data. The high I/O capability of FPGAs supports access to multiple RAM banks simultaneously, enabling effective and efficient pipelining.
Tracking methods have numerous issues to overcome. Likelihood score computation between the object and candidate regions is a bottleneck. Tracking methods employing histograms become more demanding as the histogram size increases. Some histogram distance metrics, e.g., Bhattacharyya and Kullback-Leibler (KL), are inherently complex. For covariance tracking, the likelihood computation requires extraction of eigenvectors, which is slow. Fast likelihood computation methods can significantly improve the computational speed.
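As an illustration of the per-candidate cost, a Bhattacharyya based histogram distance can be sketched as below. The form sqrt(1 - BC), where BC is the Bhattacharyya coefficient, is one common choice and is assumed here; the function names and example histograms are hypothetical.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i) over the bins of two normalized histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """Distance in [0, 1]; 0 for identical histograms, 1 for disjoint ones."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

object_hist = [0.5, 0.3, 0.2, 0.0]
close_hist = [0.4, 0.4, 0.2, 0.0]
disjoint_hist = [0.0, 0.0, 0.0, 1.0]

d_same = bhattacharyya_distance(object_hist, object_hist)
d_close = bhattacharyya_distance(object_hist, close_hist)
d_disjoint = bhattacharyya_distance(object_hist, disjoint_hist)
```

Every candidate region requires one square root per bin, so the cost grows with both the histogram size and the number of candidates.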
Complexity is proportional to the number of candidate regions, or the search region size. Hierarchical search methods can be applied to accelerate the tracking process. Localized search methods such as mean-shift and ensemble tracking become slower as the object size becomes larger. Adaptive scaling of the kernels and images without destroying the salient information can be adopted to achieve real-time performance. Kernel based tracking methods become more demanding as the number of objects increases. Global search methods can be applied for applications that require tracking of a multitude of objects. Therefore, there is a need for tracking objects under uncontrollable conditions.