Real-time three-dimensional (3D) object pose tracking is used in many computer vision applications, such as Human-Computer Interaction (HCI) and Augmented Reality (AR). The problem of estimating the rigid pose transformation that relates a two-dimensional (2D) image to known 3D geometry has been studied intensively. Common closed-form solutions need three or four 2D-to-3D point correspondences to estimate the pose. However, because these solutions are based on finding the roots of high-degree polynomial equations and do not exploit redundancy in the data, the estimation result is susceptible to noise. Nonlinear optimization-based methods apply the Gauss-Newton or Levenberg-Marquardt algorithm to the pose estimation problem. These methods rely on a good initial guess to converge to a correct solution and are generally slow to converge. The conventional iterative linear method was developed by exploiting the specific geometric structure of the pose estimation problem during optimization. Techniques based on this method have a low computational cost, which is appealing for real-time processing. However, all of the above conventional techniques are based solely on point correspondences, which makes reliable point correspondence critical for pose tracking.
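The optimization-based approach described above can be illustrated with a minimal Gauss-Newton sketch. To stay short, it assumes a reduced planar analogue of the problem: a 1D camera with unit focal length observing known 2D model points, so the pose has only three parameters (theta, tx, tz) instead of six. The function names, model points, and poses are hypothetical and chosen only for illustration; this is not the method of any particular prior-art system.

```python
import math

def project(pose, point):
    """Project a planar model point (x, z) to a 1D image coordinate (focal length 1)."""
    theta, tx, tz = pose
    x, z = point
    xc = math.cos(theta) * x - math.sin(theta) * z + tx  # camera-frame lateral offset
    zc = math.sin(theta) * x + math.cos(theta) * z + tz  # camera-frame depth
    return xc / zc, xc, zc

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    solution = []
    for k in range(3):
        m = [row[:] for row in a]
        for i in range(3):
            m[i][k] = b[i]
        solution.append(det3(m) / d)
    return solution

def gauss_newton_pose(points, observations, initial_pose, iterations=20):
    """Refine (theta, tx, tz) by minimizing the squared reprojection error."""
    pose = list(initial_pose)
    for _ in range(iterations):
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        for point, u_obs in zip(points, observations):
            u, xc, zc = project(pose, point)
            dxc = -(zc - pose[2])  # derivative of xc with respect to theta
            dzc = xc - pose[1]     # derivative of zc with respect to theta
            row = [(dxc * zc - xc * dzc) / zc ** 2, 1.0 / zc, -xc / zc ** 2]
            residual = u_obs - u
            for i in range(3):
                jtr[i] += row[i] * residual
                for j in range(3):
                    jtj[i][j] += row[i] * row[j]
        step = solve3(jtj, jtr)  # normal equations: (J^T J) step = J^T r
        pose = [p + s for p, s in zip(pose, step)]
    return pose

# Hypothetical model points and ground-truth pose (noise-free observations).
model = [(-1.0, 0.2), (-0.5, -0.3), (0.0, 0.4), (0.5, -0.1), (1.0, 0.3), (0.2, 0.0)]
true_pose = (0.3, 0.2, 5.0)
observations = [project(true_pose, p)[0] for p in model]
estimate = gauss_newton_pose(model, observations, initial_pose=(0.2, 0.1, 4.5))
```

The sketch uses six correspondences for three unknowns, so the normal equations average over redundant data, unlike a minimal closed-form solver. The initial pose is deliberately placed near the true pose, reflecting the dependence on a good initial guess noted above.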
Conventional methods for temporal pose tracking can be divided into two groups. Methods in the first group estimate the incremental pose changes between neighboring frames by registering a model with the image directly. This presupposes either that there are known model features whose image projection can be determined, or that there is a template image with a known pose, so that registration between the template and the current image can be carried out. The main drawback is that fixed model features can be unstable in the event of visual occlusion of the tracked object or a change in facial expression. Further, the appearance change between the template and the current image can be substantial due to varying illumination levels, which makes the registration between them difficult. The second group consists of differential tracking techniques, which estimate incremental pose changes via incremental motion estimation between neighboring frames. These techniques can make use of essentially arbitrary features on a model surface and do not have to model the more complex global appearance change. The main problem with these techniques is their differential character, which makes them suffer from accumulated drift. This drift limits their effectiveness in long video sequences.
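The accumulated drift of differential tracking can be made concrete with a small sketch. It assumes a planar pose (theta, x, y) rather than a full 3D pose, and it models the per-frame estimation error as a constant rotation bias; both are simplifying assumptions, and the names `compose`, `position_drift`, and `rotation_bias` are hypothetical.

```python
import math

def compose(pose, increment):
    """Apply a body-frame increment (dtheta, dx, dy) to a planar pose (theta, x, y)."""
    theta, x, y = pose
    dtheta, dx, dy = increment
    return (theta + dtheta,
            x + math.cos(theta) * dx - math.sin(theta) * dy,
            y + math.sin(theta) * dx + math.cos(theta) * dy)

def position_drift(true_increments, rotation_bias):
    """Chain frame-to-frame increments and record the position error per frame."""
    true_pose = est_pose = (0.0, 0.0, 0.0)
    drift = []
    for dtheta, dx, dy in true_increments:
        true_pose = compose(true_pose, (dtheta, dx, dy))
        # Assumption: each estimated increment is off by a small constant rotation.
        est_pose = compose(est_pose, (dtheta + rotation_bias, dx, dy))
        drift.append(math.hypot(est_pose[1] - true_pose[1],
                                est_pose[2] - true_pose[2]))
    return drift

# 100 frames of straight-line motion, one unit per frame, 0.001 rad error per frame.
drift = position_drift([(0.0, 1.0, 0.0)] * 100, rotation_bias=0.001)
```

Because every estimated pose is built on the previous one, the error is never corrected. In this toy setting the position error keeps growing with the frame count even though the per-frame error stays fixed at 0.001 rad.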
Key-frames can be used to reduce motion drift in the above differential techniques. One conventional algorithm fuses online and offline key-frame information to achieve stable real-time tracking performance. Some limitations remain, however. Firstly, in the case of agile motion (i.e., quick, often aperiodic movement), feature point matching between neighboring frames becomes unreliable and can cause the tracker to fail. Secondly, when the key-frames are themselves obtained online, they can carry inherent drift, and this drift error can propagate. Thirdly, the fusion of the previous online information with information from only one key-frame is performed in a merely heuristic manner that cannot guarantee optimal performance in the presence of image uncertainties, such as occlusion, illumination change, expression change, rapid or agile motion, and macroscopic scale change.
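The effect of key-frames on drift, and the propagation of error from online key-frames, can be sketched with a toy one-parameter tracker. This is an illustrative model only, not the conventional fusion algorithm mentioned above: the pose is a single angle, the per-frame error is a constant `bias`, and re-registration against a key-frame is modeled as resetting the estimate to the true angle plus a fixed `keyframe_error`. All names and values are hypothetical.

```python
def track(true_increments, bias, keyframe_interval=None, keyframe_error=0.0):
    """Differential tracking of one angle with optional periodic key-frame resets."""
    true_angle = est_angle = 0.0
    drift = []
    for frame, increment in enumerate(true_increments, start=1):
        true_angle += increment
        est_angle += increment + bias  # frame-to-frame estimate with a small error
        if keyframe_interval and frame % keyframe_interval == 0:
            # Assumption: matching a key-frame re-anchors the estimate, up to the
            # pose error stored with the key-frame itself.
            est_angle = true_angle + keyframe_error
        drift.append(abs(est_angle - true_angle))
    return drift

motion = [0.05] * 200
free_running = track(motion, bias=0.001)
offline_keyframes = track(motion, bias=0.001, keyframe_interval=25)
online_keyframes = track(motion, bias=0.001, keyframe_interval=25,
                         keyframe_error=0.01)
```

In this model the free-running tracker's error grows without bound, the tracker with drift-free (offline) key-frames has its error bounded by what accumulates between two key-frames, and the tracker whose key-frames carry their own pose error inherits that error at every reset.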