Registration and alignment of images taken by cameras at different spatial locations and orientations within the same environment is vital to many applications in computer vision and medical imaging. For example, registration between images taken by a mobile camera and those from a fixed surveillance camera can assist in robot navigation. Other applications include constructing image mosaics and panoramas, high dynamic range images, and super-resolution images, as well as fusing information from the two image sources.
However, because the structure of a scene is inherently lost by the 2D imaging of a 3D scene, only partial registration information can typically be recovered. In many applications, depth maps can be generated or estimated to accompany the images in order to reintroduce the structure to the registration problem.
Most currently available 2D alignment algorithms use a gradient descent approach which relies on three things: a parameterization of the spatial relationship between two images (e.g., the 2D rotation and translation between two 2D images), the ability to visualize these images under any value of the parameters (e.g., viewing a 2D reference image rotated by 30 degrees), and a cost function with associated image gradient information which allows an estimate of the parameter updates to be calculated. Among the earliest and most straightforward of these algorithms is the Lucas-Kanade algorithm, which casts image alignment as a Gauss-Newton minimization problem [5]. A subsequent refinement is the inverse compositional alignment algorithm, which greatly speeds the computation of the parameter update by recasting the problem so that all gradient and Hessian information is calculated once instead of at every iteration [6]. Several other improvements have centered on the choice of parameters and the image warps that the corresponding parameterizations induce. For example, images obtained from two identical cameras observing the same scene from different locations can be approximately related by an affine transformation or an 8-parameter homography [7].
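The three ingredients listed above (a parameterization, a way to warp the image under any parameter value, and a gradient-based update) can be illustrated with a minimal sketch of forward-additive Lucas-Kanade alignment. The sketch assumes the simplest possible parameterization, a pure 2D translation p = (dx, dy), and uses NumPy only; the function names and the synthetic Gaussian test image are illustrative assumptions and are not taken from [5] or [6].

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample img at float (row, col) positions, clamping to the image border."""
    h, w = img.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    fy, fx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bottom = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def lucas_kanade_translation(template, image, n_iters=50, tol=1e-4):
    """Estimate p = (dx, dy) such that image(x + dx, y + dy) ~= template(x, y)."""
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    p = np.zeros(2)
    for _ in range(n_iters):
        # Visualize the image under the current parameter estimate.
        warped = bilinear_sample(image, ys + p[1], xs + p[0])
        # For a pure translation, the Jacobian columns are the image gradients.
        gy, gx = np.gradient(warped)
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)
        error = (template - warped).ravel()
        # Gauss-Newton step: solve (J^T J) dp = J^T error.
        dp = np.linalg.solve(J.T @ J, J.T @ error)
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p

def blob(cx, cy, size=64, sigma=8.0):
    """Synthetic smooth test image: a Gaussian blob centered at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

template = blob(32.0, 32.0)
image = blob(34.5, 30.5)  # same content, displaced by (dx, dy) = (2.5, -1.5)
p_est = lucas_kanade_translation(template, image)
```

In this forward-additive form the gradients and the approximate Hessian J^T J are recomputed at every iteration. The inverse compositional variant mentioned above avoids that cost by evaluating them once on the template.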
The main problem with these types of parameterizations is that they do not truly capture the physically relevant parameters of the system, and, in the case of the homography, can lead to overfitting of the image. A more recent choice of parameters attempts to match two images obtained from a camera that can have arbitrary 3D rotations around its focal point [8]. This algorithm succeeds in extracting the physically relevant parameters (rotation angles about the focal point). However, while it is able to handle small translations, it cannot handle general translation and treats it as a source of error.
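To make the contrast between an 8-parameter homography and a physically meaningful parameterization concrete: for an ideal pinhole camera with intrinsic matrix K, a pure rotation R about the optical center induces the homography K R K^-1, which has only three free parameters once K is fixed. The sketch below illustrates this standard relation. It is not necessarily the exact formulation used in [8], and the intrinsic values are made-up assumptions.

```python
import numpy as np

def rotation_homography(K, R):
    """Homography induced by rotating a pinhole camera about its optical center."""
    return K @ R @ np.linalg.inv(K)

def rot_z(theta):
    """Rotation by theta radians about the optical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

H = rotation_homography(K, rot_z(np.deg2rad(10.0)))

# A roll about the optical axis leaves the principal point fixed.
center = H @ np.array([320.0, 240.0, 1.0])
center /= center[2]
```

Estimating the three rotation angles directly, rather than all eight homography entries, restricts the search to warps that a rotating camera can actually produce.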
Little has been done to tackle the problem of registration of two images generated by cameras related by a general rigid transformation (i.e., 3D rotation and translation). The main reason for this is that the accurate visualization of a reference image as seen from a different camera location ideally requires that the depth map associated with that image be known—something which is not generally true. In certain situations, such as a robot operating in a known man-made environment, or during bronchoscopy where 3D scans are typically performed before the procedure, this information is known. Indeed, even in situations where the depth map is unknown, it can often be estimated from the images themselves.
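The role of the depth map can be seen in a minimal sketch of how reference pixels move under a general rigid transformation. The sketch assumes an ideal pinhole camera with known intrinsics K, a depth map that stores distance along the optical axis, and the convention X_new = R X_ref + t; these conventions and the function name are illustrative assumptions. It computes only where each reference pixel lands in the new view; a full renderer would also resample intensities and handle occlusions.

```python
import numpy as np

def reproject_with_depth(depth, K, R, t):
    """Map each reference pixel to its position in a camera moved by (R, t).

    depth : (h, w) array of distances along the optical axis (assumed convention)
    K     : (3, 3) pinhole intrinsic matrix, shared by both views
    R, t  : rigid transform taking reference-camera points to the new camera
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)  # 3 x N
    rays = np.linalg.inv(K) @ pixels         # back-projected rays with z = 1
    points = rays * depth.ravel()            # 3D points in the reference frame
    moved = R @ points + t.reshape(3, 1)     # apply the rigid transformation
    projected = K @ moved                    # project into the new view
    u_new = projected[0] / projected[2]
    v_new = projected[1] / projected[2]
    return u_new.reshape(h, w), v_new.reshape(h, w)
```

Without the depth values, the translation term t cannot be applied, because the image displacement it causes scales inversely with depth. This is why translation cannot be handled by image-only parameterizations.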
An example of this is the aforementioned shape-from-shading problem in bronchoscopy guidance [9]. Current practice requires a physician to guide a bronchoscope from the trachea to some predetermined location in the airway tree with little more than a 3D mental image of the airway structure, which must be constructed from the physician's interpretation of a set of computed tomography (CT) films. This complex task can often result in the physician getting lost within the airway during navigation [1]. Such navigation errors can result in missed diagnoses or cause undue stress to the patient, since the physician may take multiple biopsies at incorrect locations or may need to spend extra time returning to known locations in order to reorient.
In order to alleviate this problem and increase the success rate of bronchoscopic biopsy, thereby improving patient care, some method of locating the camera within the airway tree must be employed. Fluoroscopy can provide intraoperative views which can help determine the location of the endoscope. However, as the images created are 2D projections of the 3D airways, they can only give limited information about the endoscope position. Additionally, fluoroscopy is not always available and comes with the added cost of an increased radiation dose to the patient.
A few techniques also exist that determine the bronchoscope's location by attempting to match the bronchoscope's video to the preoperative CT data. One method uses shape-from-shading, as in [2], to estimate 3D surfaces from the bronchoscope images and then performs 3D-to-3D alignment with the CT airway surface. This method requires many assumptions about the lighting model and the airway surface properties, and it produces large surface errors when these assumptions are violated. A second method iteratively renders virtual images from the CT data and attempts to match them to the real bronchoscopic video using mutual information [3] or image difference [4].
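For illustration, the mutual information similarity measure used by such render-and-compare methods can be estimated from a joint intensity histogram. The following is a minimal sketch under assumed settings (32 histogram bins, natural logarithm); it is not the specific estimator used in [3].

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based estimate of the mutual information (in nats) of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of b, shape (1, bins)
    nonzero = pxy > 0                         # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))
```

Because the measure depends only on the statistical relationship between intensities, it can remain high when the rendered and real images differ by an intensity remapping, which is one reason it is attractive for comparing CT renderings with video. Evaluating it requires rendering a new virtual image for every candidate pose, so a search driven by this score alone is expensive.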
While these methods can register the video to the CT with varying degrees of success, all operate very slowly and perform only single-frame registration; none is fast enough to provide continuous registration between the real video and the CT volume. They rely on optimization methods that use neither the gradient information nor the known depth of the CT-derived images, and thus require computationally intensive searches of the parameter space.