Stereo photography technology is a significant improvement over conventional video capture. It aims to present a three-dimensional browsing effect by applying a series of processing steps to the two-dimensional image sequence obtained during image capture. The capture result is therefore more than a video: in addition to passively watching a target object from the capture angle, a viewer may actively adjust the view angle and watch from different directions.
Usually, a structure from motion technology may be used to recover the camera parameters corresponding to each image. The camera parameters include an intrinsic matrix $K$ and motion parameters $[R|T]$ of a camera. $R$ is a 3×3 rotation matrix indicating the orientation of the camera. $T$ is a three-dimensional translation vector indicating the translation of the camera in the scene. Any three-dimensional point $X$ in the scene may be projected to a point $x$ in the image by using the camera parameters:
$$\tilde{x} = K(RX + T).$$
$\tilde{x}$ is the homogeneous coordinate of the two-dimensional point $x$, that is, $\tilde{x} \sim (x^\top, 1)^\top$ up to a nonzero scale factor. This projection relationship is represented by a projection function $\pi$:
$$x = \pi(K, R, T, X).$$
If there are sufficient common points in different images, both the camera parameters corresponding to each frame and the three-dimensional positions of all scene points may be recovered by minimizing an energy function:
$$\arg\min_{K_i,\, R_i,\, T_i,\, X_j} \sum_i \sum_j v_{ij} \left\| x_{ij} - \pi(K_i, R_i, T_i, X_j) \right\|^2 .$$
$(K_i, R_i, T_i)$ are the camera parameters of the $i$th frame, and $X_j$ is the position of the $j$th three-dimensional point. $x_{ij}$ is the image position of the $j$th three-dimensional point in the $i$th frame. If the $j$th three-dimensional point is visible in the $i$th frame, $v_{ij} = 1$; otherwise, $v_{ij} = 0$.
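The projection function and the energy function above can be illustrated with a minimal sketch. It is not the implementation of any particular system: the function names, the list-of-rows matrix layout, and the toy camera values are assumptions made for illustration, and a missing observation stands in for a visibility flag of 0.

```python
def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def project(K, R, T, X):
    """Projection function pi(K, R, T, X): 3D scene point -> 2D image point."""
    cam = [a + b for a, b in zip(mat_vec(R, X), T)]  # RX + T (camera coordinates)
    h = mat_vec(K, cam)                              # homogeneous image point
    return (h[0] / h[2], h[1] / h[2])                # divide out the scale factor


def reprojection_energy(cameras, points, observations):
    """Sum of squared reprojection errors over all visible (i, j) pairs.

    cameras:      list of (K, R, T) tuples, one per frame i
    points:       list of 3D points X_j
    observations: dict mapping (i, j) -> observed 2D point x_ij;
                  an absent key plays the role of visibility flag 0
    """
    energy = 0.0
    for (i, j), (u, v) in observations.items():
        K, R, T = cameras[i]
        pu, pv = project(K, R, T, points[j])
        energy += (u - pu) ** 2 + (v - pv) ** 2
    return energy


# Hypothetical toy data: one camera at the origin looking along +Z.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = [0.0, 0.0, 0.0]
X = [1.0, 2.0, 10.0]

projected = project(K, R, T, X)  # (60.0, 70.0)
# An observation that is one pixel off in u contributes an energy of 1.0.
energy = reprojection_energy([(K, R, T)], [X], {(0, 0): (61.0, 70.0)})
```

A solver would search over the camera parameters and point positions to drive this energy down; the sketch only evaluates it for fixed values.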
The algorithm requires feature points to be automatically extracted from the images, and requires image feature points that are in different images and that correspond to a same scene point to be matched. Specifically, SIFT (scale-invariant feature transform) feature points are extracted from each image, and a 64-dimensional vector, referred to as a feature description vector, is calculated for each SIFT feature point. The feature description vector encodes image information about the surroundings of the feature point. In different images, the feature description vectors corresponding to a same scene point are close to each other. Therefore, the Euclidean distance between feature description vectors may be calculated to match image feature points that are in different images and that correspond to a same scene point. In addition, the matched points between every two images need to satisfy an epipolar geometry constraint. Therefore, mismatches may be removed based on this constraint by using a RANSAC (random sample consensus) method.
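The matching step can be sketched as follows, under stated assumptions. Descriptors are plain lists of numbers, `max_dist` and `tol` are hypothetical thresholds that do not come from the text, and the fundamental matrix `F` is taken as given, whereas a RANSAC procedure would estimate it from random samples of the candidate matches. The sketch only shows nearest-neighbor matching by Euclidean distance followed by a check of the epipolar residual.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def match_descriptors(desc_a, desc_b, max_dist):
    """For each descriptor of image A, take its nearest descriptor in image B.

    max_dist is a hypothetical acceptance threshold.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = [euclidean(da, db) for db in desc_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= max_dist:
            matches.append((i, j))
    return matches


def epipolar_residual(F, p1, p2):
    """Absolute value of p2~^T F p1~ for pixel points p1 (image A), p2 (image B)."""
    h1 = (p1[0], p1[1], 1.0)
    h2 = (p2[0], p2[1], 1.0)
    Fh1 = [sum(F[r][c] * h1[c] for c in range(3)) for r in range(3)]
    return abs(sum(h2[r] * Fh1[r] for r in range(3)))


def filter_by_epipolar(matches, pts_a, pts_b, F, tol):
    """Keep only the matches whose epipolar residual is within tol."""
    return [(i, j) for i, j in matches
            if epipolar_residual(F, pts_a[i], pts_b[j]) <= tol]


# Hypothetical toy data (4-dimensional descriptors for brevity).
desc_a = [[0, 0, 0, 0], [1, 1, 1, 1], [5, 5, 5, 5]]
desc_b = [[1.1, 1, 1, 1], [0, 0.1, 0, 0], [9, 9, 9, 9]]
pts_a = [(10.0, 20.0), (30.0, 40.0), (0.0, 0.0)]
pts_b = [(25.0, 47.0), (12.0, 20.0), (0.0, 0.0)]

# F for a purely horizontal camera shift: matched points must share a row.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]

candidates = match_descriptors(desc_a, desc_b, max_dist=0.5)
inliers = filter_by_epipolar(candidates, pts_a, pts_b, F, tol=1.0)
```

In this toy data the second candidate pair has close descriptors but lies on different rows, so the epipolar check rejects it as a mismatch.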
Subsequently, according to the feature matching result, a progressive (incremental) structure from motion technology is used to recover the motion parameters corresponding to each image and the positions of sparse three-dimensional points in the scene. For example, a system selects an image pair having a relatively large quantity of common points and a relatively long baseline, estimates the relative positions of the cameras of the two frames by using a five-point method, and estimates the three-dimensional positions of the common points of the two frames by using a triangulation algorithm. For each remaining frame, if sufficient three-dimensional points whose positions have been recovered are visible in the frame, the camera parameters corresponding to the frame are estimated by using an efficient perspective-n-point (EPnP) algorithm, and three-dimensional points in the frame whose positions have not yet been recovered are added to the scene by using the triangulation algorithm. This step is iterated until all frames are processed. To eliminate error accumulation, after each iteration, a bundle adjustment technique may be used to jointly optimize all recovered camera parameters and three-dimensional point clouds.
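The control flow of this progressive reconstruction can be simulated with a short sketch. It performs no geometry: pose estimation, triangulation, and bundle adjustment are only marked by comments. The function name, the `min_visible` threshold, and the rule that a point counts as recovered once two registered frames observe it are assumptions made for illustration, and the seed pair is supplied by the caller rather than selected by common-point count and baseline.

```python
def registration_order(visibility, seed_pair, min_visible):
    """Simulate the order in which frames are registered.

    visibility:  dict frame_id -> set of scene-point ids observed in that frame
    seed_pair:   the two frames used for initialization (chosen by the caller)
    min_visible: hypothetical number of recovered points a frame must observe
    Returns (registered, recovered): frames in registration order, and the
    ids of the points whose positions count as recovered.
    """
    a, b = seed_pair
    registered = [a, b]
    # Points common to the seed pair are triangulated first.
    recovered = set(visibility[a] & visibility[b])

    pending = [f for f in visibility if f not in registered]
    progress = True
    while pending and progress:
        progress = False
        for f in list(pending):
            if len(visibility[f] & recovered) >= min_visible:
                # A real system would estimate the camera here (e.g. with EPnP).
                registered.append(f)
                pending.remove(f)
                # Points now seen by two registered frames can be triangulated.
                for g in registered[:-1]:
                    recovered |= visibility[f] & visibility[g]
                # A bundle adjustment pass would follow here.
                progress = True
    return registered, recovered


# Hypothetical visibility data for four frames.
visibility = {
    0: {1, 2, 3, 4},
    1: {2, 3, 4, 5, 6},
    2: {5, 6, 7},
    3: {3, 4, 5, 6},
}
order, recovered_points = registration_order(visibility, (0, 1), min_visible=2)
```

With this data, frame 2 cannot be registered in the first pass because it sees none of the seed points; it becomes registrable only after frame 3 adds points 5 and 6 to the scene, which is why the step has to be iterated.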
The prior art provides a stereo photography technology. First, the camera parameters and the three-dimensional points in a scene are recovered by using the structure from motion technology, and an image whose camera parameters are close to those of the browsing viewpoint is selected as a source image. Subsequently, a mesh is created for the source image according to the three-dimensional points in the scene, and a texture mapping relationship is established according to the projections of the three-dimensional points on the source image. Rendering is performed based on multiple frames of source images, and alpha blending is performed according to an angle relationship. Finally, any missing region is filled in.
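The text does not specify how the angle relationship determines the blending weights. As one possible reading, the sketch below weights each source image by the inverse of the angle between its viewing direction and the browsing viewing direction, then normalizes the weights so that they sum to 1. The inverse-angle rule, the function names, and the `eps` guard are all assumptions made for illustration.

```python
import math


def angle_between(u, v):
    """Angle in radians between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding error
    return math.acos(cos)


def blend_weights(view_dir, source_dirs, eps=1e-6):
    """Normalized alpha weights; a smaller angle gives a larger weight.

    The inverse-angle rule is an assumed choice, not taken from the text.
    """
    inv = [1.0 / (angle_between(view_dir, d) + eps) for d in source_dirs]
    total = sum(inv)
    return [w / total for w in inv]


def blend_pixel(colors, weights):
    """Weighted average of RGB colors rendered from each source image."""
    return tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))


# Hypothetical browsing direction and two source directions.
view_dir = (0.0, 0.0, 1.0)
near_dir = (math.sin(0.1), 0.0, math.cos(0.1))   # 0.1 rad away from the view
far_dir = (-math.sin(0.4), 0.0, math.cos(0.4))   # 0.4 rad away from the view

weights = blend_weights(view_dir, [near_dir, far_dir])
pixel = blend_pixel([(255.0, 0.0, 0.0), (0.0, 0.0, 255.0)], weights)
```

The source image closer in angle to the browsing viewpoint dominates the blended color, so the result changes smoothly as the viewer moves between source viewpoints.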
Some regions seen from a browsing view angle may have been occluded, and therefore invisible, when the images were captured. If these regions are filled in only after the final projection result is obtained in the real-time rendering phase, display efficiency is substantially affected, and the fluency of real-time rendering is substantially reduced.