Field of Invention
The present invention relates to a stereoscopic video generation method, and more particularly to a monocular-to-binocular stereoscopic video generation method based on a 3D convolutional neural network.
Description of Related Arts
Owing to their strong sense of realism and immersion, 3D films are very popular with audiences. In recent years, 3D films have taken a large share of the film market, accounting for 14% to 21% of total North American box office revenue between 2010 and 2014. In addition, with the emergence of the virtual reality (VR) market, head-mounted displays have created further demand for 3D content.
Producing films directly in 3D format requires higher equipment and production costs, so converting 2D films into 3D films has become a more attractive choice. A typical professional conversion process first creates a depth map manually for each frame, and then combines the original 2D video frame with the depth map to produce a stereoscopic image pair using a depth-map-based rendering algorithm. However, this process is still expensive and labor-intensive. High production costs have therefore become a major obstacle to the large-scale development of the 3D film industry.
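The step of combining a 2D frame with its depth map can be illustrated by a minimal sketch. The text does not specify the rendering algorithm, so the forward-warping scheme below is only an assumption: the function name render_right_view, the parameter max_disparity, and the convention that a depth value of 1.0 means "nearest to the camera" are all hypothetical choices made for illustration.

```python
def render_right_view(frame, depth, max_disparity=2):
    """Forward-warp one frame into a hypothetical right-eye view.

    frame: list of rows of pixel values.
    depth: same shape as frame, values in [0, 1]; 1.0 is assumed to be
           nearest to the camera (an illustrative convention).
    Returns (right_view, hole_mask). Pixels that receive no source pixel
    stay None and are flagged True in hole_mask.
    """
    height, width = len(frame), len(frame[0])
    right = [[None] * width for _ in range(height)]
    for y in range(height):
        # Draw far pixels first so that nearer pixels overwrite them.
        order = sorted(range(width), key=lambda x: depth[y][x])
        for x in order:
            # Nearer pixels shift further to the left in the right-eye view.
            shift = round(max_disparity * depth[y][x])
            target_x = x - shift
            if 0 <= target_x < width:
                right[y][target_x] = frame[y][x]
    holes = [[pixel is None for pixel in row] for row in right]
    return right, holes


# One-row example: a single near pixel (value 30) in front of a far background.
frame = [[10, 20, 30, 40, 50]]
depth = [[0, 0, 1, 0, 0]]
right, holes = render_right_view(frame, depth, max_disparity=2)
# right == [[30, 20, None, 40, 50]]
# holes == [[False, False, True, False, False]]
```

In this example the near pixel moves two positions to the left and covers part of the background, while the position it vacated has no source pixel. Such holes are the disoccluded regions that a rendering pipeline of this kind would still need to fill.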
In recent years, many researchers have sought to produce 3D video from a single video sequence using existing 3D model libraries and depth estimation techniques. At present, depth information can be obtained through either hardware or software. Hardware capable of acquiring depth information includes laser range finders and the KINECT 3D motion-sensing depth camera launched by MICROSOFT. Common software methods include multi-view stereo, photometric stereo, shape from shading, depth from defocus, and methods based on machine learning. Machine-learning methods are the ones mainly suited to converting 2D films into 3D films, and with the wide adoption of deep learning frameworks in recent years, deep learning has also been applied to depth estimation. For example, Eigen et al. first achieved end-to-end monocular image depth estimation with a multi-scale convolutional neural network (CNN). However, the size of the output is limited, so the predicted depth map is much smaller than the original input image: its height and width are each only 1/16 of those of the original image. Eigen and Fergus therefore later improved the network structure by first up-sampling the output of the original CNN, then concatenating it with the convolution result of the original input image, and finally passing the result through multiple convolutional layers, which deepens the neural network and yields a final depth map with higher resolution.
However, the depth maps obtained by the above methods still suffer from insufficiently clear contours and low resolution. In addition, filling in occluded regions and other parts that are not visible in the original view, a problem caused by the change of viewpoint, remains difficult.
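The data flow of the refinement described above (up-sample the coarse output, then join it with a convolution of the original image) can be sketched as follows. This is not the network of Eigen and Fergus: nearest-neighbor up-sampling, a single fixed 3x3 kernel, and all function names are assumptions chosen only to show how the two inputs are brought to the same resolution and stacked as channels before further convolutional layers would process them.

```python
def upsample_nearest(coarse, factor):
    """Enlarge a 2D map by repeating each value factor x factor times."""
    height, width = len(coarse) * factor, len(coarse[0]) * factor
    return [[coarse[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def conv3x3_same(image, kernel):
    """Apply one 3x3 filter with zero padding (output size equals input size)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


def stack_channels(*maps):
    """Join equally sized 2D maps into one H x W x C structure."""
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or len(m[0]) != width:
            raise ValueError("all maps must share the same height and width")
    return [[[m[y][x] for m in maps] for x in range(width)]
            for y in range(height)]


def refinement_input(image, coarse_depth, factor, kernel):
    """Build the stacked input that later convolutional layers would refine."""
    upsampled = upsample_nearest(coarse_depth, factor)  # coarse depth at full size
    features = conv3x3_same(image, kernel)              # stand-in image features
    return stack_channels(upsampled, features)


# Hypothetical example: a 2x2 image, a 1x1 coarse depth map, and an identity kernel.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
stacked = refinement_input([[1, 2], [3, 4]], [[0.5]], factor=2, kernel=identity)
# stacked == [[[0.5, 1.0], [0.5, 2.0]], [[0.5, 3.0], [0.5, 4.0]]]
```

Each output pixel now carries both the up-sampled coarse depth and a full-resolution image feature, which is what allows subsequent layers to sharpen the depth map using detail from the original image.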