Inferring 3D shape from a single perspective is a fundamental capability of human vision but is extremely challenging for computer vision. Limited by the nature of deep neural networks, previous methods usually represent a 3D shape as a volume or a point cloud, and it is non-trivial to convert these representations into the more ready-to-use mesh model.
“Multiple View Geometry in Computer Vision” (Hartley et al., Cambridge University Press, 2004) has disclosed 3D reconstruction based on multi-view geometry (MVG). “Structure-from-motion revisited” (Schoenberger et al., CVPR, 2016) has disclosed structure from motion (SfM) for large-scale high-quality reconstruction and simultaneous localization and mapping (SLAM) for navigation. Both documents are restricted firstly by the coverage that the multiple views can provide and secondly by the appearance of the object to be reconstructed. The former restriction means MVG cannot reconstruct unseen parts of the object, and thus it usually takes a long time to capture enough views for a good reconstruction; the latter restriction means MVG cannot reconstruct non-Lambertian (e.g. reflective or transparent) or textureless objects. These restrictions lead to the trend of resorting to learning based approaches.
Learning based approaches usually consider single or few images, as they largely rely on the shape priors that can be learned from data. Most recently, with the success of deep learning architectures and the release of large-scale 3D shape datasets such as “ShapeNet”, learning based approaches have achieved great progress. “Single-view reconstruction via joint analysis of image and shape collections” (Huang et al., ACM Trans. Graph. 34(4), 87:1-87:10, 2015) and “Estimating image depth using shape collections” (Su et al., ACM Trans. Graph. 33(4), 37:1-37:11, 2014) have disclosed how to retrieve shape components from a large dataset, assemble them, and deform the assembled shape to fit the observed image. However, shape retrieval from images is itself an ill-posed problem. To avoid this problem, “Category-specific object reconstruction from a single image” (Kar et al., CVPR, 2015) has disclosed how to learn a 3D deformable model for each object category and capture the shape variations in different images. However, the reconstruction is still limited to popular categories, and its reconstruction results usually lack details.
Another line of research is to directly learn 3D shapes from single images. Restricted by the prevalent grid-based deep learning architectures, most works output 3D voxels, which are usually of low resolution due to the memory constraints of a modern GPU. “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs” (Tatarchenko et al., ICCV, 2017) has disclosed an octree representation, which allows reconstruction of higher-resolution outputs with a limited memory budget. However, a 3D voxel grid is still not a popular shape representation in the game and movie industries. “A point set generation network for 3d object reconstruction from a single image” (Fan et al., CVPR, 2017) has disclosed generating point clouds from single images, thereby avoiding drawbacks of the voxel representation. However, the point cloud representation has no local connections between points, and thus the point positions have very large degrees of freedom. Consequently, the generated point cloud is usually not close to a surface and cannot be used to recover a 3D mesh directly.
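The memory constraint mentioned above can be made concrete with a back-of-envelope calculation. The following sketch is illustrative only (the function name, the assumed 4 bytes per voxel, and the chosen resolutions are assumptions, not taken from the cited works); it shows how dense voxel storage grows cubically with resolution, before even accounting for batch size or feature channels in a neural network.

```python
# Illustrative calculation: memory needed to store a single dense
# occupancy grid at several resolutions, assuming 4 bytes (float32)
# per voxel. These numbers are assumptions for illustration.

def dense_voxel_bytes(resolution, bytes_per_voxel=4):
    """Memory for one dense resolution^3 voxel grid, in bytes."""
    return resolution ** 3 * bytes_per_voxel

for res in (32, 64, 128, 256):
    mib = dense_voxel_bytes(res) / 2 ** 20
    print(f"{res}^3 grid: {mib:.1f} MiB")
# 256^3 already needs 64 MiB per sample, which is why grid-based
# networks are usually limited to low resolutions such as 32^3.
```

An octree, by contrast, subdivides only occupied regions, so its cost scales roughly with the surface area rather than the volume of the shape.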
“3d-r2n2: A unified approach for single and multi-view 3d object reconstruction” (Choy et al., ECCV, 2016) and “A point set generation network for 3d object reconstruction from a single image” (Fan et al., CVPR, 2017) have disclosed approaches for 3D shape generation from a single color image using deep learning techniques. In these two documents, with the use of convolutional layers on regular grids or multi-layer perceptrons, the estimated 3D shape, as the output of the neural network, is represented as either a volume (“Choy”) or a point cloud (“Fan”). However, both representations lose important surface details, and it is non-trivial to reconstruct from them a surface model, i.e. a mesh, which is more desirable for many real applications since it is lightweight, capable of modelling shape details, and easy to deform for animation, to name a few advantages.
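The distinction between the two representations discussed above can be sketched in a minimal example. The structures below are hypothetical and purely illustrative (the names and coordinates are not from the cited works): the same four points are stored once as a point cloud and once as a triangle mesh, and only the mesh records which points jointly form a surface.

```python
# The same four 3D positions, stored two ways.
points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

# Point cloud: positions only -- no notion of a surface between them,
# so each point's position is an independent degree of freedom.
point_cloud = list(points)

# Triangle mesh: the same positions plus faces (vertex-index triples)
# that fix local connectivity; here the four faces of a tetrahedron.
mesh = {
    "vertices": list(points),
    "faces": [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)],
}

# Deforming the mesh moves vertices while the faces (the surface
# topology) stay intact -- a property the point cloud lacks, and one
# reason meshes are easy to deform for animation.
mesh["vertices"] = [(x, y, z + 0.1) for (x, y, z) in mesh["vertices"]]
```

Because a point cloud carries no such face list, recovering a watertight surface from it requires an extra, often error-prone, meshing step.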
There is a need to provide a new and different mechanism for extracting a 3D triangular mesh from a single color image.