Understanding the 3D structure of the world from a series of 2D image observations, and in particular producing a 3D reconstruction from a sequence of 2D images, is an important undertaking in the field of computer vision. Creating a virtual 3D model from image data has applications in many fields, including but not limited to robotics, self-driving cars and augmented reality. Augmented reality involves projecting a virtual object onto the physical (real) world around us. Virtual objects may be created from real objects so that they can be projected into these spaces. In addition, for robotics, self-driving cars and augmented reality alike, it may be important to know the position of a device (phone, drone, car) in the world, and 3D models of the surroundings may be helpful for this purpose.
Existing approaches tend to fall into one of two categories: geometric methods and deep learning methods.
As discussed in the book by R. Hartley and A. Zisserman, “Multiple View Geometry in Computer Vision”, Cambridge University Press, 2003, existing geometric approaches are based on the principles of multi-view geometry. Given two or more images I1, I2, ..., IN taken at camera poses T1, T2, ..., TN ∈ SE(3) and pixel correspondences between those images, it is possible to triangulate the 3D positions of the image pixels. To determine these correspondences, it is possible to extract an image patch around a pixel and perform an exhaustive search along the corresponding epipolar line, finding the position of a similar patch in a different image. If this is done for each pixel, it is possible to produce a 2.5D depth image which contains depth information about each pixel, e.g. the distance from the camera of the scene point imaged at each pixel in a respective image.
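By way of illustration only, the triangulation step described above may be sketched as follows. The sketch uses the standard linear (DLT) triangulation method for two views and assumes that the pixel correspondence is already known and noise-free. The function names triangulate_dlt and project, the intrinsic matrix K and the camera placement are hypothetical choices made for this example and are not taken from the cited reference.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: corresponding pixel coordinates (u, v) in each image.
    """
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point X into pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical setup: identical intrinsics, second camera shifted 0.2 units along x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.1, -0.05, 2.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With exact correspondences, X_est recovers X_true up to numerical precision; with erroneous correspondences the recovered point, and hence its depth, will be wrong.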
To compute the complete 3D model, one must concatenate several 2.5D depth images together, or alternatively fuse them into a single volumetric model. In the case of the latter approach, the 3D space is split into a grid of voxels, and the content of each voxel is determined by the following rule: if at some point a voxel is observed at a distance closer than the corresponding pixel depth, it is considered to be part of free space. Otherwise, it is considered to be ‘occupied’.
However, this type of system is subject to erroneous pixel correspondences, which result in incorrect depth computations. Also, fusing the depth images into a single volumetric model in the manner described above is time-consuming and consumes considerable computing resources.
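By way of illustration only, the voxel labelling rule described above may be sketched as follows. The sketch assumes a pinhole camera model, world-to-camera poses given as 4x4 matrices, and depth images storing the distance along the optical axis. The function name label_voxels and the toy camera and depth values are hypothetical and chosen only for this example.

```python
import numpy as np


def label_voxels(voxel_centers, depth_maps, Ks, poses):
    """Return a boolean array that is True where a voxel is 'occupied'.

    voxel_centers: (N, 3) voxel centres in world coordinates.
    depth_maps:    list of (H, W) depth images.
    Ks:            list of 3x3 intrinsic matrices (assumed pinhole model).
    poses:         list of 4x4 world-to-camera transforms.
    """
    n = len(voxel_centers)
    free = np.zeros(n, dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((n, 1))])
    for depth, K, T in zip(depth_maps, Ks, poses):
        cam = (T @ homog.T).T[:, :3]      # voxel centres in the camera frame
        z = cam[:, 2]
        valid = z > 1e-9                  # only voxels in front of the camera
        u = np.full(n, -1, dtype=int)
        v = np.full(n, -1, dtype=int)
        pix = (K @ cam[valid].T).T
        u[valid] = np.round(pix[:, 0] / pix[:, 2]).astype(int)
        v[valid] = np.round(pix[:, 1] / pix[:, 2]).astype(int)
        h, w = depth.shape
        valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # A voxel seen closer than the measured pixel depth is free space.
        free[valid] |= z[valid] < depth[v[valid], u[valid]]
    # Every voxel never observed as free is treated as occupied.
    return ~free


# Toy example: one camera at the origin looking along +z at a wall 2.0 units away.
K = np.array([[10.0, 0.0, 5.0],
              [0.0, 10.0, 5.0],
              [0.0, 0.0, 1.0]])
depth = np.full((11, 11), 2.0)
voxels = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.5],
                   [0.0, 0.0, 2.0],
                   [0.0, 0.0, 2.5]])
occupied = label_voxels(voxels, [depth], [K], [np.eye(4)])
```

In this toy example the two voxels in front of the wall are labelled free, while the voxel at the wall and the voxel behind it remain occupied. Every voxel is tested against every depth image, which indicates why this form of fusion becomes costly as the grid resolution and the number of images grow.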
A second known approach is to use so-called ‘deep learning’, for instance as discussed in the article by C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3D-R2N2: A unified approach for single and multi-view 3D object reconstruction”, arXiv preprint arXiv:1604.00449, 2016, and the article by D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images”, arXiv preprint arXiv:1607.00662, 2016. In this approach, deep generative models are conditioned on the input images directly. The underlying principle in this approach is that, first, the individual 2D input images are compressed into a 1D feature vector, which summarises the content of the image. These 1D feature vectors are later passed as input to a long short-term memory (LSTM) network, the output of which is used to generate a model.
This approach is suitable for ‘imagining’ a missing part of a known object, but tends to lead to generalisation problems when modelling new, unknown observed objects.
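By way of illustration only, the data flow of the deep learning approach described above (compress each image into a 1D feature vector, pass the vectors in sequence to an LSTM, and decode the final state into a model) may be sketched as follows. The sketch is not the architecture of either cited article: the linear encoder and decoder, the dimensions and the random, untrained weights are all assumptions made purely to show the flow of data.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step.

    W, U and b stack the input, forget, output and candidate gates in that order.
    """
    n = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[:n])            # input gate
    f = sigmoid(z[n:2 * n])       # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])        # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
num_views, height, width = 3, 8, 8        # hypothetical input sizes
feat_dim, hidden_dim, grid = 16, 32, 4    # hypothetical model sizes

images = rng.random((num_views, height, width))

# Random, untrained weights standing in for learned parameters.
W_enc = rng.normal(0.0, 0.1, (feat_dim, height * width))
W = rng.normal(0.0, 0.1, (4 * hidden_dim, feat_dim))
U = rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
W_dec = rng.normal(0.0, 0.1, (grid ** 3, hidden_dim))

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for img in images:
    feat = np.tanh(W_enc @ img.ravel())   # 2D image -> 1D feature vector
    h, c = lstm_step(feat, h, c, W, U, b) # recurrent state accumulates the views

# Decode the final hidden state into per-voxel occupancy probabilities.
occupancy = sigmoid(W_dec @ h).reshape(grid, grid, grid)
```

Because all geometric information must pass through the fixed-length feature vectors and hidden state, the output depends heavily on what the learned weights have encoded about previously seen objects.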
Therefore, an approach which is less resource-intensive, less time-consuming, and which can provide a better model of unknown observed objects is required. The present disclosure describes such an approach.